What
Easy cluster by cluster Seurat FindMarkers
implementation
Why
Because Seurat’s FindMarkers
(which can be parallelized if you also load
library(Future)
and plan("multiprocess")
) runs with cluster N
against all
other clusters.
People kept asking me for “well what about cluster 23 vs 17” and I kept saying “uh, I haven’t run that because…”
How
This is being done a Mac. This may not work on a PC. Multicore stuffs are complicated. See https://github.com/HenrikBengtsson/future/issues/299
First we load libraries
library(future)
library(tictoc) # easy benchmarking
library(Seurat) # I'm assuming you are using version 3
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
plan(strategy = "multisession", workers = 4)
options(future.globals.maxSize = 500000 * 1024^2) # unnecessary for this toy example but future, by default, limits the size of the obj to 500mb. This examples to 500GB. Obviously you can get into trouble very quickly on real data as the object will get copied over and over for each parallel run.
# build in seurat object
pbmc_small
## An object of class Seurat
## 230 features across 80 samples within 1 assay
## Active assay: RNA (230 features)
## 2 dimensional reductions calculated: pca, tsne
Cluster M vs N
Okay now let’s do each cluster vs each cluster.
This seurat object has three clusters: 0, 1, 2
Instead of 0 vs 1,2 and 1 vs 0,2, and 2 vs 0,1, I want 0 vs 1, 0 vs 2, 1 vs 2.
# get cluster names
cluster_names <- Idents(pbmc_small) %>%
unique() %>%
as.character() %>%
sort()
The combn
function makes all of the pairwise comparisons.
calc_diff <- function(i, obj){
first <- combn(cluster_names,2)[1,i]
second <- combn(cluster_names,2)[2,i]
comparison = paste0(first,'__',second)
print(comparison)
out <- FindMarkers(obj, ident.1 = first, ident.2 = second, logfc.threshold = 0.5, verbose = FALSE)
out$comparison <- comparison
out
}
Use map
to run all the combinations
tic()
de_rna <- map(1:ncol(combn(cluster_names,2)), calc_diff, pbmc_small)
## [1] "0__1"
## [1] "0__2"
## [1] "1__2"
toc()
## 20.392 sec elapsed
List is returned, but very easy to make one obj
Use str
to show compasition of the de_rna
object.
We use map
again to moves the gene names from the row.names
to a column.
Then bind_rows
to collapse the list of data.frames in one data frame.
We pick 10 random rows to show
str(de_rna)
## List of 3
## $ :'data.frame': 195 obs. of 6 variables:
## ..$ p_val : num [1:195] 1.70e-11 4.47e-11 5.33e-11 7.00e-11 3.29e-10 ...
## ..$ avg_logFC : num [1:195] -2.54 -2.55 -4.04 -3.08 -3.33 ...
## ..$ pct.1 : num [1:195] 0.111 0.306 0.111 0.417 0.083 0.083 0.194 0.139 0.111 0.194 ...
## ..$ pct.2 : num [1:195] 1 1 0.96 1 0.88 0.88 0.88 0.96 0.84 1 ...
## ..$ p_val_adj : num [1:195] 3.92e-09 1.03e-08 1.23e-08 1.61e-08 7.56e-08 ...
## ..$ comparison: chr [1:195] "0__1" "0__1" "0__1" "0__1" ...
## $ :'data.frame': 161 obs. of 6 variables:
## ..$ p_val : num [1:161] 9.15e-11 2.46e-09 3.54e-09 6.11e-09 2.02e-08 ...
## ..$ avg_logFC : num [1:161] -4.45 -3.24 -4.14 -3.53 -3.32 ...
## ..$ pct.1 : num [1:161] 0.083 0.111 0.083 0.028 0.056 0.417 0.028 0.028 0 0 ...
## ..$ pct.2 : num [1:161] 0.947 0.895 0.842 0.789 0.789 0.895 0.632 0.632 0.526 0.474 ...
## ..$ p_val_adj : num [1:161] 2.10e-08 5.66e-07 8.14e-07 1.41e-06 4.65e-06 ...
## ..$ comparison: chr [1:161] "0__2" "0__2" "0__2" "0__2" ...
## $ :'data.frame': 166 obs. of 6 variables:
## ..$ p_val : num [1:166] 2.04e-08 4.00e-08 3.14e-07 3.17e-07 1.15e-06 ...
## ..$ avg_logFC : num [1:166] 5.35 2.4 4.6 2 1.79 ...
## ..$ pct.1 : num [1:166] 0.96 1 0.88 1 1 1 0.84 1 1 0.72 ...
## ..$ pct.2 : num [1:166] 0.053 0.263 0.105 0.316 0.263 0.421 0.158 0.474 0.421 0.053 ...
## ..$ p_val_adj : num [1:166] 4.69e-06 9.19e-06 7.23e-05 7.30e-05 2.64e-04 ...
## ..$ comparison: chr [1:166] "1__2" "1__2" "1__2" "1__2" ...
# Move Gene names from `rownames` to column
de_rna <- map(de_rna, rownames_to_column, 'Gene')
easy <- bind_rows(de_rna)
easy %>% sample_n(10)
## Gene p_val avg_logFC pct.1 pct.2 p_val_adj comparison
## 1 ITGA2B 0.04457624 2.6794766 0.222 0.040 1 0__1
## 2 TAGAP 0.52702955 -1.0622216 0.200 0.105 1 1__2
## 3 XCL2 0.10780247 -2.2367768 0.000 0.105 1 1__2
## 4 TSC22D1 0.14045345 3.3077263 0.111 0.000 1 0__2
## 5 PTCRA 0.06464320 3.3644245 0.167 0.000 1 0__2
## 6 BLOC1S4 0.74153653 1.6190737 0.167 0.160 1 0__1
## 7 TTC38 0.19899409 0.7938259 0.194 0.053 1 0__2
## 8 FAM96A 0.07588120 -1.0820997 0.083 0.263 1 0__2
## 9 SSR2 0.26271072 -0.7004999 0.520 0.579 1 1__2
## 10 GYPC 0.39392929 1.0464468 0.389 0.316 1 0__2
Bonus parallelization
FindMarkers
is already parallelized, but I found that parallelization across comparisons was about 15% faster with big Seurat objects. YMMV. This is stupid easy to implement with furrr
. The only substantial difference is that I do library(furrr)
and prepend future_
to map
. The .progress
option gives you an easy progress bar.
library(furrr)
tic()
de_rna <- future_map(1:ncol(combn(cluster_names,2)), calc_diff, pbmc_small, .progress = TRUE)
##
Progress: ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 100%
##
## [1] "0__1"
## [1] "0__2"
## [1] "1__2"
toc()
## 0.814 sec elapsed