Seurat FindMarker with Cluster N vs M

What

Easy cluster by cluster Seurat FindMarkers implementation

Why

Because Seurat’s FindMarkers (which can be parallelized if you also load library(Future) and plan("multiprocess")) runs with cluster N against all other clusters.

People kept asking me for “well what about cluster 23 vs 17” and I kept saying “uh, I haven’t run that because…”

How

This is being done a Mac. This may not work on a PC. Multicore stuffs are complicated. See https://github.com/HenrikBengtsson/future/issues/299

First we load libraries

library(future)
library(tictoc) # easy benchmarking
library(Seurat) # I'm assuming you are using version 3
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
plan(strategy = "multisession", workers = 4)
options(future.globals.maxSize = 500000 * 1024^2) # unnecessary for this toy example but future, by default, limits the size of the obj to 500mb. This examples to 500GB. Obviously you can get into trouble very quickly on real data as the object will get copied over and over for each parallel run.

# build in seurat object
pbmc_small
## An object of class Seurat 
## 230 features across 80 samples within 1 assay 
## Active assay: RNA (230 features)
##  2 dimensional reductions calculated: pca, tsne

Cluster M vs N

Okay now let’s do each cluster vs each cluster.

This seurat object has three clusters: 0, 1, 2

Instead of 0 vs 1,2 and 1 vs 0,2, and 2 vs 0,1, I want 0 vs 1, 0 vs 2, 1 vs 2.

# get cluster names
cluster_names <- Idents(pbmc_small) %>%  
  unique() %>% 
  as.character() %>% 
  sort()

The combn function makes all of the pairwise comparisons.

calc_diff <- function(i, obj){
  first <- combn(cluster_names,2)[1,i]
  second <- combn(cluster_names,2)[2,i]
  comparison = paste0(first,'__',second)
  print(comparison)
  out <- FindMarkers(obj, ident.1 = first, ident.2 = second, logfc.threshold = 0.5, verbose = FALSE)
  out$comparison <- comparison
  out
}

Use map to run all the combinations

tic()
de_rna <- map(1:ncol(combn(cluster_names,2)), calc_diff, pbmc_small)
## [1] "0__1"
## [1] "0__2"
## [1] "1__2"
toc()
## 20.392 sec elapsed

List is returned, but very easy to make one obj

Use str to show compasition of the de_rna object.

We use map again to moves the gene names from the row.names to a column.

Then bind_rows to collapse the list of data.frames in one data frame.

We pick 10 random rows to show

str(de_rna)
## List of 3
##  $ :'data.frame':    195 obs. of  6 variables:
##   ..$ p_val     : num [1:195] 1.70e-11 4.47e-11 5.33e-11 7.00e-11 3.29e-10 ...
##   ..$ avg_logFC : num [1:195] -2.54 -2.55 -4.04 -3.08 -3.33 ...
##   ..$ pct.1     : num [1:195] 0.111 0.306 0.111 0.417 0.083 0.083 0.194 0.139 0.111 0.194 ...
##   ..$ pct.2     : num [1:195] 1 1 0.96 1 0.88 0.88 0.88 0.96 0.84 1 ...
##   ..$ p_val_adj : num [1:195] 3.92e-09 1.03e-08 1.23e-08 1.61e-08 7.56e-08 ...
##   ..$ comparison: chr [1:195] "0__1" "0__1" "0__1" "0__1" ...
##  $ :'data.frame':    161 obs. of  6 variables:
##   ..$ p_val     : num [1:161] 9.15e-11 2.46e-09 3.54e-09 6.11e-09 2.02e-08 ...
##   ..$ avg_logFC : num [1:161] -4.45 -3.24 -4.14 -3.53 -3.32 ...
##   ..$ pct.1     : num [1:161] 0.083 0.111 0.083 0.028 0.056 0.417 0.028 0.028 0 0 ...
##   ..$ pct.2     : num [1:161] 0.947 0.895 0.842 0.789 0.789 0.895 0.632 0.632 0.526 0.474 ...
##   ..$ p_val_adj : num [1:161] 2.10e-08 5.66e-07 8.14e-07 1.41e-06 4.65e-06 ...
##   ..$ comparison: chr [1:161] "0__2" "0__2" "0__2" "0__2" ...
##  $ :'data.frame':    166 obs. of  6 variables:
##   ..$ p_val     : num [1:166] 2.04e-08 4.00e-08 3.14e-07 3.17e-07 1.15e-06 ...
##   ..$ avg_logFC : num [1:166] 5.35 2.4 4.6 2 1.79 ...
##   ..$ pct.1     : num [1:166] 0.96 1 0.88 1 1 1 0.84 1 1 0.72 ...
##   ..$ pct.2     : num [1:166] 0.053 0.263 0.105 0.316 0.263 0.421 0.158 0.474 0.421 0.053 ...
##   ..$ p_val_adj : num [1:166] 4.69e-06 9.19e-06 7.23e-05 7.30e-05 2.64e-04 ...
##   ..$ comparison: chr [1:166] "1__2" "1__2" "1__2" "1__2" ...
# Move Gene names from `rownames` to column
de_rna <- map(de_rna, rownames_to_column, 'Gene')
easy <- bind_rows(de_rna)
easy %>% sample_n(10)
##       Gene      p_val  avg_logFC pct.1 pct.2 p_val_adj comparison
## 1   ITGA2B 0.04457624  2.6794766 0.222 0.040         1       0__1
## 2    TAGAP 0.52702955 -1.0622216 0.200 0.105         1       1__2
## 3     XCL2 0.10780247 -2.2367768 0.000 0.105         1       1__2
## 4  TSC22D1 0.14045345  3.3077263 0.111 0.000         1       0__2
## 5    PTCRA 0.06464320  3.3644245 0.167 0.000         1       0__2
## 6  BLOC1S4 0.74153653  1.6190737 0.167 0.160         1       0__1
## 7    TTC38 0.19899409  0.7938259 0.194 0.053         1       0__2
## 8   FAM96A 0.07588120 -1.0820997 0.083 0.263         1       0__2
## 9     SSR2 0.26271072 -0.7004999 0.520 0.579         1       1__2
## 10    GYPC 0.39392929  1.0464468 0.389 0.316         1       0__2

Bonus parallelization

FindMarkers is already parallelized, but I found that parallelization across comparisons was about 15% faster with big Seurat objects. YMMV. This is stupid easy to implement with furrr. The only substantial difference is that I do library(furrr) and prepend future_ to map. The .progress option gives you an easy progress bar.

library(furrr)
tic()
de_rna <- future_map(1:ncol(combn(cluster_names,2)), calc_diff, pbmc_small, .progress = TRUE)
## 
 Progress: ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 100%
## 
## [1] "0__1"
## [1] "0__2"
## [1] "1__2"
toc()
## 0.814 sec elapsed

Related

comments powered by Disqus