Let’s Plot 3: Base pair resolution NGS (exome) coverage plots - Part 2

Feb 8, 2018 9 min read bioinformatics, Let’s Plot, R

Introduction
Call mosdepth on bam to calculate bp-specific read depth
Intersect base pair depth info with transcript and exon number
Now it’s R time!

Introduction

This is a barebones (but detailed enough, I hope) discussion of how to take a bam file, extract base pair resolution coverage data, then finagle the data into coverage plots by gene and exon. No data will be given for the below code. I’m not sharing a bam file. Also, not much point to sharing the bed, HGNC name, and gtf file, as there’s a decent chance they won’t work for your bam.

Call mosdepth on bam to calculate bp-specific read depth

mosdepth, by default, will generate base pair resolution coverage across the entire genome. Another fantastic tool from Brent Peterson and Aaron Quinlan (some point I’ll do a gushy post on all of the useful tools Quinlan and company have made).

mosdepth will run very quickly (minutes instead of hours), compared to bedtools genomecov.

# bash 
sinteractive --cpus-per-task 16
module load mosdepth
mosdepth -t 16 41001412010527 41001412010527_realigned_recal.bam &

Intersect base pair depth info with transcript and exon number

The intersect is to select regions overlapping exons and to label them with the transcript name and exon number present in gencode_genes_v27lift37.codingExons.ensembl.bed.gz. See the code below for how to make your own.

# bash

# gencode_genes_v27lift37.codingExons.bed was downloaded from the UCSC table browser from genocde gene v27lift37 and 'coding exons' with 0 padding were selected as the output for the bed file
# my https://github.com/davemcg/ChromosomeMappings/ convert_notation.py script was then used to convert the UCSC notation in ensembl notation, which my bam uses
# files in biowulf2:/data/mcgaugheyd/genomes/GRCh37/
module load bedtools
~/git/ChromosomeMappings/convert_notation.py -c ~/git/ChromosomeMappings/GRCh37_gencode2ensembl.txt -f gencode_genes_v27lift37.codingExons.bed | sort -k1,1 -k2,2n | gzip > gencode_genes_v27lift37.codingExons.ensembl.bed.gz
bedtools intersect -wa -wb -a 41001412010527.per-base.bed.gz -b /data/mcgaugheyd/genomes/GRCh37/gencode_genes_v27lift37.codingExons.ensembl.bed.gz | bgzip  > 41001412010527.per-base.labeled.bed.gz &

Now it’s R time!

Prepare Metadata

You’ll need HGNC <-> Ensembl Transcript converter and the Gencode GTF

The first file is used to match gene ‘names’ with ensembl transcript ID

The second file is used to semi-accurately pick the ‘canonical’ transcript for a gene (pick the appris transcript, then the longest)

library(data.table)
library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()   masks data.table::between()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()

library(cowplot)

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggplot2':
## 
##     ggsave

library(stringr)
metadata <- fread('gzcat ~/GenomicData/gencode.v27lift37.metadata.HGNC.gz', header=F)
colnames(metadata) <- c('Transcript','Name')
gencode <- fread('gzcat ~/GenomicData/gencode.v27lift37.basic.annotation.gtf.gz', header=F, skip=4)
gencode2 <- gencode %>% filter(V3=='transcript') %>% 
  filter(grepl('appris_principal', V9)) %>% 
  rowwise() %>%  
  mutate(Transcript = str_extract(V9,'ENST\\d{11}\\.\\d+'), # use regex to graph ensembl transcript 
         Gene = str_extract(V9, 'gene_name\\s\\".*?;'), # and the gene name
         Size=V5-V4) %>% 
  separate(Gene, c('skip','Name','skip2'),'\"') %>% # now you have to remove the 'gene name' part  
  select(-skip, -skip2) %>% 
  group_by(Name) %>% top_n(1)

## Selecting by Size

Load mosdepth / bedtools intersect data and prep

Label coverage chunks with the depth of their coverage with case_when and extract the transcript name and exon number with a bunch of separate commands

depth_data <- fread('gzcat ~/Desktop/41001412010527.per-base.labeled.bed.gz')
dd_processed <- depth_data %>% 
  separate(V8, c('Transcript','Rest'), '_cds_') %>% 
  separate(Rest, c('Before','Stuff'),'_chr') %>% 
  separate(Before, c('Exon Number','Num2'),sep='_') %>% 
  mutate(Depth = case_when(V4 < 10 ~ '< 10 Reads', 
                           V4 < 20 ~ '< 20 Reads', 
                           TRUE ~ '>= 20 Reads')) %>% 
  mutate(Depth=factor(Depth, levels=c('< 10 Reads','< 20 Reads','>= 20 Reads'))) %>% 
  mutate(Transcript=case_when(grepl('_',Transcript) ~ gsub('_.','',Transcript), 
                              TRUE ~ Transcript)) %>% 
  select(Chr=V1, Start=V2, End=V3, Read_Depth=V4, Transcript, Strand=V10, Depth, `Exon Number`, ExonStart=V6, ExonEnd=V7)

dd_processed <- left_join(dd_processed, metadata, by=c('Transcript'))

Plot Maker, version 1

Faceted by exon. One plot per gene and using cowplot to glue them together

genes <- c('PAX6','ABCA4','NRL','CRX','RPGR')
transcripts = gencode2 %>% filter(Name %in% genes) %>% pull(Transcript)

# set a custom color that will work even if a category is missing
scale_colour_custom <- function(...){
  ggplot2:::manual_scale('colour', 
                         values = setNames(c('darkred', 'red', 'black'),
                                           c('< 10 Reads','< 20 Reads','>= 20 Reads')), 
                         ...)
}

plot_maker <- function(tx){
  num_of_exons <- dd_processed %>% filter(Transcript==tx) %>% pull(`Exon Number`) %>% as.numeric() %>% max()
  gene_name <-  dd_processed %>% filter(Transcript==tx) %>% pull(Name) %>% unique()
  # expand to create a row for each sequence and fill in previous values
  dd_processed %>% filter(Transcript==tx) %>% group_by(`Exon Number`) %>% 
    expand(Start=full_seq(c(Start,End),1)) %>% 
    left_join(.,  dd_processed %>% filter(Transcript==tx)) %>% # create one row per base position, grouped by Exon Number https://stackoverflow.com/questions/42866119/fill-missing-values-in-data-frame-using-dplyr-complete-within-groups
    fill(Chr:Name) %>% # fill missing values https://stackoverflow.com/questions/40040834/r-replace-na-with-previous-or-next-value-by-group-using-dplyr
    ungroup() %>% # drop the exon number grouping, so I can mutate below
    mutate(`Exon Number`= factor(`Exon Number`,levels=0:num_of_exons)) %>% # Ah, reordering. I need it to be a factor, but then I have to explicitly give the order   
    mutate(Depth = factor(Depth, levels=c('< 10 Reads','< 20 Reads','>= 20 Reads'))) %>%  # create three categories for coloring
    ggplot(aes(x=Start, xend=End, y=Read_Depth, yend=Read_Depth, colour=Depth)) + 
    facet_wrap(~`Exon Number`, scales = 'free_x', nrow=1, strip.position = 'bottom') + 
    geom_point(size=0.1) + theme_minimal()+ scale_colour_custom() +  # use my custom color set above for my three categories
    theme(axis.text.x=element_blank(), 
          axis.ticks.x = element_blank(), 
          panel.grid.minor = element_blank(), 
          panel.grid.major.x = element_blank(),
          legend.position = 'none') + 
    ylab('Depth') + 
    xlab(paste0(gene_name[1]))
}

# little for loop to roll through the transcripts and make plot, storing in a list
plots <- list()
for (i in transcripts){
  plots[[i]] <- plot_maker(i)
}

## Joining, by = c("Exon Number", "Start")
## Joining, by = c("Exon Number", "Start")
## Joining, by = c("Exon Number", "Start")
## Joining, by = c("Exon Number", "Start")
## Joining, by = c("Exon Number", "Start")

legend <- get_legend(plots[[names(plots)[1]]] + theme(legend.position='right'))
# cowplot can take a list and glue the plots together
# I'm commenting out the below line because blogdown makes it too damn small. You need to run it to make the plot
# plot_grid(plot_grid(plotlist = plots, ncol=1, hjust=-2), legend, ncol=2, rel_widths = c(5,0.5))

Version 2

A bit tighter. Recalculates coordinates to glue all of the exons together in one plot. I can facet by gene. A bit harder to read, but is more accurate as the exons and gene lengths are proportional

genes <- c('PAX6','ABCA4','NRL','CRX','RPGR')
tx = gencode2 %>% filter(Name %in% genes) %>% pull(Transcript)

dd_expanded <- dd_processed %>% filter(Transcript %in% tx) %>% group_by(Transcript, `Exon Number`) %>% 
  expand(Start=full_seq(c(Start,End),1)) %>% 
  left_join(.,  dd_processed %>% filter(Transcript %in% tx)) %>% # create one row per base position, grouped by Exon Number https://stackoverflow.com/questions/42866119/fill-missing-values-in-data-frame-using-dplyr-complete-within-groups
  fill(Chr:Name) # fill missing values https://stackoverflow.com/questions/40040834/r-replace-na-with-previous-or-next-value-by-group-using-dplyr

## Joining, by = c("Transcript", "Exon Number", "Start")

dd_expanded <- dd_expanded %>% group_by(Name) %>% mutate(Pos = 1:n())
even_odds_marking <- dd_expanded %>% group_by(Name, `Exon Number`) %>% summarise(Start=min(Pos), End=max(Pos)) %>% mutate(Exon = case_when(as.numeric(`Exon Number`) %% 2 == 0 ~ 'even', TRUE ~ 'odd'))
plot_data<-bind_rows(dd_expanded,even_odds_marking)

ggplot() + 
  geom_point(data =  plot_data %>% filter(is.na(Exon)), aes(x=Pos, y=Read_Depth, colour=Depth),size=0.1)  + 
  facet_wrap(~Name, ncol=1) + 
  geom_rect(data = plot_data %>% filter(!is.na(Exon)), aes(xmin=Start, xmax=End, ymin=-Inf, ymax=Inf, fill=Exon)) +  
  scale_fill_manual(values = alpha(c("gray", "white"), .3)) +  
  scale_colour_custom() +
  theme_minimal() +  
  theme(axis.text.x=element_blank(), 
        axis.ticks.x = element_blank(),
        panel.grid.minor = element_blank(), 
        panel.grid.major.x = element_blank())+
  guides(fill=FALSE) + 
  ylab('Read Depth')

sessionInfo()

devtools::session_info()

## Session info -------------------------------------------------------------

##  setting  value                       
##  version  R version 3.5.0 (2018-04-23)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2018-07-16

## Packages -----------------------------------------------------------------

##  package    * version date       source                           
##  assertthat   0.2.0   2017-04-11 CRAN (R 3.5.0)                   
##  backports    1.1.2   2017-12-13 CRAN (R 3.5.0)                   
##  base       * 3.5.0   2018-04-24 local                            
##  bindr        0.1.1   2018-03-13 CRAN (R 3.5.0)                   
##  bindrcpp   * 0.2.2   2018-03-29 CRAN (R 3.5.0)                   
##  blogdown     0.8.1   2018-07-16 Github (rstudio/blogdown@d54c39a)
##  bookdown     0.7     2018-02-18 CRAN (R 3.5.0)                   
##  broom        0.4.4   2018-03-29 CRAN (R 3.5.0)                   
##  cellranger   1.1.0   2016-07-27 CRAN (R 3.5.0)                   
##  cli          1.0.0   2017-11-05 CRAN (R 3.5.0)                   
##  colorspace   1.3-2   2016-12-14 CRAN (R 3.5.0)                   
##  compiler     3.5.0   2018-04-24 local                            
##  cowplot    * 0.9.2   2017-12-17 CRAN (R 3.5.0)                   
##  crayon       1.3.4   2017-09-16 CRAN (R 3.5.0)                   
##  data.table * 1.11.4  2018-05-27 cran (@1.11.4)                   
##  datasets   * 3.5.0   2018-04-24 local                            
##  devtools     1.13.5  2018-02-18 CRAN (R 3.5.0)                   
##  digest       0.6.15  2018-01-28 CRAN (R 3.5.0)                   
##  dplyr      * 0.7.6   2018-06-29 cran (@0.7.6)                    
##  evaluate     0.10.1  2017-06-24 CRAN (R 3.5.0)                   
##  forcats    * 0.3.0   2018-02-19 CRAN (R 3.5.0)                   
##  foreign      0.8-70  2017-11-28 CRAN (R 3.5.0)                   
##  ggplot2    * 3.0.0   2018-07-03 cran (@3.0.0)                    
##  glue         1.2.0   2017-10-29 CRAN (R 3.5.0)                   
##  graphics   * 3.5.0   2018-04-24 local                            
##  grDevices  * 3.5.0   2018-04-24 local                            
##  grid         3.5.0   2018-04-24 local                            
##  gtable       0.2.0   2016-02-26 CRAN (R 3.5.0)                   
##  haven        1.1.1   2018-01-18 CRAN (R 3.5.0)                   
##  hms          0.4.2   2018-03-10 CRAN (R 3.5.0)                   
##  htmltools    0.3.6   2017-04-28 CRAN (R 3.5.0)                   
##  httr         1.3.1   2017-08-20 CRAN (R 3.5.0)                   
##  jsonlite     1.5     2017-06-01 CRAN (R 3.5.0)                   
##  knitr        1.20    2018-02-20 CRAN (R 3.5.0)                   
##  labeling     0.3     2014-08-23 CRAN (R 3.5.0)                   
##  lattice      0.20-35 2017-03-25 CRAN (R 3.5.0)                   
##  lazyeval     0.2.1   2017-10-29 CRAN (R 3.5.0)                   
##  lubridate    1.7.4   2018-04-11 CRAN (R 3.5.0)                   
##  magrittr     1.5     2014-11-22 CRAN (R 3.5.0)                   
##  memoise      1.1.0   2017-04-21 CRAN (R 3.5.0)                   
##  methods    * 3.5.0   2018-04-24 local                            
##  mnormt       1.5-5   2016-10-15 CRAN (R 3.5.0)                   
##  modelr       0.1.2   2018-05-11 CRAN (R 3.5.0)                   
##  munsell      0.4.3   2016-02-13 CRAN (R 3.5.0)                   
##  nlme         3.1-137 2018-04-07 CRAN (R 3.5.0)                   
##  parallel     3.5.0   2018-04-24 local                            
##  pillar       1.2.3   2018-05-25 CRAN (R 3.5.0)                   
##  pkgconfig    2.0.1   2017-03-21 CRAN (R 3.5.0)                   
##  plyr         1.8.4   2016-06-08 CRAN (R 3.5.0)                   
##  psych        1.8.4   2018-05-06 CRAN (R 3.5.0)                   
##  purrr      * 0.2.4   2017-10-18 CRAN (R 3.5.0)                   
##  R6           2.2.2   2017-06-17 CRAN (R 3.5.0)                   
##  Rcpp         0.12.17 2018-05-18 CRAN (R 3.5.0)                   
##  readr      * 1.1.1   2017-05-16 CRAN (R 3.5.0)                   
##  readxl       1.1.0   2018-04-20 CRAN (R 3.5.0)                   
##  reshape2     1.4.3   2017-12-11 CRAN (R 3.5.0)                   
##  rlang        0.2.1   2018-05-30 CRAN (R 3.5.0)                   
##  rmarkdown    1.10    2018-06-11 cran (@1.10)                     
##  rprojroot    1.3-2   2018-01-03 CRAN (R 3.5.0)                   
##  rstudioapi   0.7     2017-09-07 CRAN (R 3.5.0)                   
##  rvest        0.3.2   2016-06-17 CRAN (R 3.5.0)                   
##  scales       0.5.0   2017-08-24 CRAN (R 3.5.0)                   
##  stats      * 3.5.0   2018-04-24 local                            
##  stringi      1.2.2   2018-05-02 CRAN (R 3.5.0)                   
##  stringr    * 1.3.1   2018-05-10 CRAN (R 3.5.0)                   
##  tibble     * 1.4.2   2018-01-22 CRAN (R 3.5.0)                   
##  tidyr      * 0.8.1   2018-05-18 CRAN (R 3.5.0)                   
##  tidyselect   0.2.4   2018-02-26 CRAN (R 3.5.0)                   
##  tidyverse  * 1.2.1   2017-11-14 CRAN (R 3.5.0)                   
##  tools        3.5.0   2018-04-24 local                            
##  utils      * 3.5.0   2018-04-24 local                            
##  withr        2.1.2   2018-03-15 CRAN (R 3.5.0)                   
##  xfun         0.3     2018-07-06 cran (@0.3)                      
##  xml2         1.2.0   2018-01-24 CRAN (R 3.5.0)                   
##  yaml         2.1.19  2018-05-01 CRAN (R 3.5.0)

bioinformatics ggplot2 Let's Plot R quinlanverse