Day 3
Very sparse and poorly written notes covering #GI2018.
Typos everywhere. Things may change dramatically over time as I scan back through notes.
I’ve tried to respect #notwitter. Will be updated periodically.
Speaker (Lab | Group)
Rafael Irizarry Keynote
Understanding variability in high throughput data
Typical workflow raw data -> pre-processing -> analysis -> discovery (false?????)
“When you find an unexpected result, be skeptical, check for systematic errors”
“Always, look at the data”
“Dynamite plots needs to die” (barplot + SE/SD)
- Plug for a blog post by me: http://davemcg.github.io/post/let-s-plot-4-r-vs-excel/
- Show the data (boxplot + individual points)
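Quick toy illustration of why the barplot + error bar hides things: two made-up groups (my numbers, not from the talk) with exactly the same mean and SD but obviously different shapes. A minimal sketch using only the standard library:

```python
import statistics

def dynamite_summary(values):
    """What a barplot with error bars shows: just a mean and a spread."""
    return round(statistics.mean(values), 6), round(statistics.pstdev(values), 6)

# Made-up measurements for two groups (hypothetical, for illustration only).
unimodal = [2, 4, 4, 4, 5, 5, 7, 9]
bimodal = [3, 3, 3, 3, 7, 7, 7, 7]

# The two "dynamite" summaries are identical...
print(dynamite_summary(unimodal), dynamite_summary(bimodal))
# ...but the individual points tell two different stories.
print(sorted(unimodal))
print(sorted(bimodal))
```

Plotting the individual points (or at least a boxplot) would separate these two groups immediately.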
Showing batch effect example where huge diff in gene expression between groups is actually derived from technical issues (two groups were effectively done at different times)
Lin et al. 2014 (notorious paper comparing tx between human and mouse that claimed different tissues within a species are more similar to each other than the same tissue is across species)
Yoav Gilad quickly pointed out that the mouse and human were done on different sequencers
So they re-did on same sequencer machine….got similar result
Not a new “result” (Yanai 2004 had similar conclusion with microarray)
Rafael pointing out that high genes are high and low are low…always, so correlation will always be high.
Once you fix probe bias (microarray), then tissues cluster together.
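Toy simulation of the "high genes are high, low genes are low" point. Everything here is an assumption of mine (simulated log-expression, arbitrary variances), not the Lin or Yanai data, and per-gene centering just stands in for removing a shared gene-level effect; it is not the actual probe-bias correction from the talk.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate(n_genes=2000, seed=0):
    """Toy log-expression: big per-gene baseline + small tissue effect + noise."""
    rng = random.Random(seed)
    baseline = [rng.gauss(5.0, 3.0) for _ in range(n_genes)]  # shared by every sample
    tissue_a = [rng.gauss(0.0, 0.5) for _ in range(n_genes)]  # tissue-specific shifts
    tissue_b = [rng.gauss(0.0, 0.5) for _ in range(n_genes)]

    def sample(effect):
        return [b + e + rng.gauss(0.0, 0.2) for b, e in zip(baseline, effect)]

    return {"a1": sample(tissue_a), "a2": sample(tissue_a),
            "b1": sample(tissue_b), "b2": sample(tissue_b)}

def center_genes(samples):
    """Subtract each gene's mean across samples, removing the shared baseline."""
    names = list(samples)
    n_genes = len(samples[names[0]])
    means = [sum(samples[s][g] for s in names) / len(names) for g in range(n_genes)]
    return {s: [v - m for v, m in zip(samples[s], means)] for s in names}

raw = simulate()
centered = center_genes(raw)
raw_cross = pearson(raw["a1"], raw["b1"])            # different tissues, raw values
cen_same = pearson(centered["a1"], centered["a2"])   # same tissue, centered
cen_cross = pearson(centered["a1"], centered["b1"])  # different tissues, centered
print(f"raw cross-tissue r = {raw_cross:.3f}")
print(f"centered same-tissue r = {cen_same:.3f}, cross-tissue r = {cen_cross:.3f}")
```

The raw cross-tissue correlation is high purely because the per-gene baseline dominates; once that is removed, only the same-tissue pair stays positively correlated.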
But what about Lin RNA-seq?
A lot is explained by:
- number of transcripts
- GC content
Single Cell RNA-seq
- trying to disentangle batch effect from tumor type
- actually proportion of zeros is driving variability
- this is probably not biology
Do DNAm(ethylation) changes drive gene expression changes?
Discussing Ford et al. 2017 bioRxiv
“Distinguishing consistent differences from random ones is challenging”
Talking about how technical variability and biological variability are different and must be considered when designing experiment (moar replicates!)
Do we trust individual measurements?
- when you do lots of tests, individual tests can be wrong
- region test was done for CpG methylation (DMR test Jaffe 2012)
- can also shuffle data (shuffle samples, not points) and see how often a diff can occur by chance
- region + bootstrapping powerful approach (plus looking at replicates individually!!!)
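A minimal sketch of the "shuffle samples, not points" idea as a label-permutation test. Assumptions on my part: each number is one sample's summary over a region (e.g. mean methylation), the values are invented, and the test statistic is a simple difference in group means rather than whatever the DMR method actually uses.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_pvalue(group_a, group_b, n_perm=2000, seed=1):
    """Shuffle sample labels and ask how often a difference at least as
    large as the observed one shows up by chance."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign whole samples to groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one so p is never exactly 0

# Hypothetical per-sample region means (not real data).
cases = [0.62, 0.66, 0.59, 0.70, 0.64, 0.68]
controls = [0.41, 0.45, 0.39, 0.47, 0.43, 0.40]
print(permutation_pvalue(cases, controls))
```

Shuffling whole samples keeps the within-sample correlation between neighboring CpGs intact, which is the reason to shuffle samples rather than individual points.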
Session 4: Transcriptomics, Alternative Splicing and Gene Predictions
Mark Robinson (Robinson)
On the analysis of long-read sequencing data for gene expression
Illumina <-> ONT (Oxford Nanopore)
- cheap <-> expensive
- short fragments <-> full length
- lower error rates <-> higher error rates
DataSet
- WT vs Srpk1-KO
- ONT (various preps) and Illumina
- 1M reads / sample (ONT); 30M reads / sample (Illumina)
25% mismatch with ONT vs <1% for Illumina (but they still map)
ONT doesn’t look useful for variation in sequence (way too noisy)
Do we actually get full length transcripts?
- for longer ones…not really
- showing how a longer transcript has worse coverage with ONT
Quantification ONT <-> Illumina
- looks pretty good for protein coding genes
- more discrepancy at the tx level (still OK)
ONT
- minimap2 + salmon current choice for tooling
ONT direct RNA or ONT cDNA???
- look pretty comparable
- direct RNA requires crazy input amounts, probably not worth it unless you are looking at RNA base modifications
- ONT has new stranded cDNA protocol / kit
Diff expression analysis
- once you account for way lower depth (again 1M vs 30M) for ONT, roughly comparable
Definitely not time to jump over for more standard DE type RNAseq tests
Hagen Tilgner (Tilgner)
Single-cell isoform RNA sequencing (ScISOr-Seq) across thousands of cells reveals isoforms of cerebellar cell types.
Need long-read RNA seq to confidently ID isoforms
ScISOr-Seq
- 10X uses 3’ sequencing
- Add PacBio ISO-Seq and find the 10X barcode
- blend 10X and PacBio data
Barcode ID (long read data can be noisy…)
- 99.99% specificity with PacBio
- Only 58% of molecules have barcode (74% in Illumina)
- not catastrophic, more annoying
It works
- validate some junctions with MS/MS
Plug for isoformAtlas.com
See some compelling cell type specific isoform usage
Nikka Keivanfar (Church, 10X)
Bootstrapping Biology: Quick and easy de novo genome assembly to enable single cell gene expression analysis
Linked-Reads + Supernova + Cactus/CAT
Supernova does assembly (with phase)
Supernova v2 can get N50 over 100+ kb for many species (Human, Hummingbird, Drosophila)
- Zebrafish is tougher (only around 20kb)
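For reference, N50 is the contig (or scaffold) length such that pieces at least that long cover half of the assembly. A small sketch with made-up lengths; Supernova reports several N50 flavors and I did not note which one was on the slide.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    ordered = sorted(contig_lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")

# Hypothetical contig lengths in kb.
print(n50([100, 80, 50, 30, 20, 10, 10]))
```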
CAT
- Comparative Annotation Toolkit
- “provides a flexible way to simultaneously annotate entire clades and identify orthology relationships”
- doi.org/10.1101/231118
- https://genome.cshlp.org/content/28/7/1029
“Blackjack” (Donkey) blood sequenced with 10X
- made assembly with Supernova
- better metrics than other donkey genomes
Used assembly (+CAT to get gene info) for scRNA-seq of the blood
- labels major blood types (T cells, B cells, NK, monocytes, plasma, etc)
Koen Van den Berge (Clement)
Discrete and continuous differential expression analysis for single-cell RNA-seq data
scRNA-seq data is WAY noisier than bulk RNA-seq (at least cell to cell)
Bulk RNAseq generally use negative binomial models (edgeR, DESeq2)
Sort of surprisingly, these bulk RNAseq methods seem to work well in scRNA
But zeros... showing simulations where adding zeros gives weird dispersion patterns
- Why not ZINB-WaVE??
In real data also see some odd dispersion
Propose zero-inflated negative binomial (ZINB) distribution
- Oh, this is the ZINB-WaVE author
Downweighting excess zeros helps reduce variance and improve power
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1406-4
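My understanding of the downweighting, as a rough sketch: under a ZINB model each count gets a weight equal to the posterior probability that it came from the NB component rather than from the excess-zero component, and those weights are then fed to the bulk tools. The parameter values below (mu, theta, pi) are hypothetical; in practice they would be estimated (e.g. with ZINB-WaVE), not typed in by hand.

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu and dispersion theta
    (variance = mu + mu**2 / theta)."""
    log_p = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def zinb_pmf(y, mu, theta, pi):
    """Zero-inflated NB: with probability pi emit an excess zero, else draw from NB."""
    point_mass = 1.0 if y == 0 else 0.0
    return pi * point_mass + (1.0 - pi) * nb_pmf(y, mu, theta)

def nb_weight(y, mu, theta, pi):
    """Posterior probability that an observed count came from the NB component."""
    return (1.0 - pi) * nb_pmf(y, mu, theta) / zinb_pmf(y, mu, theta, pi)

# Hypothetical parameters: mean 5, dispersion 2, 40% excess zeros.
print(nb_weight(0, mu=5.0, theta=2.0, pi=0.4))  # zeros get a small weight
print(nb_weight(3, mu=5.0, theta=2.0, pi=0.4))  # non-zero counts keep full weight
```

Non-zero counts can only come from the NB component, so only the zeros are downweighted.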
Continuous DE
- following SC across developmental trajectory
- slingshot tool (Street et al. 2018)
What do you do downstream of the trajectory analyses???
- most do DE between clusters
- which he argues is sub-optimal
- no fixed biological meaning
- heterogeneous
- inflate FDR with so many comparisons required
Propose NB-GAM
- so smooth out pseudotime trajectory with offset of sequencing depth (or other covariates?)
- cubic spline smoother right now
- this is pretty slick
- “very preliminary”
Barbara Engelhardt (Engelhardt)
A generative model for single-cell RNA-sequencing
Goal: develop low-dimensional generative model to address confounders
What is useful about this?
- normalization
- batch correction
- imputation
- visualization
Student’s t Gaussian process latent variable model (tGPLVM)
Compare viz results of tGPLVM to t-SNE, PCA, ZIFA
Many many examples … hard to explain. Looks roughly comparable to ZIFA, t-SNE for cell type / gene expression
Nice trajectory tree plotting
- absolutely no idea how the plot was made. Some kind of branching tree thing
Jeff Gaither (White)
Constraint for mRNA structure in human synonymous mutations
Synonymous mutations that alter mRNA stability have a lower incidence rate in humans
AT -> CG stabilizing
CG -> AT de-stabilizing
Bigger shifts in stability (their own metric, no idea what it is) are less common (gnomAD)
mRNA structure previously implicated in
- translation speed
- mRNA half life
- miRNA/protein binding
Measure ‘deform structure’ with ViennaRNA
dMFE is delta minimum free energy
- low is stabilizing
- high is de-stabilizing
CpG transitions that are destabilizing are selected against
But is this causative???
- CpG content is partially responsible
- as is GC content
Fiona Dick (Tzoulis)
Differential isoform usage in Parkinson’s disease
Discussing DTE vs DTU
- DT is for Differential Transcript
- E is for expression
- U is for usage
U is more for proportion of transcript
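To make the DTE vs DTU distinction concrete for myself: usage is each transcript's share of its gene's total. The abundances below are invented for one hypothetical two-transcript gene.

```python
def usage(tx_counts):
    """Convert per-transcript abundances for one gene into usage proportions."""
    total = sum(tx_counts.values())
    return {tx: count / total for tx, count in tx_counts.items()}

# Hypothetical abundances for one gene with two transcripts.
control = {"tx1": 80.0, "tx2": 20.0}
case_dte_only = {"tx1": 160.0, "tx2": 40.0}  # both transcripts double: DTE, no DTU
case_dtu = {"tx1": 30.0, "tx2": 70.0}        # gene total unchanged, proportions flip: DTU

print(usage(control))
print(usage(case_dte_only))
print(usage(case_dtu))
```

In the second case a gene-level or transcript-level expression test would fire while the usage proportions are unchanged; in the third the gene total is flat but the dominant isoform switches.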
Ribo-Zero capture (reduce 5’ 3’ bias) -> Salmon -> tximport -> EVERYTHING*
- DRIMSeq -> stageR
- DEXSeq -> stageR
- ISAR + DRIMSeq
- ISAR (IsoformSwitchAnalyzeR)
Intersected all results -> OI ANOTHER CRAZY VENN
ISAR returned far fewer results
- it filters out a ton of transcripts
Of the 6 genes that all tools agree -> 5 switch from protein coding to not protein coding
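A less painful alternative to the crazy Venn is to tabulate which exact combination of tools called each gene (the counts an UpSet plot shows). Sketch below; the tool names are from the talk but the gene IDs and memberships are placeholders I made up.

```python
from collections import Counter

def exclusive_intersections(results):
    """Count genes by the exact combination of tools that called them."""
    membership = {}
    for tool, genes in results.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(tool)
    return Counter(frozenset(tools) for tools in membership.values())

# Hypothetical significant-gene sets per tool (placeholder gene IDs).
results = {
    "DRIMSeq": {"g1", "g2", "g3"},
    "DEXSeq": {"g2", "g3", "g4"},
    "ISAR": {"g3"},
}

counts = exclusive_intersections(results)
for combo, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(sorted(combo), n)

# Genes every tool agrees on.
print(set.intersection(*results.values()))
```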
Epigenetics and non-coding genome
Jordana Bell (Bell)
Interpreting variation in the human methylome
DNA methylation heritability in TwinsUK
- 400 female twins
- Imputed genotypes
- Infinium MethylationEPIC (772k CpGs after QC)
Heritability of MZ vs DZ twins
- MZ have higher cor than DZ, which is higher than unrelated
- calculated heritability (h2) > 40% in 10.5% of methylome
- enhancer more heritable, promoter less so (really?)
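I did not catch how h2 was actually estimated (twin studies often fit a variance-components model). As a back-of-envelope stand-in, Falconer's formula gets h2 from the MZ and DZ correlations at a CpG; the correlations below are hypothetical.

```python
def falconer_h2(r_mz, r_dz):
    """Falconer's estimate: h2 = 2 * (r_MZ - r_DZ), clipped to [0, 1]."""
    return min(1.0, max(0.0, 2.0 * (r_mz - r_dz)))

# Hypothetical twin-pair correlations of methylation at one CpG.
print(falconer_h2(r_mz=0.60, r_dz=0.35))
```

The logic: MZ twins share essentially all of their genome and DZ twins about half on average, so doubling the gap in correlation gives a rough estimate of the genetic share of variance.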
meQTLs
- found a bunch (200k)
- more in CpG islands, promoters
Extending/validating in new cohort
- another 2k people
GoDMC consortium
- godmc.org.uk
- blood, Illumina 450k, 1000G imputation
- 27k samples
- many many covariates
- 190k (45%) of CpGs have >=1 mQTL
- 90% in cis
- Grouped CpGs with many cis and trans links (“community”)
Wouter Meuleman (Stamatoyannopoulos)
Delineation and annotation of the human regulatory landscape across 400+ cell types and states
@nameluem
Quick overview of DNaseI HS assay
400+ cell types and states
- 733 DNase-seq
- 439 unique cellular combinations
Convert continuous signal into consensus summits
- 3.5 million+ DNase I HS sites
- 21% of genome
- primarily distal to TSS
- minority of DHSs are constitutive
- most are unique to sample / cell
Decompose DHS x cell matrix with non-negative matrix factorization (NMF)
- picked k of 16
- k is kind of like cluster
- which they hand assign to cell type / role
- Can even further drop down and assign genes to predominant k / color / assignment
- 20% of protein coding genes can get a label (Bonferroni FWER < 0.05)
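A tiny pure-Python sketch of the NMF idea on a toy DHS x sample matrix, then labeling each DHS by its dominant component. Assumptions: the matrix is invented, k is 2 rather than 16, and I use the textbook Lee and Seung multiplicative updates; I don't know which NMF solver or initialization the speaker used.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of list-of-lists a (n x k) and b (k x m)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5

def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x k) and H (k x m) with
    multiplicative updates for the Frobenius objective."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

# Toy signal: rows are DHSs, columns are samples (two made-up cell-type groups).
v = [[5, 4, 0, 0], [4, 5, 0, 0], [6, 5, 0, 0],
     [0, 0, 3, 4], [0, 0, 4, 3], [0, 0, 5, 5]]
w, h = nmf(v, k=2)
labels = [row.index(max(row)) for row in w]  # dominant component per DHS
print(labels, round(frobenius_error(v, w, h), 3))
```

The per-DHS label here plays the role of the k / color assignment in the talk; the biological meaning of each component still has to be assigned by hand.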
Interpret genetic variation
- match against GWAS catalogs
- find matches for k / color against GWAS diseases
- regulatory terms informative in GWAS
Very cool work. I’ve found in my Distill project that the k/cluster assignment across the genome provided value in assigning pathogenicity to DNA mutations
Maša Roller (Flicek)
Tissue-specific enhancer and promoter evolution in mammals
Gene expression / promoters more conserved than enhancers across great evolutionary distance
4 tissues, 10 mammals, 3 ChIP-seq, RNA-seq, 3 biological replicates
- H3K4me3, H3K27ac, H3K4me1 used to delineate active promoters, active enhancers, primed enhancers
- fairly stable n of promoters, enhancers across evolution
- brain has largest num of active enhancers
Maša IS USING AN UpSET I AM SO HAPPY
- showing counts of tissue shared and tissue specific enhancers
- active are more tissue specific, poised much less so
- this would be a nightmare with a venn diagram
Primed enhancers evolving more rapidly than active enhancers
- or are less conserved
- tissue specific even less conserved
Regulatory element re-use
- enhancers used over and over again across tissues through evolution
- implication is that sets of enhancers crucial for activity?
Alexander Suh (Suh)
Mind the gap – interrogating the non-coding genome with single-molecule technologies
Discussion of genome assembly gaps with short reads with the puzzle analogy
Chicken genome is well assembled (big $ in chickens)
Look at bird of paradise
Paradise grove bird used as reference
- PacBio
- Dovetail
- Illumina
- 10X linked-reads
- Hi-C (Phase Genomics)
Birds full of repeats
- especially sex chr
- make assembly tricky
- crazy collinearity plots (not sure if this is the right name for the plots that lay out chr level genome vs genome alignment)
- SO MANY LTRs in the sex chr for the paradise grove bird
- hundreds vs tens for the autosomes
How to align over complex / long repeats?
- optical mapping helps get longer (20kb -> 500kb or so)
Raquel Garcia-Perez (Juan)
Recent evolution of the epigenetic regulatory landscape in human and other primates
Comparative epigenomics
WGBS, ChIP-seq (main ones), ATAC-Seq, RNA-Seq
Across primates
What tissue(s)???
chromHMM -> 16 state model, orthologous regulatory elements
Gene annotation + 3C chromatin interactions = match enhancers with promoters or enhancers
trying to model gene expression differences with histone marks
- semi-successful (explain 40% of variance)