#GI2018 - Day Three

Sep 19, 2018 9 min read GI2018, bioinformatics

Day 3

Very sparse and poorly written notes covering #GI2018.

Typos everywhere. Things may change dramatically over time as I scan back through notes.

I’ve tried to respect #notwitter. Will be updated periodically.

Speaker (Lab | Group)

Rafael Irizarry Keynote

Understanding variability in high throughput data

Typical workflow: raw data -> pre-processing -> analysis -> discovery (false?????)

“When you find an unexpected result, be skeptical, check for systematic errors”

“Always, look at the data”

“Dynamite plots need to die” (barplot + SE/SD)

Plug for a blog post by me: http://davemcg.github.io/post/let-s-plot-4-r-vs-excel/
Show the data (boxplot + individual points)

Showed a batch effect example where a huge difference in gene expression between groups actually derives from technical issues (the two groups were effectively processed at different times)

Lin et al. 2014 (notorious paper comparing transcription between human and mouse that claimed different tissues within a species are more similar to each other than the same tissue is across species)

Yoav Gilad quickly pointed out that the mouse and human were done on different sequencers

So they re-did it on the same sequencer… and got a similar result

Not a new “result” (Yanai 2004 had a similar conclusion with microarray)

Rafael pointing out that high genes are high and low are low…always, so correlation will always be high.

Once you fix probe bias (microarray), then tissues cluster together.
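
The "high genes are high, low genes are low" point can be checked with a quick simulation. This is a minimal sketch with made-up numbers (the baseline and noise scales are assumptions, not values from the talk): two samples that share nothing except a gene-level baseline still correlate strongly, and the correlation vanishes once that baseline is removed.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation, no external libraries."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
n_genes = 2000

# Hypothetical log-scale baseline: some genes are always high, some always low.
baseline = [rng.gauss(5.0, 3.0) for _ in range(n_genes)]

# Two unrelated samples: same baseline plus independent noise, no shared biology.
sample_a = [b + rng.gauss(0.0, 1.0) for b in baseline]
sample_b = [b + rng.gauss(0.0, 1.0) for b in baseline]

raw_r = pearson(sample_a, sample_b)

# Remove the gene-level effect (known exactly here because we simulated it).
resid_a = [a - b for a, b in zip(sample_a, baseline)]
resid_b = [s - b for s, b in zip(sample_b, baseline)]
resid_r = pearson(resid_a, resid_b)

print("raw correlation:", round(raw_r, 2))
print("after removing gene baseline:", round(resid_r, 2))
```

In real data the baseline is not known and has to be estimated (for example, per-gene means across many samples), which is where probe bias and similar gene-specific technical effects come in.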

But what about Lin RNA-seq?

A lot is explained by:

number of transcripts
GC content

Single Cell RNA-seq

trying to disentangle batch effect from tumor type
actually proportion of zeros is driving variability
- this is probably not biology

Do DNAm(ethylation) changes drive gene expression changes?

Discussing Ford et al. 2017 bioRxiv

“Distinguishing consistent differences from random ones is challenging”

Talking about how technical variability and biological variability are different and must be considered when designing experiment (moar replicates!)

Do we trust individual measurements?

when you do lots of tests, individual tests can be wrong
region test was done for CpG methylation (DMR test Jaffe 2012)
can also shuffle data (shuffle samples, not points) and see how often a diff can occur by chance
region + bootstrapping powerful approach (plus looking at replicates individually!!!)
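
A minimal sketch of the "shuffle samples, not points" idea, assuming a simple region statistic (mean case-vs-control methylation difference across the CpGs in a region). The statistic, function names, and toy data are all hypothetical; this is not the DMR test from Jaffe 2012.

```python
import random

def mean(values):
    return sum(values) / len(values)

def region_difference(samples, labels):
    """Absolute case-vs-control difference, averaged over the CpGs in a region."""
    cases = [s for s, lab in zip(samples, labels) if lab == 1]
    controls = [s for s, lab in zip(samples, labels) if lab == 0]
    n_cpg = len(samples[0])
    per_cpg = [mean([s[j] for s in cases]) - mean([s[j] for s in controls])
               for j in range(n_cpg)]
    return abs(mean(per_cpg))

def permutation_pvalue(samples, labels, n_perm=2000, seed=0):
    """How often does a label shuffle give a difference at least as large?"""
    rng = random.Random(seed)
    observed = region_difference(samples, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Whole samples move between groups; each sample's CpG profile stays intact.
        rng.shuffle(shuffled)
        if region_difference(samples, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy region with 3 CpGs: 5 cases (~0.8 methylation) and 5 controls (~0.2).
samples = [
    [0.80, 0.82, 0.79], [0.78, 0.81, 0.80], [0.83, 0.80, 0.81],
    [0.79, 0.80, 0.82], [0.81, 0.79, 0.80],
    [0.20, 0.22, 0.19], [0.18, 0.21, 0.20], [0.23, 0.20, 0.21],
    [0.19, 0.20, 0.22], [0.21, 0.19, 0.20],
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_value = permutation_pvalue(samples, labels)
print("permutation p-value:", p_value)
```

Shuffling individual CpG values instead would break the correlation between neighboring CpGs within a sample and make the null look far tighter than it is.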

Session 4: Transcriptomics, Alternative Splicing and Gene Predictions

Mark Robinson (Robinson)

On the analysis of long-read sequencing data for gene expression

Illumina <-> ONT (Oxford Nanopore)
cheap <-> expensive
short fragments <-> full length
lower error rates <-> higher error rates

DataSet

WT vs Srpk1-KO
ONT (various preps) and Illumina
1M reads / sample (ONT); 30M reads / sample (Illumina)

25% mismatch with ONT vs <1% for Illumina (but they still map)

ONT doesn’t look useful for variation in sequence (way too noisy)

Do we actually get full length transcripts?

for longer ones…not really
showing how a longer transcript has worse coverage with ONT

Quantification ONT <-> Illumina

looks pretty good for protein coding genes
more discrepancy at the tx level (still OK)

ONT

minimap2 + salmon current choice for tooling

ONT direct RNA or ONT cDNA???

look pretty comparable
direct RNA requires crazy input amounts, probably not worth it unless you are looking at RNA base modifications
ONT has new stranded cDNA protocol / kit

Diff expression analysis

once you account for way lower depth (again 1M vs 30M) for ONT, roughly comparable
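
How the depth difference was accounted for isn't recorded here. One common way to put two platforms on an equal footing is binomial thinning: keep each read with a fixed probability so the deep library shrinks to the shallow one's size. A minimal sketch with toy counts (the function name and numbers are hypothetical):

```python
import random

def downsample_counts(counts, target_total, seed=0):
    """Binomial thinning: keep each read independently with probability target/total."""
    rng = random.Random(seed)
    total = sum(counts)
    if target_total >= total:
        return list(counts)
    keep = target_total / total
    return [sum(1 for _ in range(c) if rng.random() < keep) for c in counts]

# Toy per-gene counts standing in for a deep library; thin to 1/30 of the depth,
# mirroring the 30M vs 1M reads per sample ratio.
deep_counts = [3000, 1500, 300, 30, 0]
target = sum(deep_counts) // 30
thinned = downsample_counts(deep_counts, target)
print(deep_counts, "->", thinned)
```

Low-count genes often drop to zero after thinning, which is a reminder of how little power 1M reads leaves for weakly expressed genes.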

Definitely not yet time to jump over to ONT for more standard DE-type RNA-seq tests

Hagen Tilgner (Tilgner)

Single-cell isoform RNA sequencing (ScISOr-Seq) across thousands of cells reveals isoforms of cerebellar cell types.

Need long-read RNA seq to confidently ID isoforms

ScISOr-Seq

10X uses 3’ sequencing
Add PacBio ISO-Seq and find the 10X barcode
blend 10X and PacBio data

Barcode ID (long read data can be noisy…)

99.99% specificity with PacBio
Only 58% of molecules have barcode (74% in Illumina)
- not catastrophic, more annoying

It works

validate some junctions with MS/MS

Plug for isoformAtlas.com

See some compelling cell type specific isoform usage

Nikka Keivanfar (Church, 10X)

Bootstrapping Biology: Quick and easy de novo genome assembly to enable single cell gene expression analysis

Linked-Reads + Supernova + Cactus/CAT

Supernova does assembly (with phase)

Supernova v2 can get N50 of 100+ kb for many species (Human, Hummingbird, Drosophila)

Zebrafish is tougher (only around 20kb)
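
For reference, N50 is the length L such that sequences of length L or longer contain at least half of the assembled bases. A minimal sketch (the function name and toy lengths are mine, not from Supernova):

```python
def n50(lengths):
    """Length at which the running sum of sorted lengths reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")

contigs = [80, 70, 50, 40, 30, 20, 10]  # toy lengths in kb
print(n50(contigs))  # 70
```

The same function works for contig or scaffold N50; only the input list changes.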

CAT

Comparative Annotation Toolkit
“provides a flexible way to simultaneously annotate entire clades and identify orthology relationships”
doi.org/10.1101/231118
https://genome.cshlp.org/content/28/7/1029

“Blackjack” (Donkey) blood sequenced with 10X

made assembly with Supernova
better metrics than other donkey genomes

Used assembly (+CAT to get gene info) for scRNA-seq of the blood

labels major blood cell types (T cells, B cells, NK, monocytes, plasma, etc)

Koen Van den Berge (Clement)

Discrete and continuous differential expression analysis for single-cell RNA-seq data

scRNA-seq data is WAY noisier than bulk RNA-seq (at least cell to cell)

Bulk RNAseq generally use negative binomial models (edgeR, DESeq2)

Sort of surprisingly, these bulk RNAseq methods seem to work well in scRNA

But zeros… showing simulations where adding zeros produces weird dispersion patterns

Why not ZINB-WaVE??

In real data also see some odd dispersion

Propose zero inflated negative binomial (ZINB) distribution

Oh, this is the ZINB-WaVE author

Downweighting excess zeros helps reduce variance and improve power

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1406-4
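
A minimal sketch of the downweighting idea, assuming the usual ZINB parameterization (NB mean `mu`, dispersion `theta` with variance mu + mu^2/theta, zero-inflation probability `pi`). The weight for an observed zero is the posterior probability that it came from the NB count component rather than the excess-zero component. The parameter values below are made up and this is not the ZINB-WaVE code.

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu > 0 and dispersion theta > 0."""
    log_p = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def zinb_pmf(y, mu, theta, pi):
    """Mixture: point mass at zero with probability pi, NB otherwise."""
    point_mass = pi if y == 0 else 0.0
    return point_mass + (1.0 - pi) * nb_pmf(y, mu, theta)

def observation_weight(y, mu, theta, pi):
    """Posterior probability that the observation belongs to the NB component."""
    if y > 0:
        return 1.0  # a positive count cannot be an excess zero
    nb_zero = (1.0 - pi) * nb_pmf(0, mu, theta)
    return nb_zero / (pi + nb_zero)

# Made-up parameters: a moderately expressed gene with 30% excess zeros.
mu, theta, pi = 5.0, 2.0, 0.3
print("P(y = 0):", round(zinb_pmf(0, mu, theta, pi), 3))
print("weight for a zero:", round(observation_weight(0, mu, theta, pi), 3))
print("weight for a count of 3:", observation_weight(3, mu, theta, pi))
```

Weights like these can then be passed as observation weights to an NB-based DE method, so likely excess zeros contribute less to the dispersion estimate.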

Continuous DE

following single cells across a developmental trajectory
slingshot tool (Street et al. 2018)
- https://github.com/kstreet13/slingshot

What do you do downstream of the trajectory analyses???

most do DE between clusters
which he argues is sub-optimal
- no fixed biological meaning
- heterogeneous
- inflate FDR with so many comparisons required

Propose NB-GAM

so smooth out pseudotime trajectory with offset of sequencing depth (or other covariates?)
cubic spline smoother right now
this is pretty slick
“very preliminary”

Barbara Engelhardt (Engelhardt)

A generative model for single-cell RNA-sequencing

Goal: develop low-dimensional generative model to address confounders

What is useful about this?

normalization
batch correction
imputation
visualization

Student’s t Gaussian process latent variable model (tGPLVM)

Compare viz results of tGPLVM to t-SNE, PCA, ZIFA

Many many examples … hard to explain. Looks roughly comparable to ZIFA, t-SNE for cell type / gene expression

Nice trajectory tree plotting

absolutely no idea how the plot was made. Some kind of branching tree thing

Jeff Gaither (White)

Constraint for mRNA structure in human synonymous mutations

Synonymous mutations that alter mRNA stability have a lower incidence rate in humans

AT -> CG stabilizing
CG -> AT de-stabilizing

Bigger shifts in stability (their own metric, no idea what it is) are less common (gnomAD)

mRNA structure previously implicated in

translation speed
mRNA half life
miRNA/protein binding

Measure ‘deform structure’ with ViennaRNA

dMFE is delta minimum free energy

low is stabilizing
high is de-stabilizing

CpG transitions that are destabilizing are selected against

But is this causative???

CpG content is partially responsible
as is GC content

Fiona Dick (Tzoulis)

Differential isoform usage in Parkinson’s disease

Discussing DTE vs DTU

DT is for Differential Transcript
E is for expression
U is for usage

U is more for proportion of transcript
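
To make the DTE vs DTU distinction concrete, here is a minimal sketch that turns transcript abundances into within-gene usage proportions. The transcript names and TPM-like values are hypothetical.

```python
from collections import defaultdict

def transcript_usage(tx_abundance, tx_to_gene):
    """Proportion of each gene's total abundance contributed by each transcript."""
    gene_total = defaultdict(float)
    for tx, value in tx_abundance.items():
        gene_total[tx_to_gene[tx]] += value
    usage = {}
    for tx, value in tx_abundance.items():
        total = gene_total[tx_to_gene[tx]]
        usage[tx] = value / total if total > 0 else float("nan")
    return usage

tx_to_gene = {"A.1": "geneA", "A.2": "geneA", "B.1": "geneB", "B.2": "geneB"}

# geneA doubles overall but keeps its isoform mix: DTE without DTU.
# geneB keeps its total but flips its dominant isoform: DTU.
control = {"A.1": 10.0, "A.2": 30.0, "B.1": 40.0, "B.2": 10.0}
case = {"A.1": 20.0, "A.2": 60.0, "B.1": 10.0, "B.2": 40.0}

usage_control = transcript_usage(control, tx_to_gene)
usage_case = transcript_usage(case, tx_to_gene)
for tx in sorted(tx_to_gene):
    print(tx, usage_control[tx], "->", usage_case[tx])
```

Tools such as DRIMSeq and DEXSeq then test whether these proportions differ between conditions with proper statistical models; the sketch only shows the quantity being tested.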

Ribo-Zero capture (reduce 5’ 3’ bias) -> Salmon -> Tximport -> EVERYTHING*

DRIMSeq -> stageR
DEXSeq -> stageR
ISAR + DRIMSeq
ISAR (IsoformSwitchAnalyzeR)

Intersected all results -> OI ANOTHER CRAZY VENN

ISAR returned far fewer results

it filters out a ton of transcripts

Of the 6 genes that all tools agree on -> 5 switch from protein coding to non-protein coding

Epigenetics and non-coding genome

Jordana Bell (Bell)

Interpreting variation in the human methylome

DNA methylation heritability in TwinsUK

400 female twins
Imputed genotypes
Infinium MethylationEPIC (772k CpGs after QC)

Heritability of MZ vs DZ twins

MZ have higher correlation than DZ, which is higher than unrelated
calculated heritability (h2) > 40% in 10.5% of methylome
enhancer more heritable, promoter less so (really?)
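
How h2 was estimated isn't recorded here. A classic back-of-envelope estimate from twin correlations is Falconer's formula, h2 = 2 * (r_MZ - r_DZ). The sketch below uses that formula as an assumption (cohort studies typically fit fuller variance-component models), with made-up correlations for a single CpG.

```python
def falconer_h2(r_mz, r_dz):
    """Falconer's estimate of heritability, clipped to the [0, 1] range."""
    h2 = 2.0 * (r_mz - r_dz)
    return min(1.0, max(0.0, h2))

# Made-up twin-pair correlations of methylation at one CpG.
r_mz, r_dz = 0.60, 0.35
print("h2 estimate:", round(falconer_h2(r_mz, r_dz), 2))
```

Running this per CpG and counting how many sites exceed 0.4 gives the kind of summary quoted above.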

meQTLs

found a bunch (200k)
more in CpG islands, promoters

Extending/validating in new cohort

another 2k people

GoDMC consortium

godmc.org.uk
blood, Illumina 450k, 1000G imputation
27k samples
many many covariates
190k (45%) of CpGs have >=1 mQTL
- 90% in cis
Grouped CpGs with many cis and trans links (“community”)

Wouter Meuleman (Stamatoyannopoulos)

Delineation and annotation of the human regulatory landscape across 400+ cell types and states

@nameluem

Quick overview of DNase I HS assay

400+ cell types and states

733 DNase-seq
439 unique cellular combinations

Convert continuous signal into consensus summits

3.5 million+ DNase I HS sites
21% of genome
primarily distal to TSS
minority of DHSs are constitutive
- most are unique to sample / cell

Decompose DHS x cell matrix with non-negative matrix factorization (NMF)

picked k of 16
k is kind of like cluster
which they hand assign to cell type / role
Can even further drop down and assign genes to predominant k / color / assignment
- 20% of protein coding genes can get a label (Bonferroni FWER < 0.05)
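
A toy sketch of the decomposition step, using the textbook Lee and Seung multiplicative updates in plain Python. The update rule, the tiny binary matrix, and k = 2 are my assumptions for illustration (the talk used k = 16 on a far larger matrix and I don't know which solver).

```python
import random

def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, n_iter=500, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Toy DHS x sample matrix: rows 0-2 open in samples 0-1, rows 3-4 open in
# samples 2-3, row 5 open everywhere (a "constitutive" site).
v = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0],
     [0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 1, 1]]
k = 2
w, h = nmf(v, k)

# Label each DHS by its dominant component (the "color" in the talk's terms).
labels = [max(range(k), key=lambda a: row[a]) for row in w]
print("reconstruction error:", round(frob_error(v, w, h), 4))
print("dominant component per DHS:", labels)
```

Because every entry of W and H stays non-negative, each component reads as an additive "program" of accessibility, which is what makes hand-assigning components to cell types plausible.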

Interpret genetic variation

match against GWAS catalogs
find matches for k / color against GWAS diseases
regulatory terms informative in GWAS

Very cool work. I’ve found in my Distill project that the k/cluster assignment across the genome provided value in assigning pathogenicity to DNA mutations

Maša Roller (Flicek)

Tissue-specific enhancer and promoter evolution in mammals

Gene expression / promoters more conserved than enhancers across great evolutionary distances

4 tissues, 10 mammals, 3 ChIP-seq, RNA-seq, 3 biological replicates

H3K4me3, H3K27ac, H3K4me1 used to delineate active promoters, active enhancers, primed enhancers
fairly stable number of promoters, enhancers across evolution
brain has largest number of active enhancers

Maša IS USING AN UpSet I AM SO HAPPY

showing counts of tissue shared and tissue specific enhancers
- active are more tissue specific, poised much less so
- this would be a nightmare with a venn diagram
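
What an UpSet plot actually draws is the count of elements in each exclusive combination of sets. A minimal sketch of that bookkeeping, with hypothetical tissue names and enhancer IDs (not the tissues or data from the talk):

```python
from collections import Counter

def exclusive_intersections(sets_by_name):
    """Count items by the exact combination of sets they belong to."""
    counts = Counter()
    all_items = set().union(*sets_by_name.values())
    for item in all_items:
        combo = frozenset(name for name, members in sets_by_name.items()
                          if item in members)
        counts[combo] += 1
    return counts

# Hypothetical active-enhancer calls per tissue.
enhancers = {
    "liver": {"e1", "e2", "e3"},
    "brain": {"e2", "e3", "e4", "e5"},
    "testis": {"e3", "e6"},
}
counts = exclusive_intersections(enhancers)
for combo, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(" & ".join(sorted(combo)), n)
```

With 4 tissues there are 15 possible combinations, each of which gets its own bar in an UpSet plot, which is exactly where a Venn diagram stops being readable.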

Primed enhancers evolving more rapidly than active enhancers

or are less conserved
tissue specific even less conserved

Regulatory element re-use

enhancers used over and over again across tissues through evolution
- implication is that sets of enhancers crucial for activity?

Alexander Suh (Suh)

Mind the gap – interrogating the non-coding genome with single-molecule technologies

Discussion of genome assembly gaps with short reads with the puzzle analogy

Chicken genome is well assembled (big $ in chickens)

Look at bird of paradise

Paradise grove bird used as reference

PacBio
Dovetail
Illumina
10X linked-reads
Hi-C (Phase Genomics)

Birds full of repeats

especially sex chr
make assembly tricky
crazy collinearity plots (not sure if this is the right name for the plots that lay out chr level genome vs genome alignment)
SO MANY LTRs in the sex chr for the paradise grove bird
- hundreds vs tens for the autosomes

How to align over complex / long repeats?

optical mapping helps get longer (20kb -> 500kb or so)

Raquel Garcia-Perez (Juan)

Recent evolution of the epigenetic regulatory landscape in human and other primates

Comparative epigenomics

WGBS
ChIP-seq (main ones)
ATAC-Seq
RNA-Seq

Across primates

What tissue(s)???

chromHMM -> 16 state model, orthologous regulatory elements

Gene annotation + 3C chromatin interactions = match enhancers with promoters or enhancers

trying to model gene expression differences with histone marks

semi-successful (explain 40% of variance)
