#GI2018 - Day Three

Day 3

Very sparse and poorly written notes covering #GI2018.

Typos everywhere. Things may change dramatically over time as I scan back through notes.

I’ve tried to respect #notwitter. Will be updated periodically.

Speaker (Lab | Group)

Rafael Irizarry Keynote

Understanding variabliity in high throughput data

Typical workflow raw data -> pre-processing -> analysis -> discovery (false?????)

“When you find an unexpected result, be skeptical, check for systematic errors”

“Always, look at the data”

“Dynamite plots needs to die” (barplot + SE/SD)

Showing batch effect example where huge diff in gene expression between groups is actually derived from technical issues (two groups were effectively done at different times)

Lin et al. 2014 (notorious paper comparing tx between human and mouse the claimed dif tissues within species more common to each other)

Yoav Gilad quickly pointed out that the mouse and human were done on different sequencers

So they re-did on same sequencer machine….got similar result

Not a new “result” (yanai 2004 had similar conclusion with microarray)

Rafael pointing out that high genes are high and low are low…always, so correlation will always be high.

Once you fix probe bias (microarray), then tissues cluster together.

But what about Lin RNA-seq?

A lot is explained by:

  • number of transcripts
  • GC content

Single Cell RNA-seq

  • trying to disentangle batch effect from tumor type
  • actually proportion of zeros is driving variability
    • this is probably not biology

Do DNAm(ethylation) changes drive gene expression changes?

Discussing Ford et al. 2017 bioRxiv

“Distinguishing consistent differences from random ones is challenging”

Talking about how technical variability and biological variability are different and must be considered when designing experiment (moar replicates!)

Do we trust individual measurements?

  • when you do lots of tests, individual tests can be wrong
  • region test was done for CpG methylation (DMR test Jaffe 2012)
  • can also shuffle data (shuffle samples, not points) and see how often a diff can occur by chance
  • region + bootstrapping powerful approach (plus looking at replicates individually!!!)

Session 4: Transcriptomics, Alternative Splicing and Gene Predictions

Mark Robinson (Robinson)

On the analysis of long-read sequencing data for gene expression

Illumina <-> ONT (Oxford Nanopore) cheap <-> expensive short fragments <-> full length lower error rates <-> higher error rates


  • WT vs Srpk1-KO
  • ONT (various preps) and Illumina
  • 1M reads / sample (ONT); 30M reads / sample (Illumina)

25% mismatch with ONT vs <1% for Illumina (but they still map)

ONT doesn’t look useful for variation in sequence (way too noisy)

Do we actually get full length transcripts?

  • for longer ones…not really
  • showing how a longer transcript has worse coverage with ONT

Quantification ONT <-> Illumina

  • looks pretty good for protein coding genes
  • more discrepancy at the tx level (still OK)


  • minimap2 + salmon current choice for tooling

ONT direct RNA or ONT cDNA???

  • look pretty comparable
  • direct RNA requires crazy input amounts, probably not worth it unless you are looking at RNA base modifications
  • ONT has new stranded cDNA protocol / kit

Diff expression analysis

  • once you account for way lower depth (again 1M vs 30M) for ONT, roughly comparable

Definitely not time to jump over for more standard DE type RNAseq tests

Hagen Tilgner (Tilgner)

Single-cell isoform RNA sequencing (ScISOr-Seq) across thousands of cells reveals isoforms of cerebellar cell types.

Need long-read RNA seq to confidently ID isoforms


  • 10X uses 3’ sequencing
  • Add PacBio ISO-Seq and find the 10X barcode
  • blend 10X and PacBio data

Barcode ID (long read data can be noisy…)

  • 99.99% specificity with PacBio
  • Only 58% of molecules have barcode (74% in Illumina)
    • not catastrophic, more annoying

It works

  • validate some junctions with MS/MS

Plug for isoformAtlas.com

See some compelling cell type specific isoform usage

Nikka Keivanfar (Church, 10X)

Bootstrapping Biology: Quick and easy de novo genome assembly to enable single cell gene expression analysis

Linked-Reads + Supernova + Cactus/CAT

Supernova does assembly (with phase)

Supernova v2 can get N50 over 100+ kb for many species (HUman, Hummingbird, Drosophila)

  • Zebrafish is tougher (only around 20kb)


“Blackjack” (Donkey) blood sequenced with 10X

  • made assembly with Supernova
  • better metrics than other donkey genomes

Used assembly (+CAT to get gene info) for scRNA-seq of the blood

  • labels major blood types (T cells, B cells, NK, monocytes, plasma, etc)

Koen Van den Berge (Clement)

Discrete and continuous differential expression analysis for single-cell RNA-seq data

scRNA-seq data is WAY noiser than bulk RNA-seq (at least cell to cell)

Bulk RNAseq generally use negative binomial models (edgeR, DESeq2)

Sort of surprisingly, these bulk RNAseq methods seem to work well in scRNA

But zeros…..showing simulations where adding zeros getting weird dispersion patterns

  • Why not ZINB-WAaE??

In real also see some odd dispersion

Propose zero inflated negative binomal (ZINB) distribution

  • Oh, this is the ZINB-WaVE author

Downweighting excess zeros helps reduce variance and improve power


Continuous DE

What do you do downstream of the trajectory analyses???

  • most do DE between clusters
  • which he argues is sub-optimal
    • no fixed biological meaning
    • hetergeneous
    • inflate FDR with so many comparisons required

Propose NB-GAM

  • so smooth out pseudotime trajectory with offset of sequencing depth (or other covariates?)
  • cubic spline smoother right now
  • this is pretty slick
  • “very preliminary”

Barbara Englehardt (Englehardt)

A generative model for single-cell RNA-sequencing

Goal: develop low-dimensional generative model to address confounders

What is useful about this?

  • normalization
  • batch correction
  • imputation
  • visualization

Student’s t Gaussian process laternt varaible model (tGPLVM)

Compare viz results of tGPLVM to t-SNE, PCA, ZIFA

Many many examples … hard to explain. Looks roughly comparable to ZIFA, t-SNE for cell type / gene expression

Nice trajectory tree plotting

  • absolutely no idea how the plot was made. Some kind of branching tree thing

Jeff Gaither (White)

Constraint for mRNA structure in human synonymous mutations

Synonymous mutations that alter mRNA stability have lower incidence rate in humas

AT -> CG stabilizing CG -> AT de-stabilizing

Bigger shift in stability (their own metric, no idea what it is) are less common (gnomAD)

mRNA sturcture previously implicated in

  • translation speed
  • mRNA half life
  • miRNA/protein binding

Measure ‘deform structure’ with ViennaRNA

dMFE is delta minimum free energy

  • low is stabilizing
  • high is de-stabilizing

CpG transitions that are destabilizing are selected against

But is this causative???

  • CpG content is partially responsible
  • as is GC content

Fiona Dick (Tzoulis)

Differential isoform usage in Parkinson’s disease

Discussing DTE vs DTU E is for expression U is for usage DT is for Differential Transcript

U is more for proportion of transcript

Ribo-Zero capture (reduce 5’ 3’ bias) -> Salmon -> Tximport -> EVERYTHING*

  • DRIMSeq –> StageR
  • DEXSeq –> StageR
  • ISAR (IsoformSwitchAnalyzeR)

Intersected all results -> OI ANOTHER CRAZY VENN

ISAR returned far fewer results

  • it filters out a ton of transcripts

Of the 6 genes that all tools agree -> 5 switch from protein coding to not protein coding

Epigenetics and non-coding genome

Jordana Bell (Bell)

Interpreting variation in the human methylome

DNA methylation heritability in TwinSUK

  • 400 female twins
  • Imputed genotypes
  • Infinium MethylationEPIC (772k CpGs after QC)

Heritability of MZ vs DZ twins

  • MZ have higher cor, than DZ, which is higher than unrelated
  • calculated heritability (h2)> 40% in 10.5% of methylome
  • enhancer more heritable, promoter less so (really?)


  • found a bunch (200k)
  • more in CpG islands, promoters

Extending/validating in new cohort

  • another 2k people

GoDMC consortium

  • godmc.org.uk
  • blood, Illumina 450k, 1000G imputation
  • 27k samples
  • many many covariates
  • 190k (45%) of CpGs have >=1 mQTL
    • 90% in cis
  • Grouped CpGs with many cis and trans links (“community”)

Wouter Meuleman (Stamatoyannopoulos)

Delineation and annotation of the human regulatory landscape across 400+ cell types and states


Quick overview of DNaseI HS assay

400+ cell types and states

  • 733 DNase-seq
  • 439 unique cellular combinations

Convert continuous signal into consensus summits

  • 3.5 million+ DNase1 HS sites
  • 21% of genome
  • primarily distal to TSS
  • minority of DHs are constitutive
    • most are unique to sample / cell

Decompose DHS x cell matrix into Non-negative matrix factorization (NMF)

  • picked k of 16
  • k is kind of like cluster
  • which they hand assign to cell type / role
  • Can even further drop down and assign genes to predominant k / color / assignment
    • 20% of protein coding genes can get a label (Bonferonni FWER < 0.05)

Interpret genetic variation

  • match against GWAS catalogs
  • find matches for k / color against GWAS diseases
  • regulatory terms informative in GWAS

Very cool work, I’ve found in my Distill project that the k/cluster assignment across the genome provided value in assignment of pathogenicity for DNA mutation

Maša Roller (Flicek)

Tissue-specific enhancer and promoter evolution in mammals

Gene expression / promoters more conserved than enhancers across great evolutionary distance

4 tissues, 10 mammals, 3 ChIP-seq, RNA-seq, 3 biological replicates

  • H3K4me3, H3K27ac, H3K4me1 used to delineate active promoters, active enhnancers, primed enhancers
  • fairly stable n of promoters, enhancers across evolution
  • brain has largest num of active enhnacers


  • showing counts of tissue shared and tissue specific enhnacers
    • active are more tissue specific, poised much less so
    • this would be a nightmare with a venn diagram

Primed enhancers evolving more rapidly than active enhancers

  • or are less conserved
  • tissue specific even less conserved

Regulatory element re-use

  • enhancers used over and over again across tissues through evolution
    • implication is that sets of enhancers crucial for activity?

Alexander Suh (Suh)

Mind the gap – interrogating the non-coding genome with single-molecule technologies

Discussion of genome assembly gaps with short reads with the puzzle analogy

Chicken genome is well assembled (big $ in chickens)

Look at bird of paradise

Paradise grove bird used as reference

  • Dovetail
  • Illumina
  • 10X linked-reads
  • Hi-C PHASE genomics

Birds full of repeats

  • especially sex chr
  • make assembly tricky
  • crazy collinearity plots (not sure if this is the right name for the plots that lay out chr level genome vs genome alignment)
  • SO MANY LTRs in the sex chr for the paradise grove bird
    • hundreds vs tens for the autosomes

How to align over complex / long repeats?

  • optical mapping helps get longer (20kb -> 500kb or so)

Raquel Garcia-Perez (Juan)

Recent evolution of the epigenetic regulatory landscape in human and other primates

Comparative epigenomics

WGBS ChIP-seq (main ones) ATAC-Seq RNA-Seq

Across primates

What tissue(s)???

chromHMM -> 16 state model, orthologous regulatory elements

Gene annotation + 3C chromatin interactions = match enhancers with promoters or enhancers

trying to model gene expression differences with histone marks

  • semi-succesful (explain 40% of variance)


comments powered by Disqus