Day Two
Personal and Medical Genomics
Very sparse and poorly written notes covering #GI2018.
Typos everywhere. Things may change dramatically over time as I scan back through notes.
I’ve tried to respect #notwitter. Will be updated periodically.
Speaker (Lab | Group)
Katie Pollard Keynote
A population genetic view of human chromatin organization
Noncoding human mutation and disease
- which are causal?
- how do they alter gene expression?
- what genes / pathways do they affect?
core idea:
- motif disruption (i.e. mutation alter binding site)
but, how about we think about genome as structure?
- if structure (3D folding) is functional, any variation that modifies that would be deleterious
chromatin structure and pop genetics
discussion of TADs / HiC / BE (boundary elements)
compare rare variants in patients, variants in healthy humans, fixed diffs between primates
- deletions depleted at boundary elements (strong negative selection)
- also holds up across non-primates
- less depletion in disease cohort (autism) variants
“suggestion broad role for enhancer hijacking in disease”
noncoding variation assessment should be TAD aware
try to predict how mutations change TAD/BE
- kind of working (not as “sharp” as the real data)
chromatin and genetic structure (LD) both have similar struture
- are they correlated?
- not the same sizes (LD blocks shorter)
- not correlated at 5kb to 2Mb
chromatin interactions (HiC) more enriched for (compared to closest gene or LD):
- eQTL
- GO enrichment terms
- “most distal noncoding variants do not target closest expressed gene”
sequence motifs fail to explain key aspects of protein-DNA binding
- 30% of top 2000 ChIP-seq peaks have no sequence motif
- many DNA binding proteins have very similar motifs but have disparate binding
- seq motifs don’t capture DNA shape well (major / minor groove)
hypothesis: “DNA binding proteins recognize DNA shape beyond seq. motifs”
- ShapeMF (DNAshape R package)
- compare to classic motif with ENCODE ChIP-seq and HT-SELEX
shape motifs match up between ENCODE and HT-SELEX
shape motifs can occur without sequence motifs
- most have both shape and motif patterns
- some shape only and some motif only
Machine learning in genomics comments (Pollard kindly taking time out her talk to discuss geneeral ML issues)
- try to model enhancer interaction with konwn enhancers, expressed genes, Hi-C data (TargetFinder)
- common issues:
- balanced training data (genomic data usually isn’t)
- select features from whole dataset (feature selection should be inside cross validation)
- use biologically appropriate examples
- use the right statistic (PR, ROC, etc?)
- most ML models assume observations are independent and identically distributed
- genomic data is inherently not independent and identically distributed
- not trivial….different genome regions are inherently different (model doesn’t transfer)
Sri Kosuri (Kosuri)
Functional Impacts Of Rare Genetic Variation On Exome Recognition
Experimental assays (but huge)
multiplexed reporter assays (100k+ experiments) + DNA synthesis to do reverse genomics
https://www.biorxiv.org/content/early/2018/03/10/199927
Average genome has 4-5 million common variants and 50-100K rare variants (<0.5% AF)
how do rare variants exert large effect changes?
- you see them in GTEx (under and over expression outliers)
Massively parallel exon splicing reporter
- which mutations cause exon splicing issues
- assayed 28k across 2,200 exon
- mostly at splice junctions
- but you still see plenty in exons and further away
- in silico predictions are pretty bad (splicing specific models OK)
- 3.8% (1,505 / 27,900) led to large loss of exon recognition
- can’t predict them…. (with existing tools)
- nice to see a whole slide on why this isn’t the most perfect experiment ever
- short (100nt) exons used (biased)
Patrick Brennan (Nationwide Children’s)
Integration of whole genome, whole exome, and transcriptome pipeline for comprehensive genomic profiling of 60 pediatric cancer subjects
250x WES, 60X WGS (tumor + normal) RNA-seq (tumor only)
7 diff fusion detection algorithms
- ugh, 7 way venn
- they focus on the few which are consistently called (or rather ranked on overlap and evidence support)
RNA-seq
- DESeq, Salmon, then Ingenuity for pathway analysis (TPM as input)
Differential comparison (RNA)
- but to what???
- oh the ‘treehouse’ data (not sure what this is)
Case study matching indel in gene with RNA expression difference, which helped to suggest tyrosine kinase inhibitor treatment
Case study IDing fusion gene between MET and RBPMS (chr 7 <-> chr 8)
Keep using ‘Churchill’ term for their software….annoying they don’t credit the tools that actually power it (GATK?)
Kaitlin Samocha (Hurles)
Evaluating the role of rare variation in children with developmental disorders
DDD
- represent UK well
- diagnostic rate of 35%
4,300 trios have 94 genes with high gene burden
- gene test with poisson tests outputting p values for LoF, missense, synonymous
New approach
- weight different types of mutations separately
- gene score composed of weighted scores
- multiple hypothesis correction (use prior knowledge)
where do weights come from?
- observations from DDD cohort, comparing DDD enrichment by class (LoF, etc) to CADD
new method
- finds more genes relative to poisson (old) method
- applying to 10k trios
inherited variation
- do cases have more rare damaging variants
- cadd > 21
- observed once or more ExAC
- effect size of ~1.7 for probands with affect parents
- ~1.2 for other probands
causative enriched in high pLI genes
is this from cryptic recessive diagnoses?
- probably not
- high pLI genes not enriched for recessive disease genes
digenic inheritance?
- parents each have damaging mutations (but dif from each other)
- probably not (no strong enrichment in undiagnosed vs diagnosed)
Tracy Ballinger (Semple)
Modeling Double Strand Breaks to Interrogate Structural Variation in Cancer
SV (structural variation) prevalent across cancer
Nonrandom distribution
- chromatin structure, sequence composition, and nuclear organization influence SV
built Random Forest model that correlates well with real data
- Replication seq, open chromatin (DNase), pol2db, most important features
- predicts well across different cell types
Compare SV prediction to cancer breakpoints
- ID regions with more or less SV breakpoints than expected
- also “passenger” regions which are not enriched
cancer ‘coldspots’:
- enriched for active chromatin and deplted in fragile sites
Lucia Spangenberg (Naya)
Retrieving Charras genomic tracts from the Uruguayan admixed population
Charras all killed by state 187 years ago
80 Uruguayan genomes (Just 10 for this project?)
Q1: Still have Charrua in Uruguayan genomes?
Calculated local ancestry (rel to EUR or AFR)
PCA to see clustering of Native Uruguayan. Also had 20 genomes from Simons
Want to calculate Fst Tree with other ancient/original ameican populations
- clusters with Guarani and Chane (expected!)
Build protogenome by merging the 10 WGS they did
- 99% covered
Comparative, Evolutionary, Metagenomics
Ellen Leffler (Kwiatkowski)
Paired sequencing of host and parasite genomes in severe malaria cases
191 million cases / year!
plasmodium falciparum (most commmon in Africa) and vivax (more common in not Africa)
MalariaGEN
- GWAS for severe malaria cases (falciparum), matched population controls
- found new structural var in GYPA/B/E
Big Questions:
- parasite variation?
- genotype <-> parasite susceptibilty?
- co-evolution?
- individual disease risk?
can we sequence parasite DNA from human blood samples?
- 1 - 5 million parasites per ul blood
- napkin math gives 80X+ coverage for over 10,000 parasites / ul
align WGS to human and parasite genome
- mean 197X coverage
- eyeball median of 100X?
selective amplification of plasmodium falciparum gneome which gives ~50X enrichment
1.5 million SNPs total (parasite) 10 million SNPs total (human)
host-parasite association (looking at merozoite SNPs, which interacts with human)
- get one nice looking peak in parasite gene MSP4 / MSP2 and human GCTN2 and GBGT1
Luca Penso-Dolfin (Di Palma)
The role of structural variants in the adaptive radiation of African Cichlids
African cichclids are canonical example of adaptive radition
Also, are freshwater fish, if you didn’t know
Hundreds of species, variation in jaw morphology, color, vision, behavior
Most studies focus on gene evolution
Looking at structural variation across three great East African lakes
M. zebra, P. nyererei, A. burtoni, N. brichardi (already sequenced)
One outgroup (O. niloticus)
Talking about SV identification using paired end reads and looking for internal distance and split reads
- Is he rolling his own SV caller???
FIRST NO_TWITTER!
- SV pipeline….which uses existing SV tools….not certain why NO_TWITTER…but I won’t give any details
duplication net gain of 9.2 events per million years across the four species
lot fewer inversions
- about 0.5 events per million years
SECOND NO_TWITTER
- cool enrichment of some genes / GO terms in inversions
Mario Caccamo (NIAB)
Understanding the genetic components controlling apomixis
WHAT IS APOMIXIS I HAVE NO IDEA
SOMETHING WITH CROPS?
Plants often have polyploid genomes
Wheat is hexaploid …. and 17Gb (reminder human is 3Gb)
There are 60k wheat lines. (Whoah)
Apomixis!
- clonal reproduction through seeds
- asexual reproduction
Weeping lovegrass
- grass from southern africa
- polyploid
- transcriptome from sexual and asexual
- 4 candidate genes
Make diploid -> pacbio + dovetail -> assembly
OH NO CRAZY VENN DIAGRAM USE AN UpSet YO
Then sequenced tetraploid and the asexual types and mapped back to assembly
4 candidate genes missing in 2x assembly
- hypothesis is that they get dropped / incompatible with sexual state
- no mention of the genes
- functional characteriziation work in arabidopsis and rice in progress
Carla Cummins (Flicek)
Comparative analysis of hundreds of vertebrate genomes in Ensembl
What is Ensembl?
- genome stuffs (my summary)
- ensembl.org
Data won’t stop coming
- Ensembl release 94
- 154 vertebrate genomes
- clade based analysis
- huge jump: 41 teleosts this release
Bottlenecks:
- all vs all methods scale quadratically
- so try to remove redundancies
- work more within in clades
- read/write speeds
- genome in FastA, indexed and read with Bio::DB::HTS (htslib)
- multiple sequence alignment
- do on server (5x faster)
- humans
- automate manual tasks
- infer species tree with Mash and NJ
workflows - rule based (XML) - infer parameters
get more algorithms on linear growth - ProgressCactus (?) - https://github.com/glennhickey/progressiveCactus ?