#GI2018 - Day Two

Sep 18, 2018 8 min read GI2018, bioinformatics

Day Two
Personal and Medical Genomics
Comparative, Evolutionary, Metagenomics

Day Two

Personal and Medical Genomics

Very sparse and poorly written notes covering #GI2018.

Typos everywhere. Things may change dramatically over time as I scan back through notes.

I’ve tried to respect #notwitter. Will be updated periodically.

Speaker (Lab | Group)

Katie Pollard Keynote

A population genetic view of human chromatin organization

Noncoding human mutation and disease

which are causal?
how do they alter gene expression?
what genes / pathways do they affect?

core idea:

motif disruption (i.e. mutation alter binding site)

but, how about we think about genome as structure?

if structure (3D folding) is functional, any variation that modifies that would be deleterious

chromatin structure and pop genetics

discussion of TADs / HiC / BE (boundary elements)

compare rare variants in patients, variants in healthy humans, fixed diffs between primates

deletions depleted at boundary elements (strong negative selection)
also holds up across non-primates
less depletion in disease cohort (autism) variants

“suggestion broad role for enhancer hijacking in disease”

noncoding variation assessment should be TAD aware

try to predict how mutations change TAD/BE

kind of working (not as “sharp” as the real data)

chromatin and genetic structure (LD) both have similar struture

are they correlated?
not the same sizes (LD blocks shorter)
not correlated at 5kb to 2Mb

chromatin interactions (HiC) more enriched for (compared to closest gene or LD):

eQTL
GO enrichment terms
“most distal noncoding variants do not target closest expressed gene”

sequence motifs fail to explain key aspects of protein-DNA binding

30% of top 2000 ChIP-seq peaks have no sequence motif
many DNA binding proteins have very similar motifs but have disparate binding
seq motifs don’t capture DNA shape well (major / minor groove)

hypothesis: “DNA binding proteins recognize DNA shape beyond seq. motifs”

ShapeMF (DNAshape R package)
compare to classic motif with ENCODE ChIP-seq and HT-SELEX

shape motifs match up between ENCODE and HT-SELEX

shape motifs can occur without sequence motifs

most have both shape and motif patterns
some shape only and some motif only

Machine learning in genomics comments (Pollard kindly taking time out her talk to discuss geneeral ML issues)

try to model enhancer interaction with konwn enhancers, expressed genes, Hi-C data (TargetFinder)
common issues:
- balanced training data (genomic data usually isn’t)
- select features from whole dataset (feature selection should be inside cross validation)
- use biologically appropriate examples
- use the right statistic (PR, ROC, etc?)
- most ML models assume observations are independent and identically distributed
- genomic data is inherently not independent and identically distributed
- not trivial….different genome regions are inherently different (model doesn’t transfer)

Sri Kosuri (Kosuri)

Functional Impacts Of Rare Genetic Variation On Exome Recognition

Experimental assays (but huge)

multiplexed reporter assays (100k+ experiments) + DNA synthesis to do reverse genomics

https://www.biorxiv.org/content/early/2018/03/10/199927

Average genome has 4-5 million common variants and 50-100K rare variants (<0.5% AF)

how do rare variants exert large effect changes?

you see them in GTEx (under and over expression outliers)

Massively parallel exon splicing reporter

which mutations cause exon splicing issues
assayed 28k across 2,200 exon
mostly at splice junctions
but you still see plenty in exons and further away
in silico predictions are pretty bad (splicing specific models OK)
3.8% (1,505 / 27,900) led to large loss of exon recognition
can’t predict them…. (with existing tools)
nice to see a whole slide on why this isn’t the most perfect experiment ever
- short (100nt) exons used (biased)

Patrick Brennan (Nationwide Children’s)

Integration of whole genome, whole exome, and transcriptome pipeline for comprehensive genomic profiling of 60 pediatric cancer subjects

250x WES, 60X WGS (tumor + normal) RNA-seq (tumor only)

7 diff fusion detection algorithms

ugh, 7 way venn
they focus on the few which are consistently called (or rather ranked on overlap and evidence support)

RNA-seq

DESeq, Salmon, then Ingenuity for pathway analysis (TPM as input)

Differential comparison (RNA)

but to what???
oh the ‘treehouse’ data (not sure what this is)

Case study matching indel in gene with RNA expression difference, which helped to suggest tyrosine kinase inhibitor treatment

Case study IDing fusion gene between MET and RBPMS (chr 7 <-> chr 8)

Keep using ‘Churchill’ term for their software….annoying they don’t credit the tools that actually power it (GATK?)

Kaitlin Samocha (Hurles)

Evaluating the role of rare variation in children with developmental disorders

DDD

represent UK well
diagnostic rate of 35%

4,300 trios have 94 genes with high gene burden

gene test with poisson tests outputting p values for LoF, missense, synonymous

New approach

weight different types of mutations separately
gene score composed of weighted scores
multiple hypothesis correction (use prior knowledge)

where do weights come from?

observations from DDD cohort, comparing DDD enrichment by class (LoF, etc) to CADD

new method

finds more genes relative to poisson (old) method
applying to 10k trios

inherited variation

do cases have more rare damaging variants
cadd > 21
observed once or more ExAC
effect size of ~1.7 for probands with affect parents
~1.2 for other probands

causative enriched in high pLI genes

is this from cryptic recessive diagnoses?

probably not
high pLI genes not enriched for recessive disease genes

digenic inheritance?

parents each have damaging mutations (but dif from each other)
probably not (no strong enrichment in undiagnosed vs diagnosed)

Tracy Ballinger (Semple)

Modeling Double Strand Breaks to Interrogate Structural Variation in Cancer

SV (structural variation) prevalent across cancer

Nonrandom distribution

chromatin structure, sequence composition, and nuclear organization influence SV

built Random Forest model that correlates well with real data

Replication seq, open chromatin (DNase), pol2db, most important features
predicts well across different cell types

Compare SV prediction to cancer breakpoints

ID regions with more or less SV breakpoints than expected
also “passenger” regions which are not enriched

cancer ‘coldspots’:

enriched for active chromatin and deplted in fragile sites

Lucia Spangenberg (Naya)

Retrieving Charras genomic tracts from the Uruguayan admixed population

Charras all killed by state 187 years ago

80 Uruguayan genomes (Just 10 for this project?)

Q1: Still have Charrua in Uruguayan genomes?

Calculated local ancestry (rel to EUR or AFR)

PCA to see clustering of Native Uruguayan. Also had 20 genomes from Simons

Want to calculate Fst Tree with other ancient/original ameican populations

clusters with Guarani and Chane (expected!)

Build protogenome by merging the 10 WGS they did

99% covered

Comparative, Evolutionary, Metagenomics

Ellen Leffler (Kwiatkowski)

Paired sequencing of host and parasite genomes in severe malaria cases

191 million cases / year!

plasmodium falciparum (most commmon in Africa) and vivax (more common in not Africa)

MalariaGEN

GWAS for severe malaria cases (falciparum), matched population controls
found new structural var in GYPA/B/E

Big Questions:

parasite variation?
genotype <-> parasite susceptibilty?
co-evolution?
individual disease risk?

can we sequence parasite DNA from human blood samples?

1 - 5 million parasites per ul blood
napkin math gives 80X+ coverage for over 10,000 parasites / ul

align WGS to human and parasite genome

mean 197X coverage
eyeball median of 100X?

selective amplification of plasmodium falciparum gneome which gives ~50X enrichment

1.5 million SNPs total (parasite) 10 million SNPs total (human)

host-parasite association (looking at merozoite SNPs, which interacts with human)

get one nice looking peak in parasite gene MSP4 / MSP2 and human GCTN2 and GBGT1

Luca Penso-Dolfin (Di Palma)

The role of structural variants in the adaptive radiation of African Cichlids

African cichclids are canonical example of adaptive radition

Also, are freshwater fish, if you didn’t know

Hundreds of species, variation in jaw morphology, color, vision, behavior

Most studies focus on gene evolution

Looking at structural variation across three great East African lakes

M. zebra, P. nyererei, A. burtoni, N. brichardi (already sequenced)

One outgroup (O. niloticus)

Talking about SV identification using paired end reads and looking for internal distance and split reads

Is he rolling his own SV caller???

FIRST NO_TWITTER!

SV pipeline….which uses existing SV tools….not certain why NO_TWITTER…but I won’t give any details

duplication net gain of 9.2 events per million years across the four species

lot fewer inversions

about 0.5 events per million years

SECOND NO_TWITTER

cool enrichment of some genes / GO terms in inversions

Mario Caccamo (NIAB)

Understanding the genetic components controlling apomixis

WHAT IS APOMIXIS I HAVE NO IDEA

SOMETHING WITH CROPS?

Plants often have polyploid genomes

Wheat is hexaploid …. and 17Gb (reminder human is 3Gb)

There are 60k wheat lines. (Whoah)

Apomixis!

clonal reproduction through seeds
asexual reproduction

Weeping lovegrass

grass from southern africa
polyploid
transcriptome from sexual and asexual
- 4 candidate genes

Make diploid -> pacbio + dovetail -> assembly

OH NO CRAZY VENN DIAGRAM USE AN UpSet YO

Then sequenced tetraploid and the asexual types and mapped back to assembly

4 candidate genes missing in 2x assembly

hypothesis is that they get dropped / incompatible with sexual state
no mention of the genes
functional characteriziation work in arabidopsis and rice in progress

Carla Cummins (Flicek)

Comparative analysis of hundreds of vertebrate genomes in Ensembl

What is Ensembl?

genome stuffs (my summary)
ensembl.org

Data won’t stop coming

Ensembl release 94
154 vertebrate genomes
clade based analysis
huge jump: 41 teleosts this release

Bottlenecks:

all vs all methods scale quadratically
- so try to remove redundancies
- work more within in clades
read/write speeds
- genome in FastA, indexed and read with Bio::DB::HTS (htslib)
multiple sequence alignment
- do on server (5x faster)
humans
- automate manual tasks
- infer species tree with Mash and NJ

workflows - rule based (XML) - infer parameters

get more algorithms on linear growth - ProgressCactus (?) - https://github.com/glennhickey/progressiveCactus ?

conference GI2018 bioinformatics talks