#GI2018 - Day Two

Day Two

Personal and Medical Genomics

Very sparse and poorly written notes covering #GI2018.

Typos everywhere. Things may change dramatically over time as I scan back through notes.

I’ve tried to respect #notwitter. Will be updated periodically.

Speaker (Lab | Group)

Katie Pollard Keynote

A population genetic view of human chromatin organization

Noncoding human mutation and disease

  • which are causal?
  • how do they alter gene expression?
  • what genes / pathways do they affect?

core idea:

  • motif disruption (i.e. mutation alter binding site)

but, how about we think about genome as structure?

  • if structure (3D folding) is functional, any variation that modifies that would be deleterious

chromatin structure and pop genetics

discussion of TADs / HiC / BE (boundary elements)

compare rare variants in patients, variants in healthy humans, fixed diffs between primates

  • deletions depleted at boundary elements (strong negative selection)
  • also holds up across non-primates
  • less depletion in disease cohort (autism) variants

“suggestion broad role for enhancer hijacking in disease”

noncoding variation assessment should be TAD aware

try to predict how mutations change TAD/BE

  • kind of working (not as “sharp” as the real data)

chromatin and genetic structure (LD) both have similar struture

  • are they correlated?
  • not the same sizes (LD blocks shorter)
  • not correlated at 5kb to 2Mb

chromatin interactions (HiC) more enriched for (compared to closest gene or LD):

  • eQTL
  • GO enrichment terms
  • “most distal noncoding variants do not target closest expressed gene”

sequence motifs fail to explain key aspects of protein-DNA binding

  • 30% of top 2000 ChIP-seq peaks have no sequence motif
  • many DNA binding proteins have very similar motifs but have disparate binding
  • seq motifs don’t capture DNA shape well (major / minor groove)

hypothesis: “DNA binding proteins recognize DNA shape beyond seq. motifs”

shape motifs match up between ENCODE and HT-SELEX

shape motifs can occur without sequence motifs

  • most have both shape and motif patterns
  • some shape only and some motif only

Machine learning in genomics comments (Pollard kindly taking time out her talk to discuss geneeral ML issues)

  • try to model enhancer interaction with konwn enhancers, expressed genes, Hi-C data (TargetFinder)
  • common issues:
    • balanced training data (genomic data usually isn’t)
    • select features from whole dataset (feature selection should be inside cross validation)
    • use biologically appropriate examples
    • use the right statistic (PR, ROC, etc?)
    • most ML models assume observations are independent and identically distributed
    • genomic data is inherently not independent and identically distributed
    • not trivial….different genome regions are inherently different (model doesn’t transfer)

Sri Kosuri (Kosuri)

Functional Impacts Of Rare Genetic Variation On Exome Recognition

Experimental assays (but huge)

multiplexed reporter assays (100k+ experiments) + DNA synthesis to do reverse genomics

https://www.biorxiv.org/content/early/2018/03/10/199927

Average genome has 4-5 million common variants and 50-100K rare variants (<0.5% AF)

how do rare variants exert large effect changes?

  • you see them in GTEx (under and over expression outliers)

Massively parallel exon splicing reporter

  • which mutations cause exon splicing issues
  • assayed 28k across 2,200 exon
  • mostly at splice junctions
  • but you still see plenty in exons and further away
  • in silico predictions are pretty bad (splicing specific models OK)
  • 3.8% (1,505 / 27,900) led to large loss of exon recognition
  • can’t predict them…. (with existing tools)
  • nice to see a whole slide on why this isn’t the most perfect experiment ever
    • short (100nt) exons used (biased)

Patrick Brennan (Nationwide Children’s)

Integration of whole genome, whole exome, and transcriptome pipeline for comprehensive genomic profiling of 60 pediatric cancer subjects

250x WES, 60X WGS (tumor + normal) RNA-seq (tumor only)

7 diff fusion detection algorithms

  • ugh, 7 way venn
  • they focus on the few which are consistently called (or rather ranked on overlap and evidence support)

RNA-seq

  • DESeq, Salmon, then Ingenuity for pathway analysis (TPM as input)

Differential comparison (RNA)

  • but to what???
  • oh the ‘treehouse’ data (not sure what this is)

Case study matching indel in gene with RNA expression difference, which helped to suggest tyrosine kinase inhibitor treatment

Case study IDing fusion gene between MET and RBPMS (chr 7 <-> chr 8)

Keep using ‘Churchill’ term for their software….annoying they don’t credit the tools that actually power it (GATK?)

Kaitlin Samocha (Hurles)

Evaluating the role of rare variation in children with developmental disorders

DDD

  • represent UK well
  • diagnostic rate of 35%

4,300 trios have 94 genes with high gene burden

  • gene test with poisson tests outputting p values for LoF, missense, synonymous

New approach

  • weight different types of mutations separately
  • gene score composed of weighted scores
  • multiple hypothesis correction (use prior knowledge)

where do weights come from?

  • observations from DDD cohort, comparing DDD enrichment by class (LoF, etc) to CADD

new method

  • finds more genes relative to poisson (old) method
  • applying to 10k trios

inherited variation

  • do cases have more rare damaging variants
  • cadd > 21
  • observed once or more ExAC
  • effect size of ~1.7 for probands with affect parents
  • ~1.2 for other probands

causative enriched in high pLI genes

is this from cryptic recessive diagnoses?

  • probably not
  • high pLI genes not enriched for recessive disease genes

digenic inheritance?

  • parents each have damaging mutations (but dif from each other)
  • probably not (no strong enrichment in undiagnosed vs diagnosed)

Tracy Ballinger (Semple)

Modeling Double Strand Breaks to Interrogate Structural Variation in Cancer

SV (structural variation) prevalent across cancer

Nonrandom distribution

  • chromatin structure, sequence composition, and nuclear organization influence SV

built Random Forest model that correlates well with real data

  • Replication seq, open chromatin (DNase), pol2db, most important features
  • predicts well across different cell types

Compare SV prediction to cancer breakpoints

  • ID regions with more or less SV breakpoints than expected
  • also “passenger” regions which are not enriched

cancer ‘coldspots’:

  • enriched for active chromatin and deplted in fragile sites

Lucia Spangenberg (Naya)

Retrieving Charras genomic tracts from the Uruguayan admixed population

Charras all killed by state 187 years ago

80 Uruguayan genomes (Just 10 for this project?)

Q1: Still have Charrua in Uruguayan genomes?

Calculated local ancestry (rel to EUR or AFR)

PCA to see clustering of Native Uruguayan. Also had 20 genomes from Simons

Want to calculate Fst Tree with other ancient/original ameican populations

  • clusters with Guarani and Chane (expected!)

Build protogenome by merging the 10 WGS they did

  • 99% covered

Comparative, Evolutionary, Metagenomics

Ellen Leffler (Kwiatkowski)

Paired sequencing of host and parasite genomes in severe malaria cases

191 million cases / year!

plasmodium falciparum (most commmon in Africa) and vivax (more common in not Africa)

MalariaGEN

  • GWAS for severe malaria cases (falciparum), matched population controls
  • found new structural var in GYPA/B/E

Big Questions:

  • parasite variation?
  • genotype <-> parasite susceptibilty?
  • co-evolution?
  • individual disease risk?

can we sequence parasite DNA from human blood samples?

  • 1 - 5 million parasites per ul blood
  • napkin math gives 80X+ coverage for over 10,000 parasites / ul

align WGS to human and parasite genome

  • mean 197X coverage
  • eyeball median of 100X?

selective amplification of plasmodium falciparum gneome which gives ~50X enrichment

1.5 million SNPs total (parasite) 10 million SNPs total (human)

host-parasite association (looking at merozoite SNPs, which interacts with human)

  • get one nice looking peak in parasite gene MSP4 / MSP2 and human GCTN2 and GBGT1

Luca Penso-Dolfin (Di Palma)

The role of structural variants in the adaptive radiation of African Cichlids

African cichclids are canonical example of adaptive radition

Also, are freshwater fish, if you didn’t know

Hundreds of species, variation in jaw morphology, color, vision, behavior

Most studies focus on gene evolution

Looking at structural variation across three great East African lakes

M. zebra, P. nyererei, A. burtoni, N. brichardi (already sequenced)

One outgroup (O. niloticus)

Talking about SV identification using paired end reads and looking for internal distance and split reads

  • Is he rolling his own SV caller???

FIRST NO_TWITTER!

  • SV pipeline….which uses existing SV tools….not certain why NO_TWITTER…but I won’t give any details

duplication net gain of 9.2 events per million years across the four species

lot fewer inversions

  • about 0.5 events per million years

SECOND NO_TWITTER

  • cool enrichment of some genes / GO terms in inversions

Mario Caccamo (NIAB)

Understanding the genetic components controlling apomixis

WHAT IS APOMIXIS I HAVE NO IDEA

SOMETHING WITH CROPS?

Plants often have polyploid genomes

Wheat is hexaploid …. and 17Gb (reminder human is 3Gb)

There are 60k wheat lines. (Whoah)

Apomixis!

  • clonal reproduction through seeds
  • asexual reproduction

Weeping lovegrass

  • grass from southern africa
  • polyploid
  • transcriptome from sexual and asexual
    • 4 candidate genes

Make diploid -> pacbio + dovetail -> assembly

OH NO CRAZY VENN DIAGRAM USE AN UpSet YO

Then sequenced tetraploid and the asexual types and mapped back to assembly

4 candidate genes missing in 2x assembly

  • hypothesis is that they get dropped / incompatible with sexual state
  • no mention of the genes
  • functional characteriziation work in arabidopsis and rice in progress

Carla Cummins (Flicek)

Comparative analysis of hundreds of vertebrate genomes in Ensembl

What is Ensembl?

Data won’t stop coming

  • Ensembl release 94
  • 154 vertebrate genomes
  • clade based analysis
  • huge jump: 41 teleosts this release

Bottlenecks:

  • all vs all methods scale quadratically
    • so try to remove redundancies
    • work more within in clades
  • read/write speeds
  • multiple sequence alignment
    • do on server (5x faster)
  • humans
    • automate manual tasks
    • infer species tree with Mash and NJ

workflows - rule based (XML) - infer parameters

get more algorithms on linear growth - ProgressCactus (?) - https://github.com/glennhickey/progressiveCactus ?

Related

comments powered by Disqus