- Day 1 - Session 1 - GENOME STRUCTURE AND FUNCTION
- Comparative 3D genome organization in Apicomplexan parasites
- Unscrambling the tumor genome via integrated analysis of structural variation and copy number
- Tissue-specific enhancer functional networks for associating distal regulatory regions to disease
- Targeted Nanopore sequencing with Cas9 for studies of methylation, structural variants, and mutations
- Long-read sequencing of structurally variant genomes
- Exploring the 3D spatial dependency of gene expression using Markov random fields
- Mapping cis-eQTL from RNA-seq data with no genotypes
- Exploring short tandem repeat expansions at both known and novel loci in the human genome
Day 1 - Session 1 - GENOME STRUCTURE AND FUNCTION
Genome Informatics 2019 at CSHL
Bold is the speaker
349 attendees. Largest?
189 posters
39 (?) talks
90 countries, 20 (US) states
High(est) % female, URM for gi
If you dislike/disagree with my notes/sentiment and you are the speaker/PI then contact me. I very much could be mis-understanding some important points.
Comparative 3D genome organization in Apicomplexan parasites
Evelien M. Bunnik, Aarthi Venkat, Ferhat Ay, Karine G. Le Roch.
Apicomplexan parasites:
- toxoplasma gondii
- babesia
- most prevalent nad morbidity causing pathogens
Try to understand at genomic level
Malaria (plasmodium) genome published 18 years ago
- 82% AT rich
- grab bag of major -omics techniques
- still poor understanding of gene regulation
- believe under-representation of TFs identified (lower rate than human, yeast)
Did Hi-C (chrmatin conformation capture)
- at many stages
- 100kb resolution (at least for later work?)
- going through their literature of chromosome organization (Ay, Bunnik Genome Research 2014)
Virulence Genes
- mediates adherence of infected RGC to endothelium of infected RBC
- altered chromatin structure at some of these genes
More examples of loops for other pathogenesis processes
- see power of doing many stages (stage specific patterning of contact maps)
Did more Hi-C across other parasites, unfortunately had to re-assemble several genomes
- Bunnik PNAS 2019
- more expression at genes near (in 3D space) telomore
- which is where many virulence genes are …
Unscrambling the tumor genome via integrated analysis of structural variation and copy number
Daniel L. Cameron, Charles Shale, Jonathan Baber, Anthony T. Papenfuss, Peter Priestly.
Hartwig has a fully automated oncology report (“no touch”)
- needs to be highly sensitive and specific
- use 110x tumor, 45x germline
- SV and CNV calling not ideal
GRIDSS -> PURPLE -> LINX
SV-> CNV -> Interpretation
https://github.com/PapenfussLab/gridss
full pipeline: https://github.com/hartwigmedical/gridss-purple-linx
work on phasing SV (cis vs trans vs unknown)
- only good up to about 300bp
- but events are clustered, so works better than you’d expect
Benchmarking (includes GRIDSS)
Tissue-specific enhancer functional networks for associating distal regulatory regions to disease
Xi Chen, Jian Zhou, Chandra L. Theesfeld, Ran Zhang, Aaron K. Wong, Olga G. Troyanskaya.
How to predict function of enhancers?
Different ways enhancers associated with disease:
- enhancer -> gene (near)
- enhancer(with SNP inside)
- enhancer(rare mutation inside)
- enhance without direct evidence
Hypothesis:
- enhancer will have a functional relationship with another enhancer if:
- 3d interaction
- co-activation with TF
- regulate related genes
- if all met, then maybe disease associated
bayesian inference to model the above assumptions
how to see if this sort of works?
- check brain specific super enhancers in non-brain tissues
- get some enrichment
https://fenrir.flatironinstitute.org
- web tool
- no preprint or paper?
dmcg note:
- seems like you’d really want some en-masse validation. validating a few sites doesn’t seem like enough.
- i missed what the input data for these models are (public stuff? Internal stuffs?)
Targeted Nanopore sequencing with Cas9 for studies of methylation, structural variants, and mutations
Timothy E. Gilpatrick, Isac Lee, Fritz J. Sedlazek, Winston Timp.
Want to enrich regions for genome
ROI -> cut out with CRISPR (cas9 on both sides of ROI) -> dA tail -> adapter ligate
- oh even cutting just once helps with enrichment. cool.
- can target 10 sites simulateously (400X+ coverage with MinION)
Tried with cancer model
On-target perentage (reads) is about 1-4%
- most off-target randomly distributed
- very little off-target cutting
5% error rate!
- homopolymers sucks
- guppy v3 does better job at SNV calling
did downsampling test to check SNV performance
- tools
- mpileup
- Nanopolish (Simpson et al. ?)
- Medaka (ONT)
- Clair (Luo et al.)
- can get 95% sesnitivity…doesn’t seem to great
- Clair gets worse with more coverage (trained on 25X apparently)
HE IS GOING SO FAST OMG
More discussion of benchmarking - still in SNV calling
Now we are talking phasing
- in TP53 find all true variants, no FP
Suggestion of CNV with phased data, more copies of one allele (60X vs 20X)
Now CpG
- enriched stuff can still get detectable CpG calls
SV & Repetitive
- still works
dmcg note:
- everything still works (isn’t that expected?)
- on-target percentage VERY low (1-4% means you are “throwing” lots of data away)
- and yield is getting hammered…so doubly bad
- not doing anything like getting hundreds of genes…which I would think would be really necessary to be useful clinically
- size smaller? (length of DNA reads)
Long-read sequencing of structurally variant genomes
Evan E. Eichler.
4% duplication
49% of normal CNV maps to this 4% of genome
Long reads increase resolution of SV
SV detection comaprison between Illumina and PacBio (30k with PB, 11k with Illumina)
- 70% of SV missed with just Illumina data
Did SV analysis with LR seq method across diverse human genomes
- found near 100k resolved SV
- exciting to see a plateau of novel SV! (east africa most unique, as expected)
Improve SV calling with short read data with improved genomes
- can find new eQTL
Goal: t2t (telomere 2 telomere for chromosome)
- t2t consortium
Solution:
- HiFi PacBio 99.9% acccurate 18kbp
- ONT gives you ultra long (whale!) reads
HiFi & Strandseq(?) to build referene-free phased human genomes
hifi & canu -> saarclust & strandseq -> deepvariant -> whatshap & strandseq -> canu/peregrine
assemboy -> contigs -> snp calls -> haplotype CCS reads -> assembly
23 chr (“clusters”)
95% SNP phase
N50 = 28 Mbp
problematic regions
- canu and peregrine fail
- 250 in count
- 60% are supplemental dups
- 18% SV
- 23% acrocentric
solutions
- build graghs around high dup regions of human genome
- “SDA” assembler
Long reads getting centrome seq of chr8 (t2t consortium)
Exploring the 3D spatial dependency of gene expression using Markov random fields
Naihui Zhou, Iddo Friedberg, Mark S. Kaiser.
The quantitative expression of gene regulation could manifest as spatial dependency
Hypothesis:
- TF factories (clustering of genes for active transcription)
- hub-enhancers (shared enhancers for many genes)
Propose adding spatial location of genes to improve modeeling of gene-gene variation (which is already modelled in most diff expression analyses, e.g. limma)
Poisson Hierarchical Markov Random Field Model (PHiMRF)
- uses HiC data to get pairwise gene interactions
- two genes called neighbors if HiC interaction > threshold
- making a spatial gene network
- genes are nodes, edges are HiC interaction
New random variable Yik which follows binomial dist (? missed this)
Do genes in TAD have spatial dependencies?
- but how to deal with interTAD and intraTAD (within and between)?
- run model with both types of edges
- Yes…more connected than bootstraping tests
davemcg note: Still waiting to see if/how this improves diff expression?
Enriched GO terms for most conncted gene groups
- tx by RNA polII
- GPCR signalling
- negative cell proliferation
R package! https://github.com/ashleyzhou972/PhiMRF
Mapping cis-eQTL from RNA-seq data with no genotypes
Elena Vigorito, Simon White, Chris Wallace.
Can’t always get genotypes….so would like eQTL without them….
Impute cis-SNP (Gi)
negative binomial weighted by probable genotype
work in 500kb window to compute gene - SNP pairings
benchmark against GEUVADIS reference data
- only chr22 (why? performance?)
- better sensitivity and specificity
R package out “soon”
only method for eQTL without genotypes
Exploring short tandem repeat expansions at both known and novel loci in the human genome
Harriet Dashnow, Brent Pedersen, Daniel MacArthur, Alicia Oshlack, Aaron Quinlan.
@hdashnow (twitter handle! good!)
STR is 1-6bp repeats. Microsatellites.
3% of human genome with high mutation rate and high polymorphism
Huntington’s, fragile X, spinocerebellar ataxias
Classically, longer is worse prognosis / higher severity
STR alleles are WAY longer than short reads
Harriet looking for NOVEL (str NOT IN REFERENCE GENOME)
- not so many methods fornovel (just ExpansionHunter? as of 2 months ago)
STR reads often mis-mapped
- read STR…where does it go???
- sends it to just whatever STR or nowhere if really novel
STRling!
- https://github.com/quinlan-lab/STRling
- use kmers to find STR
- if read passes kmer threahold, then is a STR … then re-map
- for longer pairs use linear regression to predict allele size
Validated acdross 9 (7 worked) disease expansions
Found 5/5 novel expansions in CANVAS disease
Also validation with PacBio
Limitations! (yay! seriously love this)
- limited to large (>50 bp) STR expansions