- Day 2 - Session 2 - SEQUENCING ALGORITHMS, VARIANT DISCOVERY AND GENOME ASSEMBLY
- Genomic sketching with HyperLogLog
- centroFlye—Assembling centromeres with long error-prone reads
- Genotyping structural variants in pangenome graphs using the vg toolkit
- Rapidly mapping raw nanopore signal with UNCALLED to enable real-time targeted sequencing
- The construct and utility of reference pan-genome graphs
- PRINCESS — A framework for comprehensive detection and phasing of SNPs and structural variants
- Efficient chromosome-scale haplotype-resolved assembly of human individuals
- Utilization of an ensemble approach for identification of driver fusions in pediatric cancer
Day 2 - Session 2 - SEQUENCING ALGORITHMS, VARIANT DISCOVERY AND GENOME ASSEMBLY
Genome Informatics 2019 at CSHL
Bold is the speaker
If you dislike/disagree with my notes/sentiment and you are the speaker/PI then contact me. I very much could be misunderstanding some important points.
Genomic sketching with HyperLogLog
Daniel N. Baker, Ben Langmead.
Slides at…crap missed it. Email Ben I guess?
Sketching
- algorithm to collapse genomes into summary info
- fasta -> kmers (“shingle”) -> sample (smartly)
- showing set relationships
- the “minimum” of a set of sketches gives you useful info on the range you would expect
Unions and Intersections
- space of possible coincidences requires multiple samples to get useful info
- bottom 3?
- or minimum in 3 partitions?
MinHash uses bottom-k approach (Ondov / Phillippy)
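To make the bottom-k idea concrete, here is a minimal sketch of a bottom-k MinHash in Python. It is not the Mash or Dashing implementation: the function names, the choice of BLAKE2b as the hash, and the default `k`/`s` values are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer ("shingle") of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Deterministic 64-bit hash (a stand-in for a fast non-cryptographic hash)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def bottom_k_sketch(seq, k=5, s=64):
    """Keep only the s smallest distinct hash values: the bottom-k sketch."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s=64):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the s smallest hashes of the merged sketches, then ask what
    fraction of them is present in both input sketches.
    """
    a, b = set(sketch_a), set(sketch_b)
    merged = sorted(a | b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)
```

Only the sketches (at most `s` integers per genome) are needed to compare two genomes, which is the whole point of sketching.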
Tremendous talk - I feel like I understand what is happening. Invite Ben to talk at your place…I’d imagine you’d get a great talk.
OK, so take log log minimum
Daniel Baker wrote:
- bit.ly/dash_pre
- github.com/dnbaker/dashing
- github.com/dnbaker/sketch
HyperLogLog (HLL)
- k partition
- log2n
- exponent
- average, bias correction
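The four steps above (partition, leading-zero count, exponent, averaged and bias-corrected estimate) can be sketched roughly as follows. This is a textbook-style HyperLogLog in plain Python, not the vectorized code in dashing; the class name, the BLAKE2b hash, the default `p`, and the inclusion-exclusion `jaccard` helper are assumptions for illustration.

```python
import hashlib
import math


def hash64(item):
    """Deterministic 64-bit hash of a string."""
    return int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of partitions (registers)
        self.registers = [0] * self.m

    def add(self, item):
        h = hash64(item)
        idx = h >> (64 - self.p)                   # top p bits choose the partition
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # rank = leading zeros in the remaining bits, plus one
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)  # harmonic mean
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)         # small-range correction
        return raw

    def union(self, other):
        """The sketch of a union is the element-wise max of the registers."""
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out


def jaccard(a, b):
    """Jaccard via inclusion-exclusion on the three cardinality estimates."""
    union = a.union(b).cardinality()
    if union == 0:
        return 0.0
    inter = a.cardinality() + b.cardinality() - union
    return max(0.0, inter / union)
```

Each register only stores a small integer (the exponent), which is where the memory savings relative to storing full hash values come from.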
Highly vectorized (for speed! Massively parallel calculation)
- only on intel?
Comparing to MinHash
- HLL deals with lopsided sets better (bottom-k is not ideal there; taking the minimum within each of k partitions works better)
- Uses WAY less memory and wall clock
Future stuffs
- multi-k (kmer lengths)
- weighted jaccard
centroFlye—Assembling centromeres with long error-prone reads
Andrey Bzikadze, Pavel A. Pevzner.
@AndreyBzikadze
biorxiv preprint (missed link)
brings up t2t consortium (telomere to telomere)
github.com/nanopore-wgs-consortium/CHM13
mentions karen miga’s chrX t2t biorxiv preprint
centromere:
- assembler’s favorite region
- probably not, but keeps them gainfully employed for now
- highly repetitive….etc etc.
- 3% of genome - and because they are unassembled, we have a poor idea of how they influence human disease
centroFlye
- classify reads (prefix…something…suffix… what is going on? Andrey already moved on)
- guessing something to do with beginning / middle / end of repeat?
- find rare k-mers that can be used to anchor assembly
- but rare k-mers are usually errors
- if the distance between two rare k-mers is conserved then they are likely real
- neat idea!
- but I’m not certain what conservation is…
- throw everything away except for rare k-mers
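Since I was not sure what "conserved distance" meant, here is one possible reading as a small Python sketch: a pair of rare k-mers is trusted only if it shows up in several reads at exactly the same spacing. This is my guess at the idea, not the centroFlye algorithm; all function names and thresholds are hypothetical, and real error-prone reads would need a distance tolerance rather than exact equality.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmer_positions(read, k):
    """Map each k-mer to the position of its first occurrence in the read."""
    pos = {}
    for i in range(len(read) - k + 1):
        pos.setdefault(read[i:i + k], i)
    return pos


def rare_kmers(reads, k, min_count=2, max_count=3):
    """k-mers seen in only a few reads (each read counts once per k-mer)."""
    counts = Counter(km for read in reads for km in kmer_positions(read, k))
    return {km for km, c in counts.items() if min_count <= c <= max_count}


def conserved_pairs(reads, k, min_count=2, max_count=3, min_support=2):
    """Pairs of rare k-mers whose spacing is identical in every read holding both."""
    rare = rare_kmers(reads, k, min_count, max_count)
    distances = defaultdict(list)
    for read in reads:
        hits = sorted((p, km) for km, p in kmer_positions(read, k).items() if km in rare)
        for (p1, km1), (p2, km2) in combinations(hits, 2):
            distances[(km1, km2)].append(p2 - p1)
    return {
        pair: dists[0]
        for pair, dists in distances.items()
        if len(dists) >= min_support and len(set(dists)) == 1
    }
```

A k-mer created by a sequencing error tends to appear in a single read, so it never gets enough support to form a conserved pair.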
centroFlye used to improve chrX centromere assembly
- greater length
- but how do you more robustly define success?
- one metric is to count shared (or discrepant) k-mers between assembly and a read
- discordance(A, B) = sharedRead(assembly A) - sharedRead(assembly B)
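A minimal sketch of that metric, assuming "shared" means the number of distinct k-mers a read has in common with an assembly (the talk may have counted or normalized differently; the function names are mine):

```python
def kmer_set(seq, k):
    """All distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(read, assembly, k):
    """Number of distinct k-mers the read shares with the assembly."""
    return len(kmer_set(read, k) & kmer_set(assembly, k))


def discordance(read, assembly_a, assembly_b, k):
    """Positive: the read supports assembly A; negative: it supports assembly B."""
    return shared_kmers(read, assembly_a, k) - shared_kmers(read, assembly_b, k)
```

Summing this over all reads covering a region would give a rough vote on which of two candidate assemblies the raw data prefers.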
Genotyping structural variants in pangenome graphs using the vg toolkit
Jean Monlong, Glenn Hickey, David Heller, Jonas Andreas Sibbesen, Jouni Siren, Jordan Eizenga, Eric T. Dawson, Erik Garrison, Adam Novak, Benedict Paten.
Yo time for graphs!
Intro on value over linear genomes
- highly polymorphic regions should have better genotyping
- points out graphs not for discovery but for genotyping
Increasing n of SV catalogs with long read data
- HGSVC (3 samples)
- SVPOP (15)
- GiaB (1)
Probably lots of missing SVs from 50-500bp (comparing HGSVC against gnomAD-SV, which is short-read based)
github.com/vgteam/vg
vg toolkit can make graphs, map reads, and call variants
“can we genotype SVs from short-read datasets with vg toolkit”
Build graph with long-read data, genotype SVs against it with short-reads
R package to evaluate calls
- github.com/jmonlong/sveval
with simulated data
- vg vs Paragraph vs BayesTyper vs Delly vs SVTyper (last two are non-graph tools)
- BayesTyper does best (not their tool)
- f1 score
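For reference, the F1 score is the harmonic mean of precision and recall. A tiny Python sketch, assuming true-positive, false-positive and false-negative SV counts have already been computed by some matching step (this is not the sveval code, which is an R package):

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from call counts; empty denominators give 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The hard part in practice is deciding when a called SV "matches" a truth SV, which this sketch deliberately leaves out.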
with real data from HGSVC
- all do worse
simple repeat regions are hard to genotype
- line, sine, alu,
- for both insertion/deletion
hard to work with VCF: the same variant can have several equivalent representations, and the format oversimplifies
- just use the assemblies?!
testing with yeast right now
vg performance best/near best
Rapidly mapping raw nanopore signal with UNCALLED to enable real-time targeted sequencing
Sam Kovaka, Yunfan Fan, Winston Timp, Michael C. Schatz.
Don’t want unwanted DNA sequence
- especially for low throughput sequencers (like ONT)
- targeted seq techniques not ideal for ONT
- length not enough
- erase DNA info (methylation, etc.)
ReadUntil
- selectively start/stop
- davemcg: uh, does this actually work yet?
- was announced a year ago or more??
UNCALLED
- utility for nanopore current alignment to large expanses of DNA
- novel streaming algorithm which maps raw nanopore signal in real-time
- works with raw nanopore output (electrical signal)
- discussing kmer matching to nanopore output
- it looks…hard
- insane slide discussing algorithm/implementation
- i am not equipped to summarize
ReadUntil modes
- enrichment (eject a read if it does not map)
- depletion (the converse)
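The two modes boil down to a single decision per read. A toy Python sketch of that decision, assuming the mapper has already reported whether the first chunk of signal maps to the supplied reference; the function and mode names are hypothetical and this is not the ONT ReadUntil API or the UNCALLED code:

```python
def read_until_decision(maps_to_reference, mode="enrich"):
    """Decide what to do with a read after its first chunk has been mapped.

    enrich:  the reference holds the targets, so eject reads that do NOT map.
    deplete: the reference holds unwanted sequence, so eject reads that DO map.
    """
    if mode == "enrich":
        return "keep" if maps_to_reference else "eject"
    if mode == "deplete":
        return "eject" if maps_to_reference else "keep"
    raise ValueError(f"unknown mode: {mode}")
```

The decision has to be made while the molecule is still in the pore, which is why mapping the raw signal quickly matters so much.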
Longer reads get more enrichment
- saves on seq cost and shorter reads hard to map
Testing with bacteria
- 4.5x enrichment of on-target
- 0.4x off-target
Doing a human cancer “panel” for SV
- 28 genes
- overall 3X enrichment
- and can assemble the genes with much higher success rate
Future
- want to improve yield
- is ssDNA “knotting” and blocking ejection?
- better API from ONT?
github.com/skovaka/UNCALLED
The construct and utility of reference pan-genome graphs
Heng Li.
pangenome = collection of genomes
- graph (collapse similar seq)
- or
- compressed full-text index
10 years ago (review with Nils Homer): “alignment against multiple genomes will become increasingly important”
“hasn’t happened yet”
vcf doesn’t handle graphs
GFA format (assembly format)
- davemcg note: looks like a directed(?) acyclic graph
- not good because if you split a segment, the coordinate changes
Proposal:
- reference GFA (GFA with tags that have coord info and version)
- start with GRCh38, incrementally add other genomes
- blacklist and decoy seq for linear tools
- updates to preserve coords
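To see why coordinate tags help, here is a small Python sketch in which each segment carries a stable sequence name and offset, so splitting a segment changes its ID but not the coordinates it reports. The `SN`/`SO`/`SR` tag names follow my recollection of the rGFA description in the minigraph repository and were not spelled out in the talk notes, so treat them (and the `Segment` class and `split_segment` helper) as assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    seg_id: str          # graph-local segment ID (can change between builds)
    seq: str
    stable_name: str     # name of the sequence the segment came from, e.g. "chr1"
    stable_offset: int   # 0-based start on that sequence
    rank: int            # 0 for the linear reference, higher for added assemblies

    def to_gfa(self):
        """Render as a GFA S-line with the (assumed) stable-coordinate tags."""
        return "\t".join([
            "S", self.seg_id, self.seq,
            f"SN:Z:{self.stable_name}",
            f"SO:i:{self.stable_offset}",
            f"SR:i:{self.rank}",
        ])


def split_segment(seg, at, left_id, right_id):
    """Split a segment at offset `at`; stable coordinates are carried over."""
    if not 0 < at < len(seg.seq):
        raise ValueError("split point must fall strictly inside the segment")
    left = Segment(left_id, seg.seq[:at], seg.stable_name, seg.stable_offset, seg.rank)
    right = Segment(right_id, seg.seq[at:], seg.stable_name,
                    seg.stable_offset + at, seg.rank)
    return left, right
```

With plain GFA the only coordinate is "segment ID plus offset", so a split invalidates every annotation; with the extra tags, a position such as chr1:103 still resolves after the split.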
Incrementally add new assemblies
- add assembly, make graph
- then add another assembly, make new graph
Discussion of alignment linear seq to graph…which I don’t quite follow
- but ideal approach too slow
- so approximate with k shortest paths
minigraph
- based on minimap2
- limitations!
- doesn’t work with dense graphs (too many k paths?)
- 1.5 hours over 24 CPUs with human graph of 20 haplotypes
- 36k bubbles
- 94% of GRCh37 is invariable
- https://github.com/lh3/minigraph (didn’t give link but it’s on his github)
multi-allelic regions and minisatellites are hard to assemble and genotype
applications
- blacklist regions (SV, etc)
PRINCESS — A framework for comprehensive detection and phasing of SNPs and structural variants
Medhat Mahmoud, Winston Timp, Fritz J. Sedlazeck.
Burn. Winston Timp not on the slide.
@MedhatHelmy7
Points out there are diff tech to detect SNV, SV, Phasing, Methylation
Would prefer just use one platform
- ONT?
PRINCESS
- framework to integrate tools to analyze Long Reads
- mapping: minimap2 and/or NGMLR
- SNVs: Clair (deep NN) - first NN mention?
- SVs: Sniffles
- Phasing: WhatsHap, Princess-subtools
- can also use trio info (SNPs)
- Nanopolish for Methyl C
- outputs….statistics!
- runs on workstation or hpc or cloud
- outputs phased SNVs and SVs, with optional MethylC
Benchmarking with GiaB HG002
- PacBio CCS, PacBio CLR, ONT
- LOOOOOONG tail for ONT (read length dist)
- remove <500bp in length
- try diff coverage (10x, 25x, 50x, 95x)
- with one tech (70-80% sensitivity, 85-93% precision for SNP)
- PacBio CCS > CLR > ONT
- SV calling
- CCS > CLR == ONT (80% sensitivity, 85-95% precision)
- Phasing
- Use parental SNPs
- N50 better with ONT (loooong reads)
- but better accuracy with PacBio
Now real data
- Common Disease Genomics
- focus on cardiovascular and … something else missed it
- 44k+ short read WGS -> do some with long read -> merge to make comprehensive genome -> which can be used to improve catalogues
- SVcollector (github.com/fritzsedlazeck/SVCollector)
- then Princess (github.com/MeHelmy/princess)
- then Paragraph (github.com/Illumina/paragraph)
- uh, what is paragraph
- from github: " A graph-based structural variant genotyper for short-read sequence data."
- oh, that makes sense
- so use 44k short reads on graph?
Efficient chromosome-scale haplotype-resolved assembly of human individuals
Shilpa Garg, Arkarachai Fungtammasan, Anthony Schmitt, Andrew Carroll, Paul Peluso, Emily Hatas, Fritz Sedlazeck, Justin Zook, Mike Chou, John Aach, Jason Chin, Heng Li, George Church.
Discussing PacBio HiFi tech
- circularize DNA
- 15k long
- 99% with 5 passes
Now Hi-C discussion/intro
- wild how common Hi-C is. I guess I’m getting old - in grad school this stuff was just IMPOSSIBLE to do
now genome assembly discussion
- graph generation …. getting longer reads gives you fewer branches on your graph (contiguous assembly)
PacBio alone not enough for full assembly/phasing
- but can use trios with Koren/Phillippy approach (at least works in goats)
- trio canu
- but want full haplotypes without trios
workflow
- PacBio CCS
- make contigs
- use Hi-C to scaffold
- peregrine -> HiRise -> deepvariant -> whatshap + hapcut2 => whatshap -> peregrine
- walltime: one day
benchmark with PGP1, HG002, NA12878
- contig n50, NGA50, phasing hamming error, phasing switch error
- compare against trio canu
- generally better than trio canu across all metrics
- consistent across diff genomes
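A quick Python sketch of three of the benchmark metrics, assuming contigs are given as a list of lengths and a phase block is given as a string of 0/1 alleles for haplotype 1 at each heterozygous site. NGA50 is left out because it needs alignments to a reference; the function names and the string encoding are my own choices, not those of any particular benchmarking tool.

```python
def n50(lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def switch_errors(truth, called):
    """Count places where the called phase flips relative to the truth."""
    agree = [t == c for t, c in zip(truth, called)]
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)


def hamming_errors(truth, called):
    """Mismatching sites, taking the better of the two haplotype labelings."""
    mismatches = sum(t != c for t, c in zip(truth, called))
    return min(mismatches, len(truth) - mismatches)
```

One long-range switch in the middle of a block costs a single switch error but roughly half the sites in Hamming terms, which is why both numbers get reported.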
checking in HLA region
- can build HLA in two contigs
Link!
- github.com/shilpagarg/WHdenovo
Utilization of an ensemble approach for identification of driver fusions in pediatric cancer
Stephanie LaHaye, Kyle Voytovich, James Fitch, Natalie Bir, Sean D. McGrath, Anthony Miller, Amy Wetzel, Vincent Magrini, Catherine E. Cottrell, Elaine R. Mardis, Richard K. Wilson.
Last talk of the session!
ID fusions in cancer (pediatric)
Interesting….pediatric cancers have fewer mutations (which I guess makes sense)
- SN Grobner et al Nature 2018
- fusions are helpful to classify cancers
- difficult via RNA-seq
Used 4 (6?) different callers for an ensemble approach
- oh god a multi venn diagram
- USE AN UpSET plot!!!!
- anyways, not much agreement (but low n is a feature if they are true)
- not clear how they actually know most are false positive
- specificity != sensitivity
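One simple way to build an ensemble is to keep fusions reported by at least some number of callers. The Python sketch below shows only that voting step; I do not know how the speakers actually combined or filtered their callers, and the caller names and the `min_callers` threshold are placeholders.

```python
from collections import Counter


def consensus_fusions(calls_by_caller, min_callers=2):
    """Return {fusion: number of supporting callers} for well-supported fusions.

    calls_by_caller maps a caller name to an iterable of (five_prime_gene,
    three_prime_gene) tuples. Each caller votes at most once per fusion.
    """
    support = Counter()
    for fusions in calls_by_caller.values():
        for fusion in set(fusions):
            support[fusion] += 1
    return {fusion: n for fusion, n in support.items() if n >= min_callers}


# Illustrative input only; these calls are made up for the example.
example_calls = {
    "caller_a": [("EWSR1", "FLI1"), ("GENE1", "GENE2")],
    "caller_b": [("EWSR1", "FLI1"), ("ETV6", "NTRK3")],
    "caller_c": [("EWSR1", "FLI1"), ("ETV6", "NTRK3"), ("ETV6", "NTRK3")],
}
```

The per-fusion support counts are also exactly what an UpSet plot would summarize.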
LOL Runtimes
- vary from 30 minutes to 75 hours
- dragen the fast one
Synthetic fusion gene cDNA for benchmarking
- more dilution == harder to detect
- 1:50 dilution == OK, 1:250 == not so good
Running this crazy big complicated pipeline in an AWS serverless config
- looks like they are paying AWS a lot of $