#GI2019 2019 11 07

Nov 7, 2019 8 min read bioinformatics, gi2019, conferences, talk

Day 2 - Session 2 - SEQUENCING ALGORITHMS, VARIANT DISCOVERY AND GENOME ASSEMBLY
Genomic sketching with HyperLogLog
centroFlye—Assembling centromeres with long error-prone reads
Genotyping structural variants in pangenome graphs using the vg toolkit
Rapidly mapping raw nanopore signal with UNCALLED to enable real-time targeted sequencing
The construct and utility of reference pan-genome graphs
PRINCESS — A framework for comprehensive detection and phasing of SNPs and structural variants
Efficient chromosome-scale haplotype-resolved assembly of human individuals
Utilization of an ensemble approach for identification of driver fusions in pediatric cancer

Day 2 - Session 2 - SEQUENCING ALGORITHMS, VARIANT DISCOVERY AND GENOME ASSEMBLY

Genome Informatics 2019 at CSHL

Bold is the speaker

If you dislike/disagree with my notes/sentiment and you are the speaker/PI then contact me. I very much could be mis-understanding some important points.

Genomic sketching with HyperLogLog

Daniel N. Baker, Ben Langmead.

Slides at…crap missed it. Email Ben I guess?

Sketching

algorithm to collapse genomes into summary info
fasta -> kmers (“shingle”) -> sample (smartly)
showing set relationships
the “minimum” of a set of sketches gives you useful info on the range you would expect

Unions and Intersections

space of possible coincidences requires multiple samples to get useful info
bottom 3?
or minimum in 3 partitions?

MinHash uses bottom-k approach (Ondov / Phillipy)

Tremendous talk - I feel like I understand what is happening. Invite Ben to talk at your place…I’d imagine you’d get a great talk.

OK, so take log log minimum

Daniel Baker wrote:

bit.ly/dash_pre
github.com/dnbaker/dashing
github.com/dnbaker/sketch

HyperLogLog *HLL)

k partition
log₂n
exponent
average, bias correction

Highly vectorized (for speed! Massively parallel calculation)

only on intel?

Comparing to MinHash

HLL deals with lopsided sets better (bottom k not ideal, better minimum k partition)
Uses WAY less memory and wall clock

Future stuffs

multi-k (kmer lengths)
weighted jaccard

centroFlye—Assembling centromeres with long error-prone reads

Andrey Bzikadze, Pavel A. Pevzner.

@AndreyBzikadze

biorxiv preprint (missed link)

brings up t2t consortium (telomere to telomere)

github.com/nanopore-wgs-consortium/CHM13

mentions karen miga’s chrX t2t biorxiv preprint

centromere:

assembler’s favoriate region
- probably not, but keeps them gainfully employed for now
highly repetitive….etc etc.
3% of genome - and because are unassembled, poor idea how they influence human diseasse

centroFlye

classify reads (prefix…something…suffix… what is going on? Andrey already moved on)
- guessing something to do with begining / middle / end of repeat?
find rare k-mers that can be used to anchor assembly
but rare k-mers are usually errors
- if the distance between two rare k-mers is conserved then they are likely real
- neat idea!
- but I’m not certain what conservation is…
throw everything away except for rare k-mers

centroFlye used to improve chrX centromere assembly

greater length
but how do you more robustly define success?
one metric is to count shared (or discrepant) k-mers between assembly and a read
discorance (a,b) = sharedRead(assembly A) - sharedRead(assembly B)

Genotyping structural variants in pangenome graphs using the vg toolkit

Jean Monlong, Glenn Hickey, David Heller, Jonas Andreas Sibbesen, Jouni Siren, Jordan Eizenga, Eric T. Dawson, Erik Garrison, Adam Novak, Benedict Paten.

Yo time for graphs!

Intro on value over linear genomes

highly polymorphic regions should have better geontyping
points out graphs not for discovery but for genotyping

Increasing n of SV catalogs with long read data

HGSVC (3 samples)
SVPOP (15)
GiaB (1)

Probably lots of missing SV from 50-500bp (cmparing HGSVC against gnomadSV which is short-read based)

github.com/vgteam/vg

vg toolkit can make graphs, map reads, and call variants

“can we genotype SVs from short-read datasets with vg toolkit”

Build graph with long-read data, genotype SVs against it with short-reads

R package to evaluate calls

github.com/jmonlong/sveval

with simulated data

vg vs paragraph vs bayestyper vs delly vs SVTyper (last two are non-graph tolls)
basestyper does best (not their tool)
f1 score

with real data from HGSVC

all do worse

simple repeat regions are hard to genotype

line, sine, alu,
for both insertion/deletion

hard to work with vcf…representation confusion with equivalent representation and oversimplification

just use the assemblies?!
testing with yeast right now
vg performance best/near best

Rapidly mapping raw nanopore signal with UNCALLED to enable real-time targeted sequencing

Sam Kovaka, Yunfan Fan, Winston Timp, Michael C. Schatz.

Don’t want unwanted DNA sequence

especially for low throughput sequencers (like ONT)
targeted seq techniques not ideal for ONT
- length not enough
- erase DNA info (methylation, etc.)

ReadUntil

selectively start/stop
davemcg: uh, does this actually work yet?
- was announced a year ago or more??

UNCALLED

utility for nanopore current alignment to large expanses of DNA
novel streaming algorithm which maps raw nanopore signale in real-time
works with raw nanopore output (electrical signal)
discussing kmer matching to nanopore output
- it looks…hard
insane slide discussing algorith/implementation
- i am not equipped to summarize

ReadUntil - enrichment (ejcet a read if it does not map) - depletion (the converse)

Longer reads get more enrichment

saves on seq cost and shorter reads hard to map

Testing with bacteria

4.5x enrichment of on-target
0.4x off-target

Doing a human cancer “panel” for SV

28 genes
overall 3X enrichment
and can assemble the genes with much higher success rate

Future

want to improve yield
is ssDNA “knotting” and blocking ejection?
better API from ONT?

github.com/skovaka/UNCALLED

The construct and utility of reference pan-genome graphs

Heng Li.

pangenome = collection of genomes

graph (collapse similar seq)
or
compressed full-text index

10 years ago (review with Nils Homer): “alignment against multiple genomes will become increasingly important”

“hasn’t happened yet”

vcf doesn’t handle graphs

GFP format (assembly format)

davemcg note: looks like a directed(?) acyclic graph
not good because if you split a segment, the coordinate changes

Proposal:

reference GFA (GFA with tags that have coord info and version)
start with GRCh38, incrementally add other genomes
blacklist and decoy seq for linear tools
updates to preserve coords

Incrementally add new assemblies

add assembly, make graph
then add another assembly, make new graph

Discussion of alignment linear seq to graph…which I don’t quite follow

but ideal approach too slow
so approximate with k shortest paths

minigraph

based on minimap2
limitations!
- doesn’t work with dense graphs (too many k paths?)
1.5 hours over 24 CPUs with human graph of 20 haplotypes
- 36k bubbles
- 94% of GRCh37 is invariable
https://github.com/lh3/minigraph (didn’t give link but it’s on his github)

multi-allelic regions and minisatellites are hard to assemble and genotype

applications

blacklist regions (SV, etc)

PRINCESS — A framework for comprehensive detection and phasing of SNPs and structural variants

Medhat Mahmoud, Winston Timp, Fritz J. Sedlazeck.

Burn. Winston Timp not on the slide.

@MedhatHelmy7

Points out there are diff tech to detect SNV, SV, Phasing, Methylation

Would prefer just use one platform

ONT?

PRINCESS

framework to integrate tools to analyze Long Reads
mapping: minimap2 and/or NGMLR
SNVs: Clair (deep NN) - first NN mention?
SVs: Sniffles
Phasing: WhatsHap, Princess-subtools
can also use trio info (SNPs)
Nanopolish for Methyl C
outputs….statistics!
runs on workstation or hpc or cloud
outputs phase SNV, SV with optional MethylC

Benchmarking with GiaB HG002

PacBio CCS, PacBio CLR, ONT
LOOOOOONG tail for ONT (read length dist)
remove <500bp in length
try diff coverage (10x, 25x, 50x, 95x)
with one tech (70-80% sensitivity, 85-93% precision for SNP)
- PacBio CCS > CLR > ONT
SV calling
- CCS > CLR == ONT (80% sensitivity, 85-95% precision)
Phasing
- Use parental SNPs
- N50 better with ONT (loooong reads)
- but better accuracy with PacBio

Now real data

Common Disease Genomics
focus on cardiovascular and … something else missed it
44k+ short read WGS -> do some with long read -> merge to make comprehensive genome -> which can be used to improve catalogues
SVcollector (github.com/fritzsedlazeck/SVCollector)
then Princess (github.com/MeHelmy/princess)
then Paragraph (github.com/Illumina/paragraph)
- uh, what is paragraph
- from github: " A graph-based structural variant genotyper for short-read sequence data."
- oh, that makes sense
- so use 44k short reads on graph?

Efficient chromosome-scale haplotype-resolved assembly of human individuals

Shilpa Garg, Arkarachai Fungtammasan, Anthony Schmitt, Andrew Carroll, Paul Pelusol, Emily Hatas, Fritz Sedlazeck, Justin Zook, Mike Chou, John Aach, Jason Chin, Heng Li, George Church.

Discussing PacBio HiFi tech

circularize DNA
15k long
99% with 5 passes

Now Hi-C discussion/intro

wild how common Hi-C is. I guess I’m getting old - in grad school this stuff was just IMPOSSIBLE to do

now genome assembly discussion

graph geneeration …. getting longer reads gives you fewer branches on your graph (contiguous assembly)

PacBio alone not enough for full assembly/phasing

but can use trios with Koren/Phillipy approach (at least works in goats)
- trio canu
but want full haplotypes without trios

workflow

PacBio CCS
make contigs
use Hi-C to scaffold
peregrine -> hirrise -> deepvariant -> whatshap + hapcut2 => whatshap -> peregrine
walltime: one day

benchmark with PGP1, HG002, NA12878

contig n50, NGA50, phasing hamming error, phasing switch error
compare against trio canu
generally better than trio canu across all metrics
consistent across diff genomes

checking in HLA region

can build HLA in two contigs

Link!

github.com/shilpagarg/WHdenovo

Utilization of an ensemble approach for identification of driver fusions in pediatric cancer

Stephanine LaHaye, Kyle Voytovich, James Fitch, Natalie Bir, Sean D. McGrath, Anthony Miller, Amy Wetzel, Vincent Magrini, Catherine E. Cottrell, Elaine R. Mardis, Richard K. Wilson.

Last talk of the session!

ID fusions in cancer (pediatric)

Interesting….pediatric cancer have fwer mutations (which I guess makes sense)

SN Grobner et al Nature 2018
fusions are helpful to classify cancers
difficult via RNA-seq

Used 4 (6?) different callers for an ensembl approach

oh god a multi venn diagram
USE AN UpSET plot!!!!
anyways, not much agreement (but low n is a feature if they are true)
not clear how they actually know most are false positive
specificity != sensitivity

LOL Runtimes

vary from 30 minutes to 75 hours
dragen the fast one

Synthetic fusion gene cDNA for benchmarking

more dilution == harder to detect
1:50 dilution == OK, 1:250 == not so good

Running this crazy big complicated pipeline in AWS serveeless config

looks like they are paying AWS a lot of $

bioinformatics gi2019 conferences talk