#GI2019 2019 11 07

Day 2 - Session 2 - SEQUENCING ALGORITHMS, VARIANT DISCOVERY AND GENOME ASSEMBLY

Genome Informatics 2019 at CSHL

Bold is the speaker

If you dislike/disagree with my notes/sentiment and you are the speaker/PI then contact me. I very much could be mis-understanding some important points.

Genomic sketching with HyperLogLog

Daniel N. Baker, Ben Langmead.

Slides at…crap missed it. Email Ben I guess?

Sketching

  • algorithm to collapse genomes into summary info
  • fasta -> kmers (“shingle”) -> sample (smartly)
  • showing set relationships
  • the “minimum” of a set of sketches gives you useful info on the range you would expect

Unions and Intersections

  • space of possible coincidences requires multiple samples to get useful info
  • bottom 3?
  • or minimum in 3 partitions?

MinHash uses bottom-k approach (Ondov / Phillipy)

Tremendous talk - I feel like I understand what is happening. Invite Ben to talk at your place…I’d imagine you’d get a great talk.

OK, so take log log minimum

Daniel Baker wrote:

  • bit.ly/dash_pre
  • github.com/dnbaker/dashing
  • github.com/dnbaker/sketch

HyperLogLog *HLL)

  • k partition
  • log2n
  • exponent
  • average, bias correction

Highly vectorized (for speed! Massively parallel calculation)

  • only on intel?

Comparing to MinHash

  • HLL deals with lopsided sets better (bottom k not ideal, better minimum k partition)
  • Uses WAY less memory and wall clock

Future stuffs

  • multi-k (kmer lengths)
  • weighted jaccard

centroFlye—Assembling centromeres with long error-prone reads

Andrey Bzikadze, Pavel A. Pevzner.

@AndreyBzikadze

biorxiv preprint (missed link)

brings up t2t consortium (telomere to telomere)

github.com/nanopore-wgs-consortium/CHM13

mentions karen miga’s chrX t2t biorxiv preprint

centromere:

  • assembler’s favoriate region
    • probably not, but keeps them gainfully employed for now
  • highly repetitive….etc etc.
  • 3% of genome - and because are unassembled, poor idea how they influence human diseasse

centroFlye

  • classify reads (prefix…something…suffix… what is going on? Andrey already moved on)
    • guessing something to do with begining / middle / end of repeat?
  • find rare k-mers that can be used to anchor assembly
  • but rare k-mers are usually errors
    • if the distance between two rare k-mers is conserved then they are likely real
    • neat idea!
    • but I’m not certain what conservation is…
  • throw everything away except for rare k-mers

centroFlye used to improve chrX centromere assembly

  • greater length
  • but how do you more robustly define success?
  • one metric is to count shared (or discrepant) k-mers between assembly and a read
  • discorance (a,b) = sharedRead(assembly A) - sharedRead(assembly B)

Genotyping structural variants in pangenome graphs using the vg toolkit

Jean Monlong, Glenn Hickey, David Heller, Jonas Andreas Sibbesen, Jouni Siren, Jordan Eizenga, Eric T. Dawson, Erik Garrison, Adam Novak, Benedict Paten.

Yo time for graphs!

Intro on value over linear genomes

  • highly polymorphic regions should have better geontyping
  • points out graphs not for discovery but for genotyping

Increasing n of SV catalogs with long read data

  • HGSVC (3 samples)
  • SVPOP (15)
  • GiaB (1)

Probably lots of missing SV from 50-500bp (cmparing HGSVC against gnomadSV which is short-read based)

github.com/vgteam/vg

vg toolkit can make graphs, map reads, and call variants

“can we genotype SVs from short-read datasets with vg toolkit”

Build graph with long-read data, genotype SVs against it with short-reads

R package to evaluate calls

  • github.com/jmonlong/sveval

with simulated data

  • vg vs paragraph vs bayestyper vs delly vs SVTyper (last two are non-graph tolls)
  • basestyper does best (not their tool)
  • f1 score

with real data from HGSVC

  • all do worse

simple repeat regions are hard to genotype

  • line, sine, alu,
  • for both insertion/deletion

hard to work with vcf…representation confusion with equivalent representation and oversimplification

  • just use the assemblies?!
  • testing with yeast right now

  • vg performance best/near best

Rapidly mapping raw nanopore signal with UNCALLED to enable real-time targeted sequencing

Sam Kovaka, Yunfan Fan, Winston Timp, Michael C. Schatz.

Don’t want unwanted DNA sequence

  • especially for low throughput sequencers (like ONT)
  • targeted seq techniques not ideal for ONT
    • length not enough
    • erase DNA info (methylation, etc.)

ReadUntil

  • selectively start/stop
  • davemcg: uh, does this actually work yet?
    • was announced a year ago or more??

UNCALLED

  • utility for nanopore current alignment to large expanses of DNA
  • novel streaming algorithm which maps raw nanopore signale in real-time
  • works with raw nanopore output (electrical signal)
  • discussing kmer matching to nanopore output
    • it looks…hard
  • insane slide discussing algorith/implementation
    • i am not equipped to summarize

ReadUntil - enrichment (ejcet a read if it does not map) - depletion (the converse)

Longer reads get more enrichment

  • saves on seq cost and shorter reads hard to map

Testing with bacteria

  • 4.5x enrichment of on-target
  • 0.4x off-target

Doing a human cancer “panel” for SV

  • 28 genes
  • overall 3X enrichment
  • and can assemble the genes with much higher success rate

Future

  • want to improve yield
  • is ssDNA “knotting” and blocking ejection?
  • better API from ONT?

github.com/skovaka/UNCALLED

The construct and utility of reference pan-genome graphs

Heng Li.

pangenome = collection of genomes

  • graph (collapse similar seq)
  • or
  • compressed full-text index

10 years ago (review with Nils Homer): “alignment against multiple genomes will become increasingly important”

“hasn’t happened yet”

vcf doesn’t handle graphs

GFP format (assembly format)

  • davemcg note: looks like a directed(?) acyclic graph
  • not good because if you split a segment, the coordinate changes

Proposal:

  • reference GFA (GFA with tags that have coord info and version)
  • start with GRCh38, incrementally add other genomes
  • blacklist and decoy seq for linear tools
  • updates to preserve coords

Incrementally add new assemblies

  • add assembly, make graph
  • then add another assembly, make new graph

Discussion of alignment linear seq to graph…which I don’t quite follow

  • but ideal approach too slow
  • so approximate with k shortest paths

minigraph

  • based on minimap2
  • limitations!
    • doesn’t work with dense graphs (too many k paths?)
  • 1.5 hours over 24 CPUs with human graph of 20 haplotypes
    • 36k bubbles
    • 94% of GRCh37 is invariable
  • https://github.com/lh3/minigraph (didn’t give link but it’s on his github)

multi-allelic regions and minisatellites are hard to assemble and genotype

applications

  • blacklist regions (SV, etc)

PRINCESS — A framework for comprehensive detection and phasing of SNPs and structural variants

Medhat Mahmoud, Winston Timp, Fritz J. Sedlazeck.

Burn. Winston Timp not on the slide.

@MedhatHelmy7

Points out there are diff tech to detect SNV, SV, Phasing, Methylation

Would prefer just use one platform

  • ONT?

PRINCESS

  • framework to integrate tools to analyze Long Reads
  • mapping: minimap2 and/or NGMLR
  • SNVs: Clair (deep NN) - first NN mention?
  • SVs: Sniffles
  • Phasing: WhatsHap, Princess-subtools
  • can also use trio info (SNPs)
  • Nanopolish for Methyl C
  • outputs….statistics!
  • runs on workstation or hpc or cloud
  • outputs phase SNV, SV with optional MethylC

Benchmarking with GiaB HG002

  • PacBio CCS, PacBio CLR, ONT
  • LOOOOOONG tail for ONT (read length dist)
  • remove <500bp in length
  • try diff coverage (10x, 25x, 50x, 95x)
  • with one tech (70-80% sensitivity, 85-93% precision for SNP)
    • PacBio CCS > CLR > ONT
  • SV calling
    • CCS > CLR == ONT (80% sensitivity, 85-95% precision)
  • Phasing
    • Use parental SNPs
    • N50 better with ONT (loooong reads)
    • but better accuracy with PacBio

Now real data

  • Common Disease Genomics
  • focus on cardiovascular and … something else missed it
  • 44k+ short read WGS -> do some with long read -> merge to make comprehensive genome -> which can be used to improve catalogues
  • SVcollector (github.com/fritzsedlazeck/SVCollector)
  • then Princess (github.com/MeHelmy/princess)
  • then Paragraph (github.com/Illumina/paragraph)
    • uh, what is paragraph
    • from github: " A graph-based structural variant genotyper for short-read sequence data."
    • oh, that makes sense
    • so use 44k short reads on graph?

Efficient chromosome-scale haplotype-resolved assembly of human individuals

Shilpa Garg, Arkarachai Fungtammasan, Anthony Schmitt, Andrew Carroll, Paul Pelusol, Emily Hatas, Fritz Sedlazeck, Justin Zook, Mike Chou, John Aach, Jason Chin, Heng Li, George Church.

Discussing PacBio HiFi tech

  • circularize DNA
  • 15k long
  • 99% with 5 passes

Now Hi-C discussion/intro

  • wild how common Hi-C is. I guess I’m getting old - in grad school this stuff was just IMPOSSIBLE to do

now genome assembly discussion

  • graph geneeration …. getting longer reads gives you fewer branches on your graph (contiguous assembly)

PacBio alone not enough for full assembly/phasing

  • but can use trios with Koren/Phillipy approach (at least works in goats)
    • trio canu
  • but want full haplotypes without trios

workflow

  • PacBio CCS
  • make contigs
  • use Hi-C to scaffold
  • peregrine -> hirrise -> deepvariant -> whatshap + hapcut2 => whatshap -> peregrine
  • walltime: one day

benchmark with PGP1, HG002, NA12878

  • contig n50, NGA50, phasing hamming error, phasing switch error
  • compare against trio canu
  • generally better than trio canu across all metrics
  • consistent across diff genomes

checking in HLA region

  • can build HLA in two contigs

Link!

  • github.com/shilpagarg/WHdenovo

Utilization of an ensemble approach for identification of driver fusions in pediatric cancer

Stephanine LaHaye, Kyle Voytovich, James Fitch, Natalie Bir, Sean D. McGrath, Anthony Miller, Amy Wetzel, Vincent Magrini, Catherine E. Cottrell, Elaine R. Mardis, Richard K. Wilson.

Last talk of the session!

ID fusions in cancer (pediatric)

Interesting….pediatric cancer have fwer mutations (which I guess makes sense)

  • SN Grobner et al Nature 2018
  • fusions are helpful to classify cancers
  • difficult via RNA-seq

Used 4 (6?) different callers for an ensembl approach

  • oh god a multi venn diagram
  • USE AN UpSET plot!!!!
  • anyways, not much agreement (but low n is a feature if they are true)
  • not clear how they actually know most are false positive
  • specificity != sensitivity

LOL Runtimes

  • vary from 30 minutes to 75 hours
  • dragen the fast one

Synthetic fusion gene cDNA for benchmarking

  • more dilution == harder to detect
  • 1:50 dilution == OK, 1:250 == not so good

Running this crazy big complicated pipeline in AWS serveeless config

  • looks like they are paying AWS a lot of $

Related

comments powered by Disqus