- Day 3 - Session 7 - MICROBIAL AND METAGENOMICS
- Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps
- metaFlye—Scalable long-read metagenome assembly using repeat graphs
- Detecting microbial transmission and engraftment after faecal microbiota transplants using long-read metagenomics and reticulatus
- Entropy of a bacterial stress response is a generalizable predictor for fitness and antibiotic sensitivity
- The use of kmer counts to train random forests to predict country of origin for bacterial pathogen sequencing data
- Real-time assembly using Nanopore sequencing data for microbial communities
- Exploring the role of ribosomal gene repeats in the context of regeneration
- Genomic epidemiology of West Nile virus in California
Day 3 - Session 7 - MICROBIAL AND METAGENOMICS
Genome Informatics 2019 at CSHL
Bold is the speaker
If you dislike/disagree with my notes/sentiment and you are the speaker/PI then contact me. I very much could be misunderstanding some important points.
My notes will likely reflect my background knowledge of microbes. Which is nil.
Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps
Alexander T. Dilthey, Chirag Jain, Sergey Koren, Adam M. Phillippy.
Discussion of how MinHash works.
kmers, which are hashed, then sorted (take the smallest), compare kmers between query and ref
MinHash ~ Jaccard similarity (point to assess similarity, e.g. “alignment”)
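To make the hash/sort/take-the-smallest idea concrete, here is a minimal bottom-s MinHash sketch in plain Python. It is not the MashMap or MetaMaps implementation; the function names, the choice of blake2b as the hash, and the toy sequences are all assumptions for illustration.

```python
import hashlib

def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (blake2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest hash values among the k-mers of seq."""
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:s]

def minhash_jaccard(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    # The s smallest hashes of the merged sketches form a sketch of the union.
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    # Fraction of the union sketch that is present in both inputs.
    return sum(1 for h in merged if h in shared) / len(merged)

def exact_jaccard(seq_a, seq_b, k):
    """Exact k-mer Jaccard similarity, for comparison with the estimate."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)

# Toy query and reference differing by one base.
query, ref = "ACGTACGTTGCA", "ACGTACGATGCA"
estimate = minhash_jaccard(sketch(query, 4, 5), sketch(ref, 4, 5), 5)
truth = exact_jaccard(query, ref, 4)
```

With a small sketch size the estimate is noisy; once the sketch is at least as large as the number of distinct k-mers it equals the exact Jaccard value.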
“Mapping qualities”
- want prob. dist. of all mapping locations for a read
- assume
- read error rate known
- compute jaccard similarity
- model as Bernoulli
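My rough reading of the assumptions above, as a toy calculation. This is a simplified sketch and almost certainly not the exact MetaMaps model: it assumes the Mash-style relation between error rate and expected Jaccard, treats each sketch element as an independent Bernoulli trial (so shared counts are binomial), and simply normalizes the likelihoods across candidate locations. All names and numbers are hypothetical, and it is only meant for small sketch sizes.

```python
import math

def expected_jaccard(error_rate, k):
    """Expected k-mer Jaccard at the true location, using the Mash-style
    Poisson relation j = 1 / (2 * exp(k * d) - 1) with d = error rate."""
    return 1.0 / (2.0 * math.exp(k * error_rate) - 1.0)

def binomial_likelihood(shared, s, j):
    """P(shared of s sketch elements match) if each matches with prob j."""
    return math.comb(s, shared) * j ** shared * (1.0 - j) ** (s - shared)

def mapping_posterior(shared_counts, s, error_rate, k):
    """Normalize per-location likelihoods into a distribution over locations."""
    j = expected_jaccard(error_rate, k)
    likes = [binomial_likelihood(x, s, j) for x in shared_counts]
    total = sum(likes)
    if total == 0.0:
        # No location is compatible with the model: fall back to uniform.
        return [1.0 / len(likes)] * len(likes)
    return [like / total for like in likes]

# Three candidate locations for one read: shared sketch elements out of s=100.
posterior = mapping_posterior([11, 3, 0], s=100, error_rate=0.1, k=16)
```

The location whose shared count is closest to what the known error rate predicts gets most of the probability mass, which is the "mapping quality" intuition.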
MetaMaps
- integrated metagenomic classification system
- takes lots from MashMap
- auto-tunes minimum read length & identity (EM)
Benchmark
- precision better than others (Kraken, Kraken2, Centrifuge, MEGAN-LR)
- REQUIRES LONG READS (> 1000bp)
- slower than other methods (and takes more mem)
- but are working on this
Going through some case studies on finding novel things
- find mismatches from errors in ref databases
- location mapping allows you to quantify categories of genes (or I totally misheard/misread)
github.com/DiltheyLab/MetaMaps
metaFlye—Scalable long-read metagenome assembly using repeat graphs
Mikhail Kolmogorov, Mikhail Rayko, Jeffrey Yuan, Evgeny Polevikov, Pavel Pevzner.
Genome Assembly, with graphs
Flye intro: - glue repeats into the graph
OK, I’m not following this. Sorry Mikhail!
Isn’t this a metagenome session? This is all genome assembly.
Oh, he is shifting now.
How to detect repeats?
- in genomes can use coverage
- not reliable enough in metagenomes
- need to use graph
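A toy illustration of why coverage works for single genomes but not for metagenomes. This is my own sketch, not how Flye or metaFlye detect repeats; the 1.75x-median threshold and the contig names and coverages are made up.

```python
from statistics import median

def flag_repeats_by_coverage(coverage, ratio=1.75):
    """Flag contigs whose read coverage is at least ratio * median coverage."""
    typical = median(coverage.values())
    return {name for name, cov in coverage.items() if cov >= ratio * typical}

# Single genome: unique sequence sits near 30x, a 2-copy repeat near 60x.
single_genome = {"c1": 30, "c2": 31, "c3": 29, "rep": 62}

# Metagenome: species A is at ~10x, species B at ~200x.
metagenome = {
    "sppA_unique1": 10,
    "sppA_unique2": 11,
    "sppA_repeat": 21,   # true 2-copy repeat in the low-abundance species
    "sppB_unique": 200,  # unique sequence in the high-abundance species
}

single_flags = flag_repeats_by_coverage(single_genome)
meta_flags = flag_repeats_by_coverage(metagenome)
```

In the single genome only the repeat is flagged. In the mixed sample the abundant species' unique contig is flagged and the real repeat is missed, because abundance differences swamp copy-number differences.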
Emphasizes looking at graph viz
- want one loop for bacteria chr
- reality sucks
- bubbles oh my
- strain heterogeneity?
- or repeat -> bubble -> repeat?!
- BUBBLES CAN HAVE BUBBLES
Showing cow rumen microbiome
Now Sheep microbiome
Looks like many successes and many crazy things
Emphasizes metaFlye is still very much in dev.
Recommends Bandage (Wick et al) and AGB (Mikheenko & Kolmogorov) for viz of assemblies
Asks us - would you prefer simple consensus or bubbly haplotype resolved assembly?
github.com/fenderglass/Flye
Detecting microbial transmission and engraftment after faecal microbiota transplants using long-read metagenomics and reticulatus
Samuel M. Nicholls, Joshua C. Quick, Tariq H. Iqbal, Nicholas J. Loman.
@samstudio8
Intro on why microbiome is important
- health
- exciting as the field hasn't fully defined best practices
“three peaks challenge”
- want extraction method that maintains composition (equal representation of strain?) and long reads
(everyone is so coy about what is a faecal transplant)
anyways….checking human microbiomes pre/post transplant
get looooong reads
showing n50, contig nums, etc. they look uh, big? (don’t have a great sense about what is acceptable)
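For reference, N50 is the contig length at which contigs of that length or longer hold at least half of the total assembly. A minimal sketch of the usual definition (the contig lengths are made up):

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_contigs)
```

So a bigger N50 with fewer contigs generally means a more contiguous assembly, although it says nothing about correctness.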
diving into communities. oooooo very cool plot (bubble plot?) with families of bacteria. so many complete (?) genomes
need to do assembly polishing, but expensive computation
github.com/samstudio8/reticulatus (snakemake pipeline for polishing)
tech discussion of GPU acceleration of steps in pipeline
- 100 minutes -> 5 minutes with GPU acceleration
- 10 - 5 hours on promethION
File parsers!
- poor gzip FQ parser
- port Heng Li's readfq…save 40 minutes
Huge time savings on real data
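To show the kind of thing being talked about, here is a minimal streaming FASTQ reader in Python. It is not readfq and not the reticulatus code: it assumes strict 4-line records (readfq also handles multi-line records and FASTA), and the function name is my own.

```python
import gzip

def read_fastq(path):
    """Yield (name, sequence, quality) tuples from a FASTQ file.

    Assumes 4-line records; opens the file with gzip if the path ends in .gz.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if header == "":
                return  # end of file
            if not header.strip():
                continue  # tolerate blank lines between records
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().rstrip("\n")
            if not header.startswith("@") or len(seq) != len(qual):
                raise ValueError("malformed FASTQ record near: " + header.strip())
            yield header[1:].rstrip("\n"), seq, qual
```

Streaming one record at a time keeps memory flat no matter how big the run is, which matters for promethION-sized output.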
back to data
- showing diffs pre/post transplantation
- novel communities?
- how do they find these? eyeballing? diff testing?
Entropy of a bacterial stress response is a generalizable predictor for fitness and antibiotic sensitivity
Defne Surujon, Zeyu Zhu, Juan C. Ortiz-Marquez, José Bento, Tim van Opijnen.
Transcriptional signatures can separate strains (antibiotic sensitive/resistant)
Tough to do diff testing because genes often just flat out missing in some strains
Want a more universal signature
- across strains/conditions
- without relying on previous knowledge of gene function/identity/patterns
Try to predict fitness
- in vitro make high fitness for control (what? how? )
biorxiv.org/content/10.1101/813709v1
3d PCA of strains, colored by (expression?) of major classes of genes
- noticed that high fitness genes had far less variance
gene expression is more chaotic in low fitness strains
can we use entropy? but how do we calculate entropy?
- assume each gene independent (which def. is not correct in real world….but still works in practice)
- calculate variance in gene expression
- then sum all var?
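My best guess at what "sum over independent genes" could look like, written out. This is an assumption on my part, not the authors' code: it treats each gene's expression as an independent Gaussian, so the total differential entropy is the sum of 0.5 * ln(2 * pi * e * variance) over genes. The gene names, values, and the `eps` variance floor are hypothetical.

```python
import math
from statistics import variance

def expression_entropy(profiles, eps=1e-12):
    """Sum of per-gene Gaussian entropies, assuming genes are independent.

    profiles maps gene name -> list of expression changes (e.g. log2FC
    across time points or replicates). eps floors zero variances.
    """
    total = 0.0
    for values in profiles.values():
        var = max(variance(values), eps)
        total += 0.5 * math.log(2 * math.pi * math.e * var)
    return total

# Toy data: a "calm" response versus a "chaotic" one.
calm = {"g1": [0.1, -0.1, 0.0, 0.05], "g2": [0.2, 0.1, 0.15, 0.1]}
chaotic = {"g1": [2.0, -3.0, 1.5, -0.5], "g2": [-2.5, 3.0, 0.5, -1.0]}

calm_entropy = expression_entropy(calm)
chaotic_entropy = expression_entropy(chaotic)
```

Because the per-gene term grows with the log of the variance, wider and more erratic expression changes give a larger total, matching the "low fitness looks more chaotic" observation.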
making networks of gene interactions with some parameter tuning (to figure out edge nums/connectivity) to improve modelling
to save time/money/sanity moved over testing to just one timepoint
- big diff in distribution width (log2FC density plot) between low/high fitness
Does this work for many pathogens?
- seems like it
The use of kmer counts to train random forests to predict country of origin for bacterial pathogen sequencing data
Lauren Cowley, Jordan Taylor, Claire Jenkins, Tim Dallman.
@laurencowley4
Talking about food borne pathogens
- apparently 50% (!) of UK Salmonella is associated with foreign travel / imported food
Discusses pathogen tracking system in UK
- local doc, epidemiologists
Use RF on k-mer counts to figure out country of origin
fsm-lite used (what is this?)
- kmer counter?
- some kmers mark strains
big ’ol matrix
- 5 million kmers across 500 strains, for example
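A small sketch of what that strains-by-kmers matrix looks like. This is plain Python for illustration, not fsm-lite; the strain names, sequences, and k=3 are toy values I made up.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_matrix(genomes, k):
    """Build a strains x k-mers count matrix.

    Returns (strain names, sorted k-mer feature names, rows of counts).
    """
    counts = {name: count_kmers(seq, k) for name, seq in genomes.items()}
    features = sorted(set().union(*counts.values()))
    names = list(genomes)
    # Counter returns 0 for k-mers a strain does not contain.
    rows = [[counts[name][kmer] for kmer in features] for name in names]
    return names, features, rows

toy_genomes = {"uk_1": "ACGTAC", "es_1": "ACGTTT"}
names, features, rows = kmer_matrix(toy_genomes, 3)
```

Each row is one strain and each column one k-mer, so this is the table a classifier such as a random forest would be trained on, with country of origin as the label. At 5 million columns you would want a sparse representation rather than nested lists.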
random forest (RF) intro
- classifier
- non continuous data
- highly resistant to overfitting (bc lots ’o trees)
- interpretability (kmers / features which are most informative for classifier)
train on 2017 data, test on 2018 data
- 71% accuracy for first pass
- only good at domestic/foreign
- non-England countries poorly represented
- decided to aggregate countries by region
- little better
- neural network gets to 80% accuracy with most common 100 kmers
- try with most strains, fewer kmers
- will try unitigs instead of kmers
Real-time assembly using Nanopore sequencing data for microbial communities
Son H. Nguyen, Minh D. Cao, Lachlan J. Coin.
Exploring the role of ribosomal gene repeats in the context of regeneration
Sofia N. Barreira, Andreas D. Baxevanis.
Woo NHGRI (did postdoc there)!
rDNA is highly repetitive in humans
big intergenic spacer (IGS)
UBF (nucleolar tf) binds across rRNA
but this is human….Baxevanis lab studies Hydractinia (highly regenerative marine hydrozoan from phylum Cnidaria)
- you can decapitate, and head regrows in 3 days
are there genomic features that contribute to regeneration?
- got partial 18S and 28S seq from NCBI as well as human 5.8S and aligned to Hydractinia genome assembly
- enhancer elements in rRNA?
- going to do chip-seq
what about UBF?
- tough to ID
- looking for emergence across nematodes, ctenophores, cnidaria, arthropods….chordates
- doing a proteome assembly
Genomic epidemiology of West Nile virus in California
Karthik Gangavarapu, Nathaniel Matteson, on behalf of the WestNile 4K Project.
West Nile
- birds <-> mosquito -> human (dead end for west nile, yay)
- showing infection rate across time and US states
- variable infection rates by state
sampling across California (CA) across time
seq -> phylogenetic tree
- 5 clades
- cool flow moving plots showing movement of strains across CA across time
- virus can persist across seasons