#GI2019 2019 11 08 Morning Session

Nov 8, 2019 6 min read bioinformatics, gi2019, conferences, talk

Day 3 - Session 7 - MICROBIAL AND METAGENOMICS
Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps
metaFlye—Scalable long-read metagenome assembly using repeat graphs
Detecting microbial transmission and engraftment after faecal microbiota transplants using long-read metagenomics and reticulatus
Entropy of a bacterial stress response is a generalizable predictor for fitness and antibiotic sensitivity
The use of kmer counts to train random forests to predict country of origin for bacterial pathogen sequencing data
Real-time assembly using Nanopore sequencing data for microbial communities
Exploring the role of ribosomal gene repeats in the context of regeneration
Genomic epidemiology of West Nile virus in California

Day 3 - Session 7 - MICROBIAL AND METAGENOMICS

Genome Informatics 2019 at CSHL

Bold is the speaker

If you dislike/disagree with my notes/sentiment and you are the speaker/PI then contact me. I very much could be mis-understanding some important points.

My notes will likely reflect my background knowledge of microbes. Which is nil.

Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps

Alexander T. Dilthey, Chirag Jain, Sergey Koren, Adam M. Phillippy.

Discussion of how MinHash works.

kmers, which are hashed, then sorted (take the smallest), compare kmers between query and ref

MinHash ~ Jaccard similarity (point to assess similarity, e.g. “alignment”)

“Mapping qualities”

want prob. dist. of all mapping locations for a read
assume
- read error rate known
compute jaccard similarity
model as Bernoulli

MetaMaps

integrated metagenomic classification system
takes lots from MashMap
auto-tunes minimum read length & identity (EM)

Benchmark

precision better than others (Kraken, Kraken2, Centrifuge, MEGAN-LR)
REQUIRES LONG READS (> 1000bp)
slower than other methods (and takes more mem)
- but are working on this

Going through some case studies on finding novel things

find mismatches from errors in ref databases
location mapping allows you to quantify categories of genes (or I totally misheard/misread)

github.com/DiltheyLab/MetaMaps

metaFlye—Scalable long-read metagenome assembly using repeat graphs

Mikhail Kolmogorov, Mikhail Rayko, Jeffrey Yuan, Evgeny Polevikov, Pavel Pevzner.

Genome Assembly, with graphs

Flye intro: - glue repeats into the graph

OK, I’m not following this. Sorry Mikhail!

Isn’t this a metagenome session? This is all genome assembly.

Oh, he is shifting now.

How to detect repeats?

in genomes can use coverage
not reliable enough in metagenomes
need to use graph

Emphasizes look at graph viz

want one loop for bacteria chr
reality sucks
bubbles oh my
- strain hetergeneity?
- or repeat -> bubble -> repeat?!
BUBBLES CAN HAVE BUBBLES

Showing cow rumen microbiome

Now Sheep microbiome

Looks like many successes and many crazy things

Emphasizes metaFlye is still very much in dev.

Recommends Bandage (Wick et al) and AGB (Mikheenko & Kolmogorov) for viz of assemblies

Asks us - would you prefer simple consensus or bubbly haplotype resolved assembly?

github.com/fenderglass/Flye

Detecting microbial transmission and engraftment after faecal microbiota transplants using long-read metagenomics and reticulatus

Samuel M. Nicholls, Joshua C. Quick, Tariq H. Iqbal, Nicholas J. Loman.

@samstudio8

Intro on why microbiome is important

health
exciting as field isn’t fully define for best practices

“three peaks challenge”

want extraction method that maintains composition (equal representation of strain?) and long reads

(everyone is so coy about what is a faecal transplant)

anyways….checking human microbiomes pre/post transplant

get looooong reads

showing n50, contig nums, etc. they look uh, big? (don’t have a great sense about what is acceptable)

diving into communities. oooooo very cool plot (bubble plot?) with families of bacteria. so many complete (?) genomes

need to do assembly polishing, but expensive computation

github.com/samstudio8/reticulatus (snakemake pipeline for polishing)

tech discussion of GPU acceleration of steps in pipeline

100 minutes -> 5 minutes with GPU acceleration
10 - 5 hours on promethION

File parsers!

poor gzip FQ parser
port heng li readfq…save 40 minutes

Huge time savings on real data

back to data

showing diffs pre/post transplantation
- novel communities?
how do they find these? eyeballing? diff testing?

Entropy of a bacterial stress response is a generalizable predictor for fitness and antibiotic sensitivity

Defne Surujon, Zeyu Zhu, Juan C. Ortiz-Marquez, José Bento, Tim van Opijnen.

Transciptional signatures can seprate straisn (antibiotic sensitive/resistant)

Tough to do diff testing because genes often just flat out missing in some strains

Want a more universal signature

across strains/conditions
without relying on previous knowledge of gene function/identity/patterns

Try to predict fitness

in vitro make high fitness for control (what? how? )

biorxiv.org/content/10.1101/813709v1

3d PCA of strains, colored by (expression?) of major classes of genes

noticed that high fitness genes had far less variance

gene expression is more chaotic in low fitness strains

can we use entropy? but how do we calculate entropy?

assume each gene independent (which def. is not correct in real world….but still works in practice)
calculate variance in gene expression
then sum all var?

making networks of gene interactions with some parameter tuning (to figure out edge nums/connectivity) to improve modelling

to save time/money/sanity moved over testing to just one timepoint

big diff in distribution width (log2FC density plot) between low/high fitness

Does this work for many pathogens?

seems like it

The use of kmer counts to train random forests to predict country of origin for bacterial pathogen sequencing data

Lauren Cowley, Jordan Taylor, Claire Jenkins, Tim Dallman.

@laurencowley4

Talking about food borne pathogens

apparently 50% (!) of UK Salmonella is associated with foreign travel / imported food

Discusses pathogen tracking system in UK

local doc, epidemiologists

Use RF for k-mer counting to figure out country of origin

fsm-lite used (what is this?)

kmer counter?
some kmers mark strains

big ’ol matrix

5 million kmers across 500 strains, for example

random forest (RF) intro

classifier
non continuous data
highly resistant to overfitting (bc lots ’o trees)
interpretability (kmers / features which are most informative for classifier)

train of 2017 data, test on 2018 data

71% accuracy for first pass
only good at domestic/foreign
non-england countries poorly represented
decided to aggregate countries by region
- little better
neural network gets to 80% accuracy with most common 100 kmers
try with most strains, fewer kmers
will try unitigs instead of kmers

Real-time assembly using Nanopore sequencing data for microbial communities

Son H. Nguyen, Minh D. Cao, Lachlan J. Coin.

Exploring the role of ribosomal gene repeats in the context of regeneration

Sofia N. Barreira, Andreas D. Baxevanis.

Woo NHGRI (did postdoc there)!

rDNA is highly repetitive in humans

big intergenic spacer (GS)

UBF (nucleolar tf) binds across rRNA

but this is human….baxevanis lab studies hydractinia (highly regenerative marine hydrozoan from phylum cnidaria)

you can decaptitate, and head regrows in 3 days

are there genomic features that contribute to regeneration?

got partial 18S and 28S seq from NCBI as well as human 5.8S and aligned to hydractinia genome assembly
enhancer elements in rRNA?
going to do chip-seq

what about UBF?

tough to ID
looking for emergence across nematodes, ctenophores, cnidaria, arthopods….chordates
doing a proteome assembly

Genomic epidemiology of West Nile virus in California

Karthik Gangavarapu, Nathaniel Matteson, on behalf of the WestNile 4K Project.

West Nile

birds <-> mosquito -> human (dead end for west nile, yay)
showing infection rate across time and US states
variable infection rates by state

sampling across california (CA) across time

seq -> phylogenetic tree

5 clades
cool flow moving plots showing movement of strains across CA across time
virus can persist across seasons

bioinformatics gi2019 conferences talk