#GI2019 2019 11 08 Morning Session

Day 3 - Session 7 - MICROBIAL AND METAGENOMICS

Genome Informatics 2019 at CSHL

Bold is the speaker

If you dislike/disagree with my notes/sentiment and you are the speaker/PI then contact me. I could very well be misunderstanding some important points.

My notes will likely reflect my background knowledge of microbes. Which is nil.

Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps

Alexander T. Dilthey, Chirag Jain, Sergey Koren, Adam M. Phillippy.

Discussion of how MinHash works.

k-mers are hashed, then sorted (keep only the smallest hashes); compare the retained k-mers between query and ref

MinHash ~ Jaccard similarity (point to assess similarity, e.g. “alignment”)
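A minimal sketch of the MinHash idea described above, to make the "hash, sort, keep the smallest" step concrete. This is not MetaMaps or MashMap code. The function names, the choice of blake2b as the hash, and the bottom-s sketch variant are my assumptions for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Stable 64-bit hash of a k-mer (blake2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest hash values among the k-mers of seq."""
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the s smallest hashes of the merged sketches and count the
    fraction of them that occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


query = "ACGTACGTGG"
ref = "ACGTACGTCC"
s = 100  # larger than the number of k-mers, so the estimate is exact here
print(jaccard_estimate(sketch(query, 4, s), sketch(ref, 4, s), s))
```

With a sketch size smaller than the number of distinct k-mers the result becomes an estimate instead of the exact Jaccard value, which is the trade-off that makes MinHash fast.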

“Mapping qualities”

  • want prob. dist. of all mapping locations for a read
  • assume
    • read error rate known
  • compute jaccard similarity
  • model as Bernoulli

MetaMaps

  • integrated metagenomic classification system
  • takes lots from MashMap
  • auto-tunes minimum read length & identity (EM)

Benchmark

  • precision better than others (Kraken, Kraken2, Centrifuge, MEGAN-LR)
  • REQUIRES LONG READS (> 1000bp)
  • slower than other methods (and takes more mem)
    • but are working on this

Going through some case studies on finding novel things

  • find mismatches from errors in ref databases
  • location mapping allows you to quantify categories of genes (or I totally misheard/misread)

github.com/DiltheyLab/MetaMaps

metaFlye—Scalable long-read metagenome assembly using repeat graphs

Mikhail Kolmogorov, Mikhail Rayko, Jeffrey Yuan, Evgeny Polevikov, Pavel Pevzner.

Genome Assembly, with graphs

Flye intro:

  • glue repeats into the graph

OK, I’m not following this. Sorry Mikhail!

Isn’t this a metagenome session? This is all genome assembly.

Oh, he is shifting now.

How to detect repeats?

  • in genomes can use coverage
  • not reliable enough in metagenomes
  • need to use graph
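To illustrate the coverage-based approach mentioned for single genomes, here is a minimal sketch. It is not how Flye or metaFlye does it. The function name, the input dictionary, and the 1.75x threshold are hypothetical. In a metagenome the same high coverage could just mean a more abundant species, which is presumably why coverage alone is not reliable there.

```python
from statistics import median


def flag_repeats_by_coverage(coverage, ratio=1.75):
    """Flag sequences whose read coverage is well above the median.

    coverage: dict mapping a contig/edge name to its mean read coverage.
    ratio: hypothetical cutoff; a collapsed 2-copy repeat should sit
           near 2x the single-copy coverage.
    """
    baseline = median(coverage.values())
    return sorted(name for name, cov in coverage.items()
                  if cov >= ratio * baseline)


# Toy single-genome example: e4 has roughly double coverage.
example = {"e1": 30, "e2": 31, "e3": 29, "e4": 62, "e5": 30}
print(flag_repeats_by_coverage(example))
```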

Emphasizes look at graph viz

  • want one loop for bacteria chr
  • reality sucks
  • bubbles oh my
    • strain heterogeneity?
    • or repeat -> bubble -> repeat?!
  • BUBBLES CAN HAVE BUBBLES

Showing cow rumen microbiome

Now Sheep microbiome

Looks like many successes and many crazy things

Emphasizes metaFlye is still very much in dev.

Recommends Bandage (Wick et al) and AGB (Mikheenko & Kolmogorov) for viz of assemblies

Asks us - would you prefer simple consensus or bubbly haplotype resolved assembly?

github.com/fenderglass/Flye

Detecting microbial transmission and engraftment after faecal microbiota transplants using long-read metagenomics and reticulatus

Samuel M. Nicholls, Joshua C. Quick, Tariq H. Iqbal, Nicholas J. Loman.

@samstudio8

Intro on why microbiome is important

  • health
  • exciting, as the field hasn’t fully defined best practices yet

“three peaks challenge”

  • want extraction method that maintains composition (equal representation of strain?) and long reads

(everyone is so coy about what is a faecal transplant)

anyways….checking human microbiomes pre/post transplant

get looooong reads

showing n50, contig nums, etc. they look uh, big? (don’t have a great sense about what is acceptable)
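For reference, N50 is the contig length at which contigs of that length or longer hold at least half of the total assembled bases. A minimal sketch of the standard definition (the function name is mine, and the lengths below are toy values, not numbers from the talk):

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    Walk the contigs from longest to shortest and return the length of
    the contig at which the running total reaches half of all bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig")


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

A bigger N50 means more of the assembly sits in long contigs, so "big" is good here, although it says nothing about whether the contigs are correct.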

diving into communities. oooooo very cool plot (bubble plot?) with families of bacteria. so many complete (?) genomes

need to do assembly polishing, but expensive computation

github.com/samstudio8/reticulatus (snakemake pipeline for polishing)

tech discussion of GPU acceleration of steps in pipeline

  • 100 minutes -> 5 minutes with GPU acceleration
  • 10 - 5 hours on PromethION

File parsers!

  • poor gzip FQ parser
  • port Heng Li’s readfq…saves 40 minutes

Huge time savings on real data

back to data

  • showing diffs pre/post transplantation
    • novel communities?
  • how do they find these? eyeballing? diff testing?

Entropy of a bacterial stress response is a generalizable predictor for fitness and antibiotic sensitivity

Defne Surujon, Zeyu Zhu, Juan C. Ortiz-Marquez, José Bento, Tim van Opijnen.

Transcriptional signatures can separate strains (antibiotic sensitive/resistant)

Tough to do diff testing because genes are often just flat out missing in some strains

Want a more universal signature

  • across strains/conditions
  • without relying on previous knowledge of gene function/identity/patterns

Try to predict fitness

  • in vitro make high fitness for control (what? how? )

biorxiv.org/content/10.1101/813709v1

3d PCA of strains, colored by (expression?) of major classes of genes

  • noticed that high fitness genes had far less variance

gene expression is more chaotic in low fitness strains

can we use entropy? but how do we calculate entropy?

  • assume each gene independent (which def. is not correct in real world….but still works in practice)
  • calculate variance in gene expression
  • then sum all var?
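A minimal sketch of the steps above as I understood them, not the authors' implementation. Assumptions: expression is given as log2 fold changes per gene across timepoints, and the function names are hypothetical. The Gaussian form is my own addition: if genes are independent, the joint entropy is the sum of per-gene entropies, and for a Gaussian each one is 0.5 * ln(2 * pi * e * variance).

```python
import math
from statistics import variance


def summed_variance(expr):
    """Sum of per-gene sample variances.

    expr: dict mapping gene -> list of log2 fold changes
          (each list needs at least two values).
    """
    return sum(variance(values) for values in expr.values())


def gaussian_entropy(expr):
    """Joint entropy assuming independent Gaussian genes.

    Assumption, not from the talk: each gene contributes
    0.5 * ln(2 * pi * e * variance), so every variance must be > 0.
    """
    return sum(0.5 * math.log(2 * math.pi * math.e * variance(values))
               for values in expr.values())


# Toy data: the "low fitness" response swings much more widely.
high_fitness = {"geneA": [0.1, -0.1, 0.0], "geneB": [0.2, 0.0, -0.2]}
low_fitness = {"geneA": [2.0, -1.5, 0.5], "geneB": [-3.0, 1.0, 2.5]}

print(summed_variance(high_fitness), summed_variance(low_fitness))
print(gaussian_entropy(high_fitness), gaussian_entropy(low_fitness))
```

Both scores go up as expression gets more "chaotic", which matches the observation that low fitness comes with wider expression changes.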

making networks of gene interactions with some parameter tuning (to figure out edge nums/connectivity) to improve modelling

to save time/money/sanity moved over testing to just one timepoint

  • big diff in distribution width (log2FC density plot) between low/high fitness

Does this work for many pathogens?

  • seems like it

The use of kmer counts to train random forests to predict country of origin for bacterial pathogen sequencing data

Lauren Cowley, Jordan Taylor, Claire Jenkins, Tim Dallman.

@laurencowley4

Talking about food borne pathogens

  • apparently 50% (!) of UK Salmonella is associated with foreign travel / imported food

Discusses pathogen tracking system in UK

  • local doc, epidemiologists

Use RF on k-mer counts to figure out country of origin

fsm-lite used (what is this?)

  • kmer counter?
  • some kmers mark strains

big ’ol matrix

  • 5 million kmers across 500 strains, for example
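To picture the matrix, here is a minimal sketch that builds a strains-by-k-mers count table from a few toy sequences. This is not fsm-lite or the speakers' pipeline. The function names and the toy sequences are made up, and a real 5 million by 500 matrix would need a sparse representation.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_matrix(genomes, k):
    """Build a strains x k-mers count matrix.

    genomes: dict mapping strain name -> sequence.
    Returns (columns, rows): the sorted k-mer list and, per strain,
    a list of counts in the same column order.
    """
    counts = {name: count_kmers(seq, k) for name, seq in genomes.items()}
    columns = sorted(set().union(*counts.values()))
    rows = {name: [c[km] for km in columns] for name, c in counts.items()}
    return columns, rows


columns, rows = kmer_matrix({"s1": "ACGTA", "s2": "CGTAC"}, k=3)
print(columns)
for name, row in rows.items():
    print(name, row)
```

Each row is then one training example for the classifier, with the country of origin as the label.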

random forest (RF) intro

  • classifier
  • non continuous data
  • highly resistant to overfitting (bc lots ’o trees)
  • interpretability (kmers / features which are most informative for classifier)

train on 2017 data, test on 2018 data

  • 71% accuracy for first pass
  • only good at domestic/foreign
  • non-England countries poorly represented
  • decided to aggregate countries by region
    • little better
  • neural network gets to 80% accuracy with most common 100 kmers
  • try with most strains, fewer kmers
  • will try unitigs instead of kmers

Real-time assembly using Nanopore sequencing data for microbial communities

Son H. Nguyen, Minh D. Cao, Lachlan J. Coin.

Exploring the role of ribosomal gene repeats in the context of regeneration

Sofia N. Barreira, Andreas D. Baxevanis.

Woo NHGRI (did postdoc there)!

rDNA is highly repetitive in humans

big intergenic spacer (IGS)

UBF (nucleolar TF) binds across rRNA

but this is human….Baxevanis lab studies Hydractinia (highly regenerative marine hydrozoan from phylum Cnidaria)

  • you can decapitate it, and the head regrows in 3 days

are there genomic features that contribute to regeneration?

  • got partial 18S and 28S seq from NCBI as well as human 5.8S and aligned to Hydractinia genome assembly
  • enhancer elements in rRNA?
  • going to do chip-seq

what about UBF?

  • tough to ID
  • looking for emergence across nematodes, ctenophores, cnidaria, arthropods….chordates
  • doing a proteome assembly

Genomic epidemiology of West Nile virus in California

Karthik Gangavarapu, Nathaniel Matteson, on behalf of the WestNile 4K Project.

West Nile

  • birds <-> mosquito -> human (dead end for west nile, yay)
  • showing infection rate across time and US states
  • variable infection rates by state

sampling across california (CA) across time

seq -> phylogenetic tree

  • 5 clades
  • cool flow moving plots showing movement of strains across CA across time
  • virus can persist across seasons
