#GI2019 2019 11 06

Nov 6, 2019 7 min read bioinformatics, gi2019, conferences, talk

Day 1 - Session 1 - GENOME STRUCTURE AND FUNCTION

Day 1 - Session 1 - GENOME STRUCTURE AND FUNCTION

Genome Informatics 2019 at CSHL

Bold is the speaker

349 attendees. Largest?

189 posters

39 (?) talks

90 countries, 20 (US) states

High(est) % female, URM for gi

If you dislike/disagree with my notes/sentiment and you are the speaker/PI then contact me. I very much could be mis-understanding some important points.

Comparative 3D genome organization in Apicomplexan parasites

Evelien M. Bunnik, Aarthi Venkat, Ferhat Ay, Karine G. Le Roch.

Apicomplexan parasites:

toxoplasma gondii
babesia
most prevalent nad morbidity causing pathogens

Try to understand at genomic level

Malaria (plasmodium) genome published 18 years ago

82% AT rich
grab bag of major -omics techniques
still poor understanding of gene regulation
believe under-representation of TFs identified (lower rate than human, yeast)

Did Hi-C (chrmatin conformation capture)

at many stages
100kb resolution (at least for later work?)
going through their literature of chromosome organization (Ay, Bunnik Genome Research 2014)

Virulence Genes

mediates adherence of infected RGC to endothelium of infected RBC
altered chromatin structure at some of these genes

More examples of loops for other pathogenesis processes

see power of doing many stages (stage specific patterning of contact maps)

Did more Hi-C across other parasites, unfortunately had to re-assemble several genomes

Bunnik PNAS 2019
more expression at genes near (in 3D space) telomore
which is where many virulence genes are …

Unscrambling the tumor genome via integrated analysis of structural variation and copy number

Daniel L. Cameron, Charles Shale, Jonathan Baber, Anthony T. Papenfuss, Peter Priestly.

Hartwig has a fully automated oncology report (“no touch”)

needs to be highly sensitive and specific
use 110x tumor, 45x germline
SV and CNV calling not ideal

GRIDSS -> PURPLE -> LINX

SV-> CNV -> Interpretation

https://github.com/PapenfussLab/gridss

full pipeline: https://github.com/hartwigmedical/gridss-purple-linx

work on phasing SV (cis vs trans vs unknown)

only good up to about 300bp
but events are clustered, so works better than you’d expect

Benchmarking (includes GRIDSS)

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1720-5

Tissue-specific enhancer functional networks for associating distal regulatory regions to disease

Xi Chen, Jian Zhou, Chandra L. Theesfeld, Ran Zhang, Aaron K. Wong, Olga G. Troyanskaya.

How to predict function of enhancers?

Different ways enhancers associated with disease:

enhancer -> gene (near)
enhancer(with SNP inside)
enhancer(rare mutation inside)
enhance without direct evidence

Hypothesis:

enhancer will have a functional relationship with another enhancer if:
- 3d interaction
- co-activation with TF
- regulate related genes
if all met, then maybe disease associated

bayesian inference to model the above assumptions

how to see if this sort of works?

check brain specific super enhancers in non-brain tissues
get some enrichment

https://fenrir.flatironinstitute.org

web tool
no preprint or paper?

dmcg note:

seems like you’d really want some en-masse validation. validating a few sites doesn’t seem like enough.
i missed what the input data for these models are (public stuff? Internal stuffs?)

Targeted Nanopore sequencing with Cas9 for studies of methylation, structural variants, and mutations

Timothy E. Gilpatrick, Isac Lee, Fritz J. Sedlazek, Winston Timp.

Want to enrich regions for genome

ROI -> cut out with CRISPR (cas9 on both sides of ROI) -> dA tail -> adapter ligate

oh even cutting just once helps with enrichment. cool.
can target 10 sites simulateously (400X+ coverage with MinION)

Tried with cancer model

On-target perentage (reads) is about 1-4%

most off-target randomly distributed
very little off-target cutting

5% error rate!

homopolymers sucks
guppy v3 does better job at SNV calling

did downsampling test to check SNV performance

tools
- mpileup
- Nanopolish (Simpson et al. ?)
- Medaka (ONT)
- Clair (Luo et al.)
can get 95% sesnitivity…doesn’t seem to great
Clair gets worse with more coverage (trained on 25X apparently)

HE IS GOING SO FAST OMG

More discussion of benchmarking - still in SNV calling

Now we are talking phasing

in TP53 find all true variants, no FP

Suggestion of CNV with phased data, more copies of one allele (60X vs 20X)

Now CpG

enriched stuff can still get detectable CpG calls

SV & Repetitive

still works

dmcg note:

everything still works (isn’t that expected?)
on-target percentage VERY low (1-4% means you are “throwing” lots of data away)
and yield is getting hammered…so doubly bad
not doing anything like getting hundreds of genes…which I would think would be really necessary to be useful clinically
size smaller? (length of DNA reads)

Long-read sequencing of structurally variant genomes

Evan E. Eichler.

4% duplication

49% of normal CNV maps to this 4% of genome

Long reads increase resolution of SV

SV detection comaprison between Illumina and PacBio (30k with PB, 11k with Illumina)

70% of SV missed with just Illumina data

Did SV analysis with LR seq method across diverse human genomes

found near 100k resolved SV
exciting to see a plateau of novel SV! (east africa most unique, as expected)

Improve SV calling with short read data with improved genomes

can find new eQTL

Goal: t2t (telomere 2 telomere for chromosome)

t2t consortium

Solution:

HiFi PacBio 99.9% acccurate 18kbp
ONT gives you ultra long (whale!) reads

HiFi & Strandseq(?) to build referene-free phased human genomes

hifi & canu -> saarclust & strandseq -> deepvariant -> whatshap & strandseq -> canu/peregrine

assemboy -> contigs -> snp calls -> haplotype CCS reads -> assembly

23 chr (“clusters”)

95% SNP phase

N50 = 28 Mbp

problematic regions

canu and peregrine fail
250 in count
60% are supplemental dups
18% SV
23% acrocentric

solutions

build graghs around high dup regions of human genome
“SDA” assembler

Long reads getting centrome seq of chr8 (t2t consortium)

Exploring the 3D spatial dependency of gene expression using Markov random fields

Naihui Zhou, Iddo Friedberg, Mark S. Kaiser.

The quantitative expression of gene regulation could manifest as spatial dependency

Hypothesis:

TF factories (clustering of genes for active transcription)
hub-enhancers (shared enhancers for many genes)

Propose adding spatial location of genes to improve modeeling of gene-gene variation (which is already modelled in most diff expression analyses, e.g. limma)

Poisson Hierarchical Markov Random Field Model (PHiMRF)

uses HiC data to get pairwise gene interactions
two genes called neighbors if HiC interaction > threshold
making a spatial gene network
genes are nodes, edges are HiC interaction

New random variable Y_ik which follows binomial dist (? missed this)

Do genes in TAD have spatial dependencies?

but how to deal with interTAD and intraTAD (within and between)?
run model with both types of edges
Yes…more connected than bootstraping tests

davemcg note: Still waiting to see if/how this improves diff expression?

Enriched GO terms for most conncted gene groups

tx by RNA polII
GPCR signalling
negative cell proliferation

R package! https://github.com/ashleyzhou972/PhiMRF

Mapping cis-eQTL from RNA-seq data with no genotypes

Elena Vigorito, Simon White, Chris Wallace.

Can’t always get genotypes….so would like eQTL without them….

Impute cis-SNP (G_i)

negative binomial weighted by probable genotype

work in 500kb window to compute gene - SNP pairings

benchmark against GEUVADIS reference data

only chr22 (why? performance?)
better sensitivity and specificity

R package out “soon”

only method for eQTL without genotypes

Exploring short tandem repeat expansions at both known and novel loci in the human genome

Harriet Dashnow, Brent Pedersen, Daniel MacArthur, Alicia Oshlack, Aaron Quinlan.

@hdashnow (twitter handle! good!)

STR is 1-6bp repeats. Microsatellites.

3% of human genome with high mutation rate and high polymorphism

Huntington’s, fragile X, spinocerebellar ataxias

Classically, longer is worse prognosis / higher severity

STR alleles are WAY longer than short reads

Harriet looking for NOVEL (str NOT IN REFERENCE GENOME)

not so many methods fornovel (just ExpansionHunter? as of 2 months ago)

STR reads often mis-mapped

read STR…where does it go???
sends it to just whatever STR or nowhere if really novel

STRling!

https://github.com/quinlan-lab/STRling
use kmers to find STR
if read passes kmer threahold, then is a STR … then re-map
for longer pairs use linear regression to predict allele size

Validated acdross 9 (7 worked) disease expansions

Found 5/5 novel expansions in CANVAS disease

Also validation with PacBio

Limitations! (yay! seriously love this)

limited to large (>50 bp) STR expansions

bioinformatics gi2019 conferences talk