#GI2018 - Day One

Sep 17, 2018 5 min read GI2018, bioinformatics

Intro
2018-09-17
Sarah Teichmann (Teichmann Lab)
Girgio Gonnella (Stefan Kurtz)
Luke Zappia (Oshlack?)
Laura Huerta (Papatheodorou)
Casey Greene (Greene)
Sergei Yakneen

Intro

Very sparse and poorly written notes covering #GI2018.

Typos everywhere. Things may change dramatically over time as I scan back through notes.

I’ve tried to respect #notwitter. Will be updated periodically.

Speaker (Lab | Group)

BOLDED is voice

2018-09-17

Sarah Teichmann (Teichmann Lab)

Cell Atlas Technologies and the Maternal-Fetal Interface

Human Cell Atlas plug

600 scientists
scRNAseq + spatial methods

scRNA works OK

70%+ pearson cor (among genes?)
Detection limit pretty good after 1 million reads
Svensson and Teichmann preprint

Spatial Tech

i.e. gene expression at voxel resolution in a non-destroyed tissue
- 1 voxel ~ 20 cells
Hope to ID spatially variable genes
- [github.com/TeichLab/SpatialDE]
- Svensson and Teichmann

Braga and Stegle: Merge Spatial TX with scRNA

Future stuffs: Histology merged with scRNA

Back to HCA

Many tissues, many scientists, many countries
curious if any integration with Chan-Zuckerberg

Moving onto to Maternal-fetal interface

tricky area
immune component
two organisms
tumor like
scRNA + WGS resolved fetal/maternal adn cell tyes

Girgio Gonnella (Stefan Kurtz)

Flexible and interactive visualization of GFA sequence graphs

Graphical Fragment Assembly

format for representing sequence graphs
contig output not as informative as graphs

GFA1 proposal format from Heng Li releaseed in 2016-09

GFA2 2017-01 more general

Today more assemblers are using

GFA2 format:

header
sequences
relationships
RGFA (Ruby)
Gfapy (python)

GfaViz

visualization of GFA2 (and GFA1)
C++, QT, OGDF
GUI and CLI
Two layouts, many options to customize views

Scaffolding graphs

dealing with pos gaps (missing seq) and neg gaps (contig overlap by repeats)
Show how it looks with Bandage, no gap info
much busier with GfaViz
- but now can see gap info between the pieces

Long reads

local alignments are messy (noisy data)
GFA2 has internal alignments

Release is later this year

ggonnella on github
email: gonnella@zbh.uni-hamburg.de

Luke Zappia (Oshlack?)

Using clustering trees to visualize scRNA-seq data

(caps my own) SO MANY CLUSTERING TOOLS

T-SNE IS NOT CLUSTERING

How do we decide how many clusters?

When deciding k (cluster num) you can think of a graph and with edge weights to assess interesting groups

Cells on edge / num of cells in high res cluster

Above equation good way to think about usefulness of cluster groups

As we increase k, we can see how the graph changes and can get a sense of whether k should be changed

Real data

used Seurat to cluster
see a branch with is distinct and doesn’t interact with anything else
also see a stable region
see low proportion edges if you really increase the k

I can’t doodle on the computer, which would be helpful here but this stuff looks really helpful

Using edge porportion (above eq.) and cluster relatinoships you get a decent sense of whether the clustering makes sense

WHAT SO COOL (overlap of cluster trees on t-SNE)

Q: could this be used to magic pick a k? A: Everyone asks this….please don’t….no idea what would happen

Laura Huerta (Papatheodorou)

Data curation integration and visualization

Oh boy, a viz talk that I’m going to cover with words

Expression Atlas

open science resource for holding expression data
super useful resource, unfortunately in my experience pain to get data into R with their R packages
- I find it easier to just find the tsv output link and then slurp that into R
big value in consistent computational workflows
>3,300 datasets, from all the big consortia (GTEx, FANTOM, etc)

clusterSeq package

missed what this does but looks like a good thing to check out…
from bioconductor page: Identification of clusters of co-expressed genes based on their expression across multiple (replicated) biological samples.

Can embed Expression Atlas data on other pages (ensembl does this)

scRNA-seq data also - again public data, consistent processing (extension of iRAP) - can see whether gene is considered a marker gene (not sure how this is picked) - have a t-SNE(?) interactive plot view and shows your favorite gene colored

Future:

merge bulk and scRNA data
….something else I missed about metadata?

Q: How is data updated with genome build change? A: In lock step, can view old versions

Casey Greene (Greene)

Can “big data” help us tackle rare diseases?

~3.7 million assays / datasets

~3.8 billion USD

Really tough to compare across datasets with the “modular” approach

PLIER: decompose dataset into latent variables by genes and sample:

Mao et al biorxiv 2017

Latent variables in individual datasets not helpful for dataset comparison - but could you run PLIER on datasetS and see whether it “works” - with SLE (lupus)….yes - with only 7 datasets, can find vars that are NOT dataset specific

Can you learn patterns from large datasets then transfer to individual dataset / problem?

MultiPLIER
ran with recount2 data on SLE (looking at ANC-associated)
top loadings known info
see how loading score drops in other (non SLE) tissues
slides: https://www.dropbox.com/sh/j0ysaojqmac29qv/AADBHy-3y8qzWHr34F7M39QIa?dl=0

Pitch for http://researchparasite.com

Sergei Yakneen

Butler: a framework for large-scale scientific analysis on the cloud

[github.com/llevar/butler]

PCAWG

pan cancer analysis of whole genomes
2,834 donors, 70,313 files, 729 TB

Show graph scaling off as pipelines choke as more samples get added

If progress was linear, years of compute (?) time could be saved

Key needs:

provisioning
config management
workflow
operations management

Tried to use “off the shelf” infrastructure / processes to build Butler

Workflows are … CWL?

This seems targeted towards a pretty small audience - people running huge compute platforms?

** I guess you could use this for a smaller projects and deploy to cloud, but I would think that the overhead of learning this would likely overwhelm the potential savings relative to using Snakemake / Docker / Conda. **

conference GI2018 bioinformatics talks