Intro
Very sparse and poorly written notes covering #GI2018.
Typos everywhere. Things may change dramatically over time as I scan back through notes.
I’ve tried to respect #notwitter. Will be updated periodically.
Speaker (Lab | Group)
BOLDED is voice
2018-09-17
Sarah Teichmann (Teichmann Lab)
Cell Atlas Technologies and the Maternal-Fetal Interface
- 600 scientists
- scRNAseq + spatial methods
scRNA works OK
- 70%+ pearson cor (among genes?)
- Detection limit pretty good after 1 million reads
- Svensson and Teichmann preprint
Spatial Tech
- i.e. gene expression at voxel resolution in a non-destroyed tissue
- 1 voxel ~ 20 cells
- Hope to ID spatially variable genes
- [github.com/TeichLab/SpatialDE]
- Svensson and Teichmann
Braga and Stegle: Merge Spatial TX with scRNA
Future stuffs: Histology merged with scRNA
Back to HCA
- Many tissues, many scientists, many countries
- curious if any integration with Chan-Zuckerberg
Moving onto to Maternal-fetal interface
- tricky area
- immune component
- two organisms
- tumor like
- scRNA + WGS resolved fetal/maternal adn cell tyes
Girgio Gonnella (Stefan Kurtz)
Flexible and interactive visualization of GFA sequence graphs
Graphical Fragment Assembly
- format for representing sequence graphs
- contig output not as informative as graphs
GFA1 proposal format from Heng Li releaseed in 2016-09
GFA2 2017-01 more general
Today more assemblers are using
GFA2 format:
- header
- sequences
- relationships
- RGFA (Ruby)
- Gfapy (python)
GfaViz
- visualization of GFA2 (and GFA1)
- C++, QT, OGDF
- GUI and CLI
- Two layouts, many options to customize views
Scaffolding graphs
- dealing with pos gaps (missing seq) and neg gaps (contig overlap by repeats)
- Show how it looks with Bandage, no gap info
- much busier with GfaViz
- but now can see gap info between the pieces
Long reads
- local alignments are messy (noisy data)
- GFA2 has internal alignments
Release is later this year
- ggonnella on github
- email: gonnella@zbh.uni-hamburg.de
Luke Zappia (Oshlack?)
Using clustering trees to visualize scRNA-seq data
(caps my own) SO MANY CLUSTERING TOOLS
T-SNE IS NOT CLUSTERING
How do we decide how many clusters?
When deciding k (cluster num) you can think of a graph and with edge weights to assess interesting groups
Cells on edge / num of cells in high res cluster
Above equation good way to think about usefulness of cluster groups
As we increase k, we can see how the graph changes and can get a sense of whether k should be changed
Real data
- used Seurat to cluster
- see a branch with is distinct and doesn’t interact with anything else
- also see a stable region
- see low proportion edges if you really increase the k
I can’t doodle on the computer, which would be helpful here but this stuff looks really helpful
Using edge porportion (above eq.) and cluster relatinoships you get a decent sense of whether the clustering makes sense
WHAT SO COOL (overlap of cluster trees on t-SNE)
Q: could this be used to magic pick a k? A: Everyone asks this….please don’t….no idea what would happen
Laura Huerta (Papatheodorou)
Data curation integration and visualization
Oh boy, a viz talk that I’m going to cover with words
Expression Atlas
- open science resource for holding expression data
- super useful resource, unfortunately in my experience pain to get data into R with their R packages
- I find it easier to just find the tsv output link and then slurp that into R
- big value in consistent computational workflows
>3,300
datasets, from all the big consortia (GTEx, FANTOM, etc)
clusterSeq package
- missed what this does but looks like a good thing to check out…
- from bioconductor page: Identification of clusters of co-expressed genes based on their expression across multiple (replicated) biological samples.
Can embed Expression Atlas data on other pages (ensembl does this)
scRNA-seq data also - again public data, consistent processing (extension of iRAP) - can see whether gene is considered a marker gene (not sure how this is picked) - have a t-SNE(?) interactive plot view and shows your favorite gene colored
Future:
- merge bulk and scRNA data
- ….something else I missed about metadata?
Q: How is data updated with genome build change? A: In lock step, can view old versions
Casey Greene (Greene)
Can “big data” help us tackle rare diseases?
~3.7 million assays / datasets
~3.8 billion USD
Really tough to compare across datasets with the “modular” approach
PLIER: decompose dataset into latent variables by genes and sample:
- Mao et al biorxiv 2017
Latent variables in individual datasets not helpful for dataset comparison - but could you run PLIER on datasetS and see whether it “works” - with SLE (lupus)….yes - with only 7 datasets, can find vars that are NOT dataset specific
Can you learn patterns from large datasets then transfer to individual dataset / problem?
- MultiPLIER
- ran with recount2 data on SLE (looking at ANC-associated)
- top loadings known info
- see how loading score drops in other (non SLE) tissues
- slides: https://www.dropbox.com/sh/j0ysaojqmac29qv/AADBHy-3y8qzWHr34F7M39QIa?dl=0
Pitch for http://researchparasite.com
Sergei Yakneen
Butler: a framework for large-scale scientific analysis on the cloud
[github.com/llevar/butler]
PCAWG
- pan cancer analysis of whole genomes
- 2,834 donors, 70,313 files, 729 TB
Show graph scaling off as pipelines choke as more samples get added
If progress was linear, years of compute (?) time could be saved
Key needs:
- provisioning
- config management
- workflow
- operations management
Tried to use “off the shelf” infrastructure / processes to build Butler
Workflows are … CWL?
This seems targeted towards a pretty small audience - people running huge compute platforms?
** I guess you could use this for a smaller projects and deploy to cloud, but I would think that the overhead of learning this would likely overwhelm the potential savings relative to using Snakemake / Docker / Conda. **