#GI2018 - Day One


Very sparse and poorly written notes covering #GI2018.

Typos everywhere. Things may change dramatically over time as I scan back through notes.

I’ve tried to respect #notwitter. Will be updated periodically.

Speaker (Lab | Group)

BOLDED is my voice


Sarah Teichmann (Teichmann Lab)

Cell Atlas Technologies and the Maternal-Fetal Interface

Human Cell Atlas plug

  • 600 scientists
  • scRNAseq + spatial methods

scRNA works OK

Spatial Tech

  • i.e. gene expression at voxel resolution in a non-destroyed tissue
    • 1 voxel ~ 20 cells
  • Hope to ID spatially variable genes
    • [github.com/TeichLab/SpatialDE]
    • Svensson and Teichmann

Braga and Stegle: Merge Spatial TX with scRNA

Future stuffs: Histology merged with scRNA

Back to HCA

  • Many tissues, many scientists, many countries
  • curious if any integration with Chan-Zuckerberg

Moving on to the maternal-fetal interface

  • tricky area
  • immune component
  • two organisms
  • tumor-like
  • scRNA + WGS resolved fetal/maternal and cell types

Giorgio Gonnella (Stefan Kurtz)

Flexible and interactive visualization of GFA sequence graphs

Graphical Fragment Assembly

  • format for representing sequence graphs
  • contig output not as informative as graphs
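
To make the format a bit more concrete, here is a minimal sketch of reading a GFA1 file in Python. It assumes only the tab-separated `S` (segment) and `L` (link) record types; the `parse_gfa1` helper and the toy two-contig graph are hypothetical and were not shown in the talk. A dedicated GFA library would be the better choice for real data.

```python
from collections import defaultdict

def parse_gfa1(text):
    """Parse S (segment) and L (link) lines of a GFA1 string.

    Hypothetical helper for illustration; it ignores every other record
    type and any optional tags.
    """
    segments = {}              # segment name -> sequence
    links = defaultdict(list)  # (from name, orient) -> [(to name, orient, overlap)]
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            src, src_orient, dst, dst_orient, overlap = fields[1:6]
            links[(src, src_orient)].append((dst, dst_orient, overlap))
    return segments, dict(links)

# Toy graph: two contigs joined by a 4 bp overlap (made-up data).
toy = "\n".join([
    "H\tVN:Z:1.0",
    "S\tctg1\tACGTACGT",
    "S\tctg2\tACGTTTGA",
    "L\tctg1\t+\tctg2\t+\t4M",
])
segments, links = parse_gfa1(toy)
print(segments)
print(links)
```

The link record is what a plain contig FASTA loses: it says which contig ends are adjacent and by how much they overlap.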

GFA1: format proposed by Heng Li, released 2016-09

GFA2 (2017-01) is more general

Today more assemblers are using it

GFA2 format:

  • header
  • sequences
  • relationships

Libraries for working with GFA:

  • RGFA (Ruby)
  • Gfapy (Python)

GfaViz

  • visualization of GFA2 (and GFA1)
  • C++, QT, OGDF
  • GUI and CLI
  • Two layouts, many options to customize views

Scaffolding graphs

  • dealing with pos gaps (missing seq) and neg gaps (contig overlap by repeats)
  • Show how it looks with Bandage, no gap info
  • much busier with GfaViz
    • but now can see gap info between the pieces

Long reads

  • local alignments are messy (noisy data)
  • GFA2 has internal alignments

Release is later this year

Luke Zappia (Oshlack?)

Using clustering trees to visualize scRNA-seq data



How do we decide how many clusters?

When deciding k (cluster num) you can think of a graph with edge weights to assess interesting groups

Cells on edge / num of cells in high res cluster

Above equation good way to think about usefulness of cluster groups
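
A minimal sketch of that ratio in Python, assuming each cell has one cluster label at a lower and at a higher resolution. The function name `edge_proportions` and the toy labels are made up for illustration; this is not the speaker's code.

```python
from collections import Counter

def edge_proportions(low_res, high_res):
    """Return {(low cluster, high cluster): proportion}.

    low_res / high_res: per-cell cluster labels at the lower and higher
    clustering resolution, in the same cell order. The proportion is
    cells on the edge / cells in the high-res cluster.
    """
    if len(low_res) != len(high_res):
        raise ValueError("label lists must cover the same cells")
    edge_counts = Counter(zip(low_res, high_res))  # cells on each edge
    high_sizes = Counter(high_res)                 # cells per high-res cluster
    return {
        (low, high): n / high_sizes[high]
        for (low, high), n in edge_counts.items()
    }

# Toy labels for six cells (made up for illustration).
k2 = ["A", "A", "A", "B", "B", "B"]
k3 = ["x", "x", "y", "y", "z", "z"]
props = edge_proportions(k2, k3)
for edge, p in sorted(props.items()):
    print(edge, round(p, 2))
```

In this toy case cluster y draws half of its cells from A and half from B, which is the kind of low-proportion edge that might hint at over-clustering.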

As we increase k, we can see how the graph changes and can get a sense of whether k should be changed

Real data

  • used Seurat to cluster
  • see a branch which is distinct and doesn’t interact with anything else
  • also see a stable region
  • see low proportion edges if you really increase the k

I can’t doodle on the computer, which would be helpful here but this stuff looks really helpful

Using edge proportion (above eq.) and cluster relationships you get a decent sense of whether the clustering makes sense

WHAT SO COOL (overlap of cluster trees on t-SNE)

Q: could this be used to magically pick a k? A: Everyone asks this….please don’t….no idea what would happen

Laura Huerta (Papatheodorou)

Data curation integration and visualization

Oh boy, a viz talk that I’m going to cover with words

Expression Atlas

  • open science resource for holding expression data
  • super useful resource, but unfortunately, in my experience, a pain to get data into R with their R packages
    • I find it easier to just find the tsv output link and then slurp that into R
  • big value in consistent computational workflows
  • >3,300 datasets, from all the big consortia (GTEx, FANTOM, etc)
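
As a rough illustration of the "just grab the TSV" route, here is a sketch in Python rather than R. The sample text, its column names, and the assumption that metadata lines start with `#` are all made up for the example, so check them against the actual download before relying on this.

```python
import csv
import io

def read_expression_tsv(handle):
    """Read a tab-separated expression table into a list of dicts.

    Assumes (not verified against any particular download) that lines
    starting with '#' are comments and the first remaining line is the header.
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))

# Made-up sample standing in for a downloaded file.
sample = io.StringIO(
    "# hypothetical metadata line\n"
    "Gene ID\tGene Name\tliver\tlung\n"
    "GENE0001\tgeneA\t12.5\t0.0\n"
    "GENE0002\tgeneB\t3.0\t7.25\n"
)
rows = read_expression_tsv(sample)
liver = {row["Gene Name"]: float(row["liver"]) for row in rows}
print(liver)
```

With a real file, the `io.StringIO` sample would be replaced by an opened file handle, for example `open(path, newline="")`.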

clusterSeq package

  • missed what this does but looks like a good thing to check out…
  • from bioconductor page: Identification of clusters of co-expressed genes based on their expression across multiple (replicated) biological samples.

Can embed Expression Atlas data on other pages (ensembl does this)

scRNA-seq data also

  • again public data, consistent processing (extension of iRAP)
  • can see whether gene is considered a marker gene (not sure how this is picked)
  • have a t-SNE(?) interactive plot view and shows your favorite gene colored
  • merge bulk and scRNA data
  • ….something else I missed about metadata?

Q: How is data updated with genome build change? A: In lock step, can view old versions

Casey Greene (Greene)

Can “big data” help us tackle rare diseases?

~3.7 million assays / datasets

~3.8 billion USD

Really tough to compare across datasets with the “modular” approach

PLIER: decompose dataset into latent variables by genes and sample:

  • Mao et al biorxiv 2017
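
In generic matrix-factorization notation, the decomposition can be sketched as below. The symbols are illustrative shorthand and not necessarily the paper's, and any extra constraints the method places on the factors are left out.

```latex
Y \approx Z B, \qquad
Y \in \mathbb{R}^{g \times s},\quad
Z \in \mathbb{R}^{g \times k},\quad
B \in \mathbb{R}^{k \times s},\quad
k \ll g
```

Here Y is the genes-by-samples expression matrix, Z holds the gene loadings for each of the k latent variables, and each row of B gives one latent variable's value across the s samples.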

Latent variables in individual datasets not helpful for dataset comparison

  • but could you run PLIER on datasetS and see whether it “works”?
  • with SLE (lupus)….yes
  • with only 7 datasets, can find vars that are NOT dataset specific

Can you learn patterns from large datasets then transfer to individual dataset / problem?

Pitch for http://researchparasite.com

Sergei Yakneen

Butler: a framework for large-scale scientific analysis on the cloud



  • pan cancer analysis of whole genomes
  • 2,834 donors, 70,313 files, 729 TB

Shows a graph of scaling falling off as pipelines choke when more samples get added

If progress was linear, years of compute (?) time could be saved

Key needs:

  • provisioning
  • config management
  • workflow
  • operations management

Tried to use “off the shelf” infrastructure / processes to build Butler

Workflows are … CWL?

This seems targeted towards a pretty small audience - people running huge compute platforms?

**I guess you could use this for smaller projects and deploy to the cloud, but I would think that the overhead of learning this would likely overwhelm the potential savings relative to using Snakemake / Docker / Conda.**

