#GI2019 2019 11 07 Night Session

Nov 7, 2019 6 min read bioinformatics, gi2019, conferences, talk

Day 2 - Session 6 - TRANSCRIPTOMICS
The functional iso-transcriptomics analysis framework to assess the functional impact of alternative isoform usage
Multi-resolution, interactive, atlas-scale integration of single-cell assays and experiments
Efficient and robust transcriptome reconstruction from long-read RNA-seq alignments
Deconvolving the pervasive transcription from jumping genes in RNA-seq and unveiling their role in tumors
Alignment and mapping methodology influence transcript abundance estimation
A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification
Quantifying isoform expression in single-cell RNA-seq data with STARsolo-Quant
Full-length transcript characterization for single cell RNA-seq analysis

Day 2 - Session 6 - TRANSCRIPTOMICS

Genome Informatics 2019 at CSHL

Bold is the speaker

If you dislike/disagree with my notes/sentiment and you are the speaker/PI then contact me. I very much could be mis-understanding some important points.

The functional iso-transcriptomics analysis framework to assess the functional impact of alternative isoform usage

Lorena de la Fuente, Manuel Tardáguila, Hector del Risco, Angeles Arzalluz, Francisco Pardo Palacios, Pedro Salguero, Cristina Martín, Sonia Tarazona, Ana Conesa.

Missed most of this talk to FaceTime goodnight to my kids.

Not certain what is happening - is this some sort of analysis package?

OK, went to twitter.

http://tappas.org

GUI system for analysis?

Multi-resolution, interactive, atlas-scale integration of single-cell assays and experiments

Akshay Balsubramani, Anshul Kundaje.

So many sc data types

CITE, FACS, MERFISH, scATAC…..etc forever

Want to unify datasets from different experiments

Using Tabula Muris data

Has both Droplet and well based data with rougly equal n
also both annotated with similar-ish cells

Connect cells of similar cell types (anchors)
Impute measured features within clusters
Refine co-embeddings on subpopulations as needed

(I think only this last point may be new? I think many do the first two…)

Use “Batch mxiing entropy” to quantify whether tissue types are clustering together

Show how Seurat works OK (“overmixes” some cells)

“Musica”

kNN (nearest neighbors)

Need existing labels to integrate? This would be a deal breaker for many….

Is this a package/software tool???

Want to align all single cell data, agnostic of tech.

No software.

So impossible to evaluate whether this is useful.

Efficient and robust transcriptome reconstruction from long-read RNA-seq alignments

Sam Kovaka, Aleksey V. Zimin, Geo M. Pertea, Roham Razaghi, Steven L. Salzberg, Mihaela Pertea.

Going over StringTie, which makes tx-ome from short-read (via graph) and high quality ref genome.

Short-reads have two big limitations

rarely span more than two exons
multi-mapping

Limitations of long read

error rate (PacBio simualated looks WAY worse than I would have expected….not certain why)
low throughput (can only get info on highly expressed genes)

Effect of noise on the splice graph

makes the graph crazy complicated very quickly
which is why StringTie has so much trouble with long-read data

StringTie2

represent splice graph as matrix (now is sparse matrix unlike StringTie1)
correct splice site error by using consensus
remove edges from splice graph which are “poorly” supported
handles downsampled truth annotations pretty well (StringTie has guided mode which uses known genes to build transcripts)

Deconvolving the pervasive transcription from jumping genes in RNA-seq and unveiling their role in tumors

Fabio Navarro, Jacob Hoops, Lauren Bellfy, Eliza Cerveira, Qihui Zhu, Chengsheng Zhang, Charles Lee, Mark Gersten.

More than half of the genome is repetitive

SINE, LINE, LTRs, Simple repeats

Focus on LINE1 (most common, right?)

Estimation of TE transcription in genome is hard

tossing out multi-mappers throws out real data

TeXP: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007293

How can we reliably quantify TE tx?

Alignment and mapping methodology influence transcript abundance estimation

Avi Srivastava, Laraib Malik, Hirak Sarkar, Mohsen Zakeri, Fatemeh Almodaresi, Charlotte Soneson, Michael I. Love, Carl Kingsford, Rob Patro.

Starts of with BO Li RSEM equation to estimate abundance

This talk focuses on mapping approaches and how they influence tx quant

“The sample is not the reference”

Tested custom tx-ome vs ref

Bowtie2 and STAR have largest drop in performance

still difference isn’t huge

Used real data

both bulk and single cell (full length)
no ground truth
- hand check alignments
found some themes….and we are on the next slide

Made “decoy-aware” tx-ome

mask exonic regions
take rest and find similar regions, use this for decoy
all reads that map best to decoy are tossed

“Oracle”

use bowtie2 against txome and STAR against genome (aware of tx)
if some filter thingy happens, then toss read

sorry these notes sucks….too busy listening…to this information dense talk….

STAR seems the best…across the testing

https://www.biorxiv.org/content/10.1101/657874v2

A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification

Dana Wyman, Gabriela Balderrama-Gutierrez, Fairlie Reese, Shan Jiang, Brian Williams, Barbara Wold, Ali Mortazavi.

TALON!

quantification of long read RNA-sdq

also makes high quality annotations

can add new data without re-running existing

used SQANTI tx category names

short tx can (often) be artifacts

rna degradation, polyA priming issues

lots of filtering to toss likely erroneous tx

PacBio is consistent with PacBio (pearson 0.9) at gene and tx level (0.7)

can also reliably detect egnes above TPM 5

Direct-RNA ONT also reproducible

a bit worse than PacBio
misses many lower read level genes
maybe because reads that get stuck in pore look like short tx
fewer RTPCR artifacts (as expected…since there’s no RTPCR)

github.com/dewyman/TALON github.com/dewyman/TranscriptClean

Quantifying isoform expression in single-cell RNA-seq data with STARsolo-Quant

Nathan Castro-Pacheco, Alexander Dobin.

Going over droplet based scRNA

e.g. DropSeq, 10X
high dropout, short reads

Discussing tx isoforms….not many quantify

exception: TCC (Ntranos….Pachter)

Want to do tx isoform quant with sc data

But short reads at 3’ end…don’t tend to map uniquely

Propose merging clusters of scRNA to get higher counts

quant -> cluster -> do quant again with pseudo bulk? (this is me guessing)

EM based quant, models 3’ end bias

STARsolo -> cluster -> STARsolo-Quant -> combine reads from each cluster -> EMaximization (well I wasn’t super far off)

Showing a bunch of case studies with individual genes and the effect of low counts

If they a bunch of tx have identical end, then you are screwed (so much multi-mapping) - they propose just merging these together

With splatter (simulated data) spearman R 0.93

Full-length transcript characterization for single cell RNA-seq analysis

Elizabeth Tseng, Jason Underwood.

@magdoll

Iso-Seq

full length RNA-seq
uses CCS
8 passes get QV30

Did Illumina and PacBio single cell

droplet? what? SMARTseq? didn’t say….
anyways, spearman R 0.96

scRNA is shorter than bulk (for tx length)

clear left-ward shift in scRNA data

github.com/magdoll/sqanti2

Talks about orthogonal valdation of full length, 5’, 3’ matching

How to represent isoforms

so many (>200k, depending how you count)
top 3 for each gene?
just unique ORFs?
- uh, what? is that a good idea? I guess I’m not certain why you HAVE to trim at this step

bioinformatics gi2019 conferences talk