Bioinformatics

#GI2018 - Day Three

Day 3. Rafael Irizarry Keynote. Session 4: Transcriptomics, Alternative Splicing and Gene Predictions: Mark Robinson (Robinson), Hagen Tilgner (Tilgner), Nikka Keivanfar (Church, 10X), Koen Van den Berge (Clement), Barbara Engelhardt (Engelhardt), Jeff Gaither (White), Fiona Dick (Tzoulis). Epigenetics and non-coding genome: Jordana Bell (Bell), Wouter Meuleman (Stamatoyannopoulos), Maša Roller (Flicek), Alexander Suh (Suh), Raquel Garcia-Perez (Juan). Very sparse and poorly written notes covering #GI2018.

#GI2018 - Day Two

Day Two. Personal and Medical Genomics: Katie Pollard Keynote, Sri Kosuri (Kosuri), Patrick Brennan (Nationwide Children’s), Kaitlin Samocha (Hurles), Tracy Ballinger (Semple), Lucia Spangenberg (Naya). Comparative, Evolutionary, Metagenomics: Ellen Leffler (Kwiatkowski), Luca Penso-Dolfin (Di Palma), Mario Caccamo (NIAB), Carla Cummins (Flicek). Very sparse and poorly written notes covering #GI2018. Typos everywhere. Things may change dramatically over time as I scan back through notes.

#GI2018 - Day One

Intro. 2018-09-17: Sarah Teichmann (Teichmann Lab), Giorgio Gonnella (Stefan Kurtz), Luke Zappia (Oshlack?), Laura Huerta (Papatheodorou), Casey Greene (Greene), Sergei Yakneen. Very sparse and poorly written notes covering #GI2018. Typos everywhere. Things may change dramatically over time as I scan back through notes. I’ve tried to respect #notwitter. Will be updated periodically. Speaker (Lab | Group). BOLDED is voice. 2018-09-17, Sarah Teichmann (Teichmann Lab): Cell Atlas Technologies and the Maternal-Fetal Interface

Quick Guide to Gene Name Conversion

Background. There are several popular naming systems for (human) genes: RefSeq (NM_000350), Ensembl (ENSG00000198691), HGNC Symbol (ABCA4), and Entrez (24). Given enough time in #bioinformatics, you will have to do every possible combination of conversions. This post very briefly explains the most expedient way to convert between these formats automatically with R. More exhaustive resources: http://crazyhottommy.blogspot.com/2014/09/converting-gene-ids-using-bioconductor.html and https://davetang.org/muse/2013/11/25/thoughts-converting-gene-identifiers/. Ensembl <-> HGNC <-> Entrez: Stephen Turner has built a small set of data frames (well, tibbles) with core information, including transcript <-> gene info.
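The post itself does the conversion in R with prebuilt data frames. As a language-neutral illustration of the same idea (a lookup table keyed on one ID system, read out in another), here is a minimal Python sketch. The single-row table reuses the four example identifiers listed above, which all refer to ABCA4; the names `GENE_TABLE`, `build_index`, and `convert` are hypothetical and are not part of any package.

```python
# A toy lookup table: one row per gene, one column per naming system.
# The single row reuses the example IDs from the text (all ABCA4).
# A real table would have thousands of rows loaded from an annotation resource.
GENE_TABLE = [
    {
        "hgnc": "ABCA4",
        "ensembl": "ENSG00000198691",
        "entrez": "24",
        "refseq": "NM_000350",  # note: RefSeq NM_ IDs are transcript-level
    },
]


def build_index(table, source):
    """Index the rows of `table` by the ID system named in `source`."""
    return {row[source]: row for row in table}


def convert(ids, table, source, target):
    """Convert each ID from `source` to `target`; unknown IDs map to None."""
    index = build_index(table, source)
    return [index[i][target] if i in index else None for i in ids]


symbols = convert(["ENSG00000198691", "ENSG_not_in_table"], GENE_TABLE, "ensembl", "hgnc")
print(symbols)
```

Returning `None` for unmatched IDs (rather than dropping them) keeps the output aligned with the input, which matters when the converted column is attached back onto an existing results table.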

#BoG18: Talk Notes

Intro. Genome Engineering and Genome Editing (Tuesday Night): Jef Boeke (Writing Genomes, “dark matter”, big dna), Greg Findlay (Jay Shendure), Stephen Levene (Andrew Fire), David Truong (Jef Boeke), Feng Zhang, Molly Gasperini (Jay Shendure), Eilon Sharon (Hunter Fraser), Luca Pinello. Population Genomics (Wednesday morning): Mattias Jakobsson, Jaemin Kim (Elaine Ostrander), Ipsita Agarwal (Molly Przeworski), Amnon Koren, Sarah Tishkoff, Patrick Albers (Gil McVean), Laura Hayward (Guy Sella). Functional Genetics and Epigenomics: Job Dekker, Flora Vaccarino, Carninci, Jonathan Griffiths (Berthold Gottgens), Emma Farley, Jake Yeung (Felix Naef), Minal Caliskan (Casey Brown), Parisa Razaz (Talkowski). Evolutionary and Non-human genomics: Monica Justice, Arang Rhie (Erich Jarvis, Adam Phillippy), Olga Dudchenko (Erez Lieberman Aiden), Kasper Munch, Gavin Sherlock, Anne Ruxandra Carvunis, Elaine Ostrander, Bobbie Cansdale (Claire Wade). Cancer and Medical Genomics: Trey Ideker, Rajbir Batra (Carlos Caldas), Patrick Short (Matthew Hurles), Max Shen (David Gifford), Sharon Plon (PJ Lupo), Massa Shoura (Andrew Fire), Sidi Chen, Marcin Imielinski. Computational Genomics (!

Easy bam downsampling

When you have a set of ChIP-seq (or ChIP-seq-like) files, it is sometimes useful to downsample the larger samples so they more closely match the rest. Tommy Tang goes into more detail in his blog post. Unfortunately, the tool suites I use most for bam files (samtools and picard) only downsample to a percentage, which isn’t ideal when you want your files to have no more than n reads. This post is just a slight one-upping of Tommy Tang’s process to easily downsample a bam.
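The gap between "downsample to a percentage" and "no more than n reads" is a single division: count the reads, then ask for the fraction n / total. The sketch below is not the post's own script; it only builds the value that `samtools view -s` expects, where the integer part is the random seed and the decimal part is the fraction to keep. The function name `subsample_arg` and the default seed are hypothetical, and the read count is assumed to come from something like `samtools view -c`.

```python
from typing import Optional


def subsample_arg(total_reads: int, max_reads: int, seed: int = 42) -> Optional[str]:
    """Build the SEED.FRACTION value for `samtools view -s`.

    Returns None when the file already has max_reads or fewer,
    i.e. no downsampling is needed.
    """
    if total_reads <= max_reads:
        return None
    fraction = max_reads / total_reads
    # Truncate (rather than round) to six decimal places so the fraction
    # can never round up to 1.0, and the result never exceeds max_reads
    # in expectation.
    digits = int(fraction * 1_000_000)
    return f"{seed}.{digits:06d}"


# Example: a 40M-read bam capped at 10M reads keeps a quarter of the reads.
arg = subsample_arg(total_reads=40_000_000, max_reads=10_000_000)
if arg is not None:
    # in.bam / out.bam are placeholder file names
    print(f"samtools view -b -s {arg} in.bam > out.bam")
```

Because the sampling is random, the output will contain approximately n reads, not exactly n. If the target is smaller than one millionth of the input, the truncated fraction becomes zero, so a real script would need more decimal places for that case.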

Are you in genomics and building models? Stop using ROC - use PR

tldr: Area Under the Curve (AUC) of Receiver Operating Characteristic (ROC) is a terrible metric for a genomics problem. Do not use it. This metric also goes by AUC or AUROC. Use Precision Recall AUC. Inspiration for this post: I am working on a machine learning problem in genomics; I was getting really confused why AUROC was so worthless; scienceTwitter featuring Anshul Kundaje; I want to save you (some time). What’s a ROC?
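A small worked example makes the claim concrete. The sketch below uses a deliberately artificial, heavily imbalanced ranking (10 positives among 1000 items, with 50 negatives ranked above every positive), which is an assumption chosen for illustration and not data from the post. It computes AUROC as the probability that a random positive outranks a random negative, and summarizes the precision-recall curve with average precision, one common estimate of PR AUC. Both functions are written from scratch so the example has no dependencies.

```python
def roc_auc(labels, scores):
    """AUROC as P(random positive scores higher than random negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# Toy, made-up ranking: 1000 items sorted from highest to lowest score.
# Items 51-60 are the only positives; everything else is negative.
n = 1000
labels = [1 if 50 <= i < 60 else 0 for i in range(n)]
scores = [1 - i / n for i in range(n)]

print(f"AUROC:             {roc_auc(labels, scores):.3f}")
print(f"Average precision: {average_precision(labels, scores):.3f}")
```

On this toy ranking AUROC comes out at about 0.95 while average precision is about 0.10: the classifier looks excellent by ROC even though roughly five out of every six of its top 60 calls are false positives. The large pool of easy negatives inflates AUROC, and that is the typical situation in genomics, where true positives are rare.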

Let’s Plot 4: R vs Excel, Round 1

Introduction. Data Cleaning. Reformatting. Box Plot. Boxplot with all the data displayed. I used to prefer violin plots. I’m a fan of beeswarm plots with boxplots. Doing statistics. Session. Introduction. The battle that we’ve all been waiting for: Excel vs. R. Bar plot versus a plot that actually shows the data. Yeah, this isn’t a fair fight. Bar plots are terrible. Why? Simple. They don’t show what your data looks like.

Let’s Plot 3: Base pair resolution NGS (exome) coverage plots - Part 2

Introduction. Call mosdepth on bam to calculate bp-specific read depth. Intersect base pair depth info with transcript and exon number. Now it’s R time! Prepare Metadata. Load mosdepth / bedtools intersect data and prep. Plot Maker, version 1. Version 2. sessionInfo(). Introduction. This is a barebones (but detailed enough, I hope) discussion of how to take a bam file, extract base pair resolution coverage data, then finagle the data into coverage plots by gene and exon.

Let’s Plot 3: Base pair resolution NGS coverage plots (Part I)

Load data. Curious? Data. How many genes are in this dataset? What genes are in here? How many data points (bases) per gene? How many exons per gene? How many base pairs of ABCA4 (well, ABCA4 exons) are covered by more than 10 reads? 5 reads? Let’s check all of the genes to see which are the worst covered. We can also display the data visually. Hard to see what is going on, so let’s make little plots for each gene. Where are genes poorly covered?
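The "covered by more than 10 reads? 5 reads?" questions reduce to counting, per gene, the share of bases whose depth exceeds a threshold. The post works through this in R; the Python sketch below shows the same calculation on a handful of made-up (gene, depth) records. The depths, the second gene name `GENE_B`, and the function names are all hypothetical.

```python
from collections import defaultdict


def fraction_above(depths, threshold):
    """Fraction of bases with read depth strictly greater than `threshold`."""
    if not depths:
        return 0.0
    return sum(1 for d in depths if d > threshold) / len(depths)


def per_gene_fraction(records, threshold):
    """Group (gene, depth) records by gene and compute the covered fraction."""
    by_gene = defaultdict(list)
    for gene, depth in records:
        by_gene[gene].append(depth)
    return {gene: fraction_above(depths, threshold) for gene, depths in by_gene.items()}


# Made-up per-base depths; real input would have one record per exonic base.
records = [
    ("ABCA4", 30), ("ABCA4", 12), ("ABCA4", 8), ("ABCA4", 3),
    ("GENE_B", 25), ("GENE_B", 11),
]

for threshold in (10, 5):
    print(threshold, per_gene_fraction(records, threshold))
```

Sorting the resulting dictionary by value gives the "worst covered" genes first, which is the natural starting point for deciding which genes deserve a closer per-exon plot.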