Background
There are several popular naming systems for (human) genes:
- RefSeq (NM_000350)
- Ensembl (ENSG00000198691)
- HGNC Symbol (ABCA4)
- Entrez (24)
Given enough time in #bioinformatics
, you will have to do every possible combination of conversions.
This post will very briefly explain the most expedient way to automatically convert between these formats with R
.
More exhaustive resources
http://crazyhottommy.blogspot.com/2014/09/converting-gene-ids-using-bioconductor.html
https://davetang.org/muse/2013/11/25/thoughts-converting-gene-identifiers/
Ensembl <-> HGNC <-> Entrez
Stephen Turner has built a small set of data frames (well, tibbles) with core information, including transcript <-> gene info. You just install the library, run library(annotables)
and you have tibbles for the info. Super easy.
https://github.com/stephenturner/annotables
## install steps, run once
# install.packages("devtools")
# devtools::install_github("stephenturner/annotables")
library(annotables)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
grch38 %>% head()
## # A tibble: 6 x 9
## ensgene entrez symbol chr start end strand biotype description
## <chr> <int> <chr> <chr> <int> <int> <int> <chr> <chr>
## 1 ENSG000… 7105 TSPAN6 X 1.01e8 1.01e8 -1 protei… tetraspanin 6…
## 2 ENSG000… 64102 TNMD X 1.01e8 1.01e8 1 protei… tenomodulin […
## 3 ENSG000… 8813 DPM1 20 5.09e7 5.10e7 -1 protei… dolichyl-phos…
## 4 ENSG000… 57147 SCYL3 1 1.70e8 1.70e8 -1 protei… SCY1 like pse…
## 5 ENSG000… 55732 C1orf… 1 1.70e8 1.70e8 1 protei… chromosome 1 …
## 6 ENSG000… 2268 FGR 1 2.76e7 2.76e7 -1 protei… FGR proto-onc…
# or grch37, grcm38, rnor6, galgal5, wbcel235, bdgp6, mmul801
But, he did not add Refseq names. So if you need to get RefSeq names into one of the others, you’ll have to do another step.
biomaRt (RefSeq <-> (Ensembl <-> HGNC <-> Entrez))
Ensembl’s biomaRt tool is super powerful. And very annoying to use for me, as I find the syntax impossible to remember. Also it takes over the dplyr select
function with its own select
. You’ll notice here I do not load biomaRt.
But it can convert just about anything to anything.
## install steps, run once
# source("https://bioconductor.org/biocLite.R")
# biocLite("biomaRt")
# library(biomaRt) # <- don't load!, just use the ::
mart<- biomaRt::useMart(biomart = 'ensembl', dataset = 'hsapiens_gene_ensembl')
# mapping example
refseq_ids <- c("NM_006573", "NM_002985", "NM_032965", "NM_002987", "NM_006274", "NM_004591", "NM_002990")
refseq_mapping <- biomaRt::getBM(attributes = c("refseq_mrna","hgnc_symbol"),
filters="refseq_mrna", # you swap out of this filter for whatever your input is
values=refseq_ids, # vector of your NMf
mart=mart)
refseq_mapping
## refseq_mrna hgnc_symbol
## 1 NM_002985 CCL5
## 2 NM_002987 CCL17
## 3 NM_002990 CCL22
## 4 NM_004591 CCL20
## 5 NM_006274 CCL19
## 6 NM_006573 TNFSF13B
## 7 NM_032965 CCL15
If you want to get the rest of the info in Annotables matched up with the RefSeq NM, then just do a left_join
left_join(refseq_mapping %>% select(refseq_mrna, symbol = hgnc_symbol), grch37)
## Joining, by = "symbol"
## refseq_mrna symbol ensgene entrez chr start
## 1 NM_002985 CCL5 ENSG00000161570 6352 17 34198495
## 2 NM_002985 CCL5 ENSG00000271503 6352 HG385_PATCH 34198510
## 3 NM_002987 CCL17 ENSG00000102970 6361 16 57438679
## 4 NM_002990 CCL22 ENSG00000102962 6367 16 57392684
## 5 NM_004591 CCL20 ENSG00000115009 6364 2 228678558
## 6 NM_006274 CCL19 ENSG00000172724 6363 9 34689564
## 7 NM_006573 TNFSF13B ENSG00000102524 10673 13 108903588
## 8 NM_032965 CCL15 ENSG00000267596 6359 17 34323476
## end strand biotype
## 1 34207797 -1 protein_coding
## 2 34207812 -1 protein_coding
## 3 57449974 1 protein_coding
## 4 57400102 1 protein_coding
## 5 228682272 1 protein_coding
## 6 34691274 -1 protein_coding
## 7 108960832 1 protein_coding
## 8 34329084 -1 protein_coding
## description
## 1 chemokine (C-C motif) ligand 5 [Source:HGNC Symbol;Acc:10632]
## 2 chemokine (C-C motif) ligand 5 [Source:HGNC Symbol;Acc:10632]
## 3 chemokine (C-C motif) ligand 17 [Source:HGNC Symbol;Acc:10615]
## 4 chemokine (C-C motif) ligand 22 [Source:HGNC Symbol;Acc:10621]
## 5 chemokine (C-C motif) ligand 20 [Source:HGNC Symbol;Acc:10619]
## 6 chemokine (C-C motif) ligand 19 [Source:HGNC Symbol;Acc:10617]
## 7 tumor necrosis factor (ligand) superfamily, member 13b [Source:HGNC Symbol;Acc:11929]
## 8 chemokine (C-C motif) ligand 15 [Source:HGNC Symbol;Acc:10613]
# we have 8 rows now becuase CCL5 has two matching ensgenes mapped to different locations
sessionInfo
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.5.0 (2018-04-23)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/New_York
## date 2018-07-16
## Packages -----------------------------------------------------------------
## package * version date
## annotables * 0.1.91 2018-06-18
## AnnotationDbi 1.42.1 2018-05-08
## assertthat 0.2.0 2017-04-11
## backports 1.1.2 2017-12-13
## base * 3.5.0 2018-04-24
## bindr 0.1.1 2018-03-13
## bindrcpp 0.2.2 2018-03-29
## Biobase 2.40.0 2018-05-01
## BiocGenerics 0.26.0 2018-05-01
## biomaRt 2.36.1 2018-05-24
## bit 1.1-13 2018-05-15
## bit64 0.9-7 2017-05-08
## bitops 1.0-6 2013-08-17
## blob 1.1.1 2018-03-25
## blogdown 0.8.1 2018-07-16
## bookdown 0.7 2018-02-18
## broom 0.4.4 2018-03-29
## cellranger 1.1.0 2016-07-27
## cli 1.0.0 2017-11-05
## colorspace 1.3-2 2016-12-14
## compiler 3.5.0 2018-04-24
## crayon 1.3.4 2017-09-16
## curl 3.2 2018-03-28
## datasets * 3.5.0 2018-04-24
## DBI 1.0.0 2018-05-02
## devtools 1.13.5 2018-02-18
## digest 0.6.15 2018-01-28
## dplyr * 0.7.6 2018-06-29
## evaluate 0.10.1 2017-06-24
## forcats * 0.3.0 2018-02-19
## foreign 0.8-70 2017-11-28
## ggplot2 * 3.0.0 2018-07-03
## glue 1.2.0 2017-10-29
## graphics * 3.5.0 2018-04-24
## grDevices * 3.5.0 2018-04-24
## grid 3.5.0 2018-04-24
## gtable 0.2.0 2016-02-26
## haven 1.1.1 2018-01-18
## hms 0.4.2 2018-03-10
## htmltools 0.3.6 2017-04-28
## httr 1.3.1 2017-08-20
## IRanges 2.14.10 2018-05-16
## jsonlite 1.5 2017-06-01
## knitr 1.20 2018-02-20
## lattice 0.20-35 2017-03-25
## lazyeval 0.2.1 2017-10-29
## lubridate 1.7.4 2018-04-11
## magrittr 1.5 2014-11-22
## memoise 1.1.0 2017-04-21
## methods * 3.5.0 2018-04-24
## mnormt 1.5-5 2016-10-15
## modelr 0.1.2 2018-05-11
## munsell 0.4.3 2016-02-13
## nlme 3.1-137 2018-04-07
## parallel 3.5.0 2018-04-24
## pillar 1.2.3 2018-05-25
## pkgconfig 2.0.1 2017-03-21
## plyr 1.8.4 2016-06-08
## prettyunits 1.0.2 2015-07-13
## progress 1.1.2 2016-12-14
## psych 1.8.4 2018-05-06
## purrr * 0.2.4 2017-10-18
## R6 2.2.2 2017-06-17
## Rcpp 0.12.17 2018-05-18
## RCurl 1.95-4.10 2018-01-04
## readr * 1.1.1 2017-05-16
## readxl 1.1.0 2018-04-20
## reshape2 1.4.3 2017-12-11
## rlang 0.2.1 2018-05-30
## rmarkdown 1.10 2018-06-11
## rprojroot 1.3-2 2018-01-03
## RSQLite 2.1.1 2018-05-06
## rstudioapi 0.7 2017-09-07
## rvest 0.3.2 2016-06-17
## S4Vectors 0.18.2 2018-05-16
## scales 0.5.0 2017-08-24
## stats * 3.5.0 2018-04-24
## stats4 3.5.0 2018-04-24
## stringi 1.2.2 2018-05-02
## stringr * 1.3.1 2018-05-10
## tibble * 1.4.2 2018-01-22
## tidyr * 0.8.1 2018-05-18
## tidyselect 0.2.4 2018-02-26
## tidyverse * 1.2.1 2017-11-14
## tools 3.5.0 2018-04-24
## utf8 1.1.4 2018-05-24
## utils * 3.5.0 2018-04-24
## withr 2.1.2 2018-03-15
## xfun 0.3 2018-07-06
## XML 3.98-1.11 2018-04-16
## xml2 1.2.0 2018-01-24
## yaml 2.1.19 2018-05-01
## source
## Github (stephenturner/annotables@958545a)
## Bioconductor
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## Bioconductor
## Bioconductor
## Bioconductor
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## Github (rstudio/blogdown@d54c39a)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## cran (@0.7.6)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## cran (@3.0.0)
## CRAN (R 3.5.0)
## local
## local
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## Bioconductor
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## cran (@1.10)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## Bioconductor
## CRAN (R 3.5.0)
## local
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## cran (@0.3)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)