Background
There are several popular naming systems for (human) genes:
- RefSeq (NM_000350)
- Ensembl (ENSG00000198691)
- HGNC Symbol (ABCA4)
- Entrez (24)
Given enough time in #bioinformatics, you will have to do every possible combination of conversions.
This post will very briefly explain the most expedient way to automatically convert between these formats with R.
More exhaustive resources
http://crazyhottommy.blogspot.com/2014/09/converting-gene-ids-using-bioconductor.html
https://davetang.org/muse/2013/11/25/thoughts-converting-gene-identifiers/
Ensembl <-> HGNC <-> Entrez
Stephen Turner has built a small set of data frames (well, tibbles) with core gene information, including transcript <-> gene mappings. Install the package, run library(annotables), and the tibbles are ready to use. Super easy.
https://github.com/stephenturner/annotables
## install steps, run once
# install.packages("devtools")
# devtools::install_github("stephenturner/annotables")
library(annotables)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
grch38 %>% head()
## # A tibble: 6 x 9
## ensgene entrez symbol chr start end strand biotype description
## <chr> <int> <chr> <chr> <int> <int> <int> <chr> <chr>
## 1 ENSG000… 7105 TSPAN6 X 1.01e8 1.01e8 -1 protei… tetraspanin 6…
## 2 ENSG000… 64102 TNMD X 1.01e8 1.01e8 1 protei… tenomodulin […
## 3 ENSG000… 8813 DPM1 20 5.09e7 5.10e7 -1 protei… dolichyl-phos…
## 4 ENSG000… 57147 SCYL3 1 1.70e8 1.70e8 -1 protei… SCY1 like pse…
## 5 ENSG000… 55732 C1orf… 1 1.70e8 1.70e8 1 protei… chromosome 1 …
## 6 ENSG000… 2268 FGR 1 2.76e7 2.76e7 -1 protei… FGR proto-onc…
# or grch37, grcm38, rnor6, galgal5, wbcel235, bdgp6, mmul801
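Since these tables are plain data frames, a conversion is just a lookup from one column to another. Here is a minimal sketch in base R, assuming a toy three-row stand-in for grch38 built from IDs that appear in this post. `toy_grch38` and `convert_ids()` are hypothetical names for illustration and are not part of annotables.

```r
# Toy stand-in for annotables::grch38, keeping only the three ID columns.
# A real run would use grch38 (or grch37, etc.) directly.
toy_grch38 <- data.frame(
  ensgene = c("ENSG00000198691", "ENSG00000161570", "ENSG00000102970"),
  entrez  = c(24L, 6352L, 6361L),
  symbol  = c("ABCA4", "CCL5", "CCL17"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up `ids` in column `from`, return column `to`.
# match() gives NA for IDs missing from the table, so the output always
# has the same length and order as the input.
convert_ids <- function(ids, table, from, to) {
  table[[to]][match(ids, table[[from]])]
}

# Ensembl -> HGNC symbol (the last ID is deliberately fake)
symbols <- convert_ids(c("ENSG00000161570", "ENSG00000198691", "ENSG_not_real"),
                       toy_grch38, from = "ensgene", to = "symbol")

# HGNC symbol -> Entrez
entrez <- convert_ids(c("CCL17", "ABCA4"), toy_grch38, from = "symbol", to = "entrez")
```

Note that match() only returns the first hit, so an ID that maps to several rows is silently collapsed to one. A join keeps every match instead.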
But annotables does not include RefSeq names. So if you need to convert between RefSeq and one of the other formats, you’ll have to do another step.
biomaRt (RefSeq <-> (Ensembl <-> HGNC <-> Entrez))
Ensembl’s biomaRt tool is super powerful and can convert just about anything to anything. It is also very annoying to use, at least for me, as I find the syntax impossible to remember. On top of that, loading it masks the dplyr select function with its own select. That is why you’ll notice I never load biomaRt here and call its functions with the biomaRt:: prefix instead.
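## install steps, run once
# source("https://bioconductor.org/biocLite.R")
# biocLite("biomaRt")
# on newer Bioconductor releases biocLite() has been replaced by:
# BiocManager::install("biomaRt")
# library(biomaRt) # <- don't load! just use biomaRt::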
mart <- biomaRt::useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# mapping example
refseq_ids <- c("NM_006573", "NM_002985", "NM_032965", "NM_002987", "NM_006274", "NM_004591", "NM_002990")
refseq_mapping <- biomaRt::getBM(attributes = c("refseq_mrna", "hgnc_symbol"),
                                 filters = "refseq_mrna", # swap this filter for whatever your input ID type is
                                 values = refseq_ids,     # vector of your NM_ IDs
                                 mart = mart)
refseq_mapping
## refseq_mrna hgnc_symbol
## 1 NM_002985 CCL5
## 2 NM_002987 CCL17
## 3 NM_002990 CCL22
## 4 NM_004591 CCL20
## 5 NM_006274 CCL19
## 6 NM_006573 TNFSF13B
## 7 NM_032965 CCL15
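To make the "swap the filter" idea concrete, here is a hedged sketch of a small wrapper. `map_ids()` is a hypothetical helper, not part of biomaRt. It assumes the input ID type is valid both as a filter and as an attribute, which holds for refseq_mrna above but should be checked with listFilters() and listAttributes() for anything else.

```r
# Hypothetical wrapper around biomaRt::getBM(). `from` is used both as the
# filter (what you search by) and as the first returned column; `to` is the
# ID type you want back.
map_ids <- function(ids, from, to, mart) {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("biomaRt is not installed")
  }
  biomaRt::getBM(attributes = c(from, to),
                 filters    = from,
                 values     = unique(ids),  # no need to query an ID twice
                 mart       = mart)
}

# Not run here (needs a network connection to Ensembl):
# mart <- biomaRt::useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# map_ids(refseq_ids, from = "refseq_mrna", to = "hgnc_symbol", mart = mart)
# map_ids(c("ENSG00000161570", "ENSG00000102970"),
#         from = "ensembl_gene_id", to = "hgnc_symbol", mart = mart)
# biomaRt::listFilters(mart)     # valid values for `from`
# biomaRt::listAttributes(mart)  # valid values for `to`
```

Everything is still called through biomaRt::, so dplyr's select stays unmasked.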
If you want the rest of the info in annotables matched up with the RefSeq NM IDs, just do a left_join on the gene symbol:
left_join(refseq_mapping %>% select(refseq_mrna, symbol = hgnc_symbol), grch37)
## Joining, by = "symbol"
## refseq_mrna symbol ensgene entrez chr start
## 1 NM_002985 CCL5 ENSG00000161570 6352 17 34198495
## 2 NM_002985 CCL5 ENSG00000271503 6352 HG385_PATCH 34198510
## 3 NM_002987 CCL17 ENSG00000102970 6361 16 57438679
## 4 NM_002990 CCL22 ENSG00000102962 6367 16 57392684
## 5 NM_004591 CCL20 ENSG00000115009 6364 2 228678558
## 6 NM_006274 CCL19 ENSG00000172724 6363 9 34689564
## 7 NM_006573 TNFSF13B ENSG00000102524 10673 13 108903588
## 8 NM_032965 CCL15 ENSG00000267596 6359 17 34323476
## end strand biotype
## 1 34207797 -1 protein_coding
## 2 34207812 -1 protein_coding
## 3 57449974 1 protein_coding
## 4 57400102 1 protein_coding
## 5 228682272 1 protein_coding
## 6 34691274 -1 protein_coding
## 7 108960832 1 protein_coding
## 8 34329084 -1 protein_coding
## description
## 1 chemokine (C-C motif) ligand 5 [Source:HGNC Symbol;Acc:10632]
## 2 chemokine (C-C motif) ligand 5 [Source:HGNC Symbol;Acc:10632]
## 3 chemokine (C-C motif) ligand 17 [Source:HGNC Symbol;Acc:10615]
## 4 chemokine (C-C motif) ligand 22 [Source:HGNC Symbol;Acc:10621]
## 5 chemokine (C-C motif) ligand 20 [Source:HGNC Symbol;Acc:10619]
## 6 chemokine (C-C motif) ligand 19 [Source:HGNC Symbol;Acc:10617]
## 7 tumor necrosis factor (ligand) superfamily, member 13b [Source:HGNC Symbol;Acc:11929]
## 8 chemokine (C-C motif) ligand 15 [Source:HGNC Symbol;Acc:10613]
# we have 8 rows now because CCL5 has two matching ensgenes mapped to different locations (chr 17 and the HG385_PATCH scaffold)
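If the extra patch row is unwanted, one option is to keep only the primary chromosomes and then check whether any RefSeq ID still maps to more than one row. This is a minimal base R sketch on a toy copy of the first three joined rows shown above. The `joined` data frame and the decision to drop patch scaffolds are assumptions for illustration; whether to drop them depends on your analysis.

```r
# Toy copy of the first three joined rows (columns trimmed for brevity).
joined <- data.frame(
  refseq_mrna = c("NM_002985", "NM_002985", "NM_002987"),
  symbol      = c("CCL5", "CCL5", "CCL17"),
  ensgene     = c("ENSG00000161570", "ENSG00000271503", "ENSG00000102970"),
  chr         = c("17", "HG385_PATCH", "16"),
  stringsAsFactors = FALSE
)

# Assumption: only the primary assembly chromosomes are wanted.
primary_chr  <- c(as.character(1:22), "X", "Y", "MT")
primary_only <- joined[joined$chr %in% primary_chr, ]

# RefSeq IDs that still map to more than one row after filtering.
still_duplicated <- unique(primary_only$refseq_mrna[duplicated(primary_only$refseq_mrna)])
```

On this toy input the filter leaves one row per RefSeq ID, and `still_duplicated` is empty. On real data it is worth inspecting whatever remains in `still_duplicated` by hand rather than dropping rows blindly.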
sessionInfo
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.5.0 (2018-04-23)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/New_York
## date 2018-07-16
## Packages -----------------------------------------------------------------
## package * version date
## annotables * 0.1.91 2018-06-18
## AnnotationDbi 1.42.1 2018-05-08
## assertthat 0.2.0 2017-04-11
## backports 1.1.2 2017-12-13
## base * 3.5.0 2018-04-24
## bindr 0.1.1 2018-03-13
## bindrcpp 0.2.2 2018-03-29
## Biobase 2.40.0 2018-05-01
## BiocGenerics 0.26.0 2018-05-01
## biomaRt 2.36.1 2018-05-24
## bit 1.1-13 2018-05-15
## bit64 0.9-7 2017-05-08
## bitops 1.0-6 2013-08-17
## blob 1.1.1 2018-03-25
## blogdown 0.8.1 2018-07-16
## bookdown 0.7 2018-02-18
## broom 0.4.4 2018-03-29
## cellranger 1.1.0 2016-07-27
## cli 1.0.0 2017-11-05
## colorspace 1.3-2 2016-12-14
## compiler 3.5.0 2018-04-24
## crayon 1.3.4 2017-09-16
## curl 3.2 2018-03-28
## datasets * 3.5.0 2018-04-24
## DBI 1.0.0 2018-05-02
## devtools 1.13.5 2018-02-18
## digest 0.6.15 2018-01-28
## dplyr * 0.7.6 2018-06-29
## evaluate 0.10.1 2017-06-24
## forcats * 0.3.0 2018-02-19
## foreign 0.8-70 2017-11-28
## ggplot2 * 3.0.0 2018-07-03
## glue 1.2.0 2017-10-29
## graphics * 3.5.0 2018-04-24
## grDevices * 3.5.0 2018-04-24
## grid 3.5.0 2018-04-24
## gtable 0.2.0 2016-02-26
## haven 1.1.1 2018-01-18
## hms 0.4.2 2018-03-10
## htmltools 0.3.6 2017-04-28
## httr 1.3.1 2017-08-20
## IRanges 2.14.10 2018-05-16
## jsonlite 1.5 2017-06-01
## knitr 1.20 2018-02-20
## lattice 0.20-35 2017-03-25
## lazyeval 0.2.1 2017-10-29
## lubridate 1.7.4 2018-04-11
## magrittr 1.5 2014-11-22
## memoise 1.1.0 2017-04-21
## methods * 3.5.0 2018-04-24
## mnormt 1.5-5 2016-10-15
## modelr 0.1.2 2018-05-11
## munsell 0.4.3 2016-02-13
## nlme 3.1-137 2018-04-07
## parallel 3.5.0 2018-04-24
## pillar 1.2.3 2018-05-25
## pkgconfig 2.0.1 2017-03-21
## plyr 1.8.4 2016-06-08
## prettyunits 1.0.2 2015-07-13
## progress 1.1.2 2016-12-14
## psych 1.8.4 2018-05-06
## purrr * 0.2.4 2017-10-18
## R6 2.2.2 2017-06-17
## Rcpp 0.12.17 2018-05-18
## RCurl 1.95-4.10 2018-01-04
## readr * 1.1.1 2017-05-16
## readxl 1.1.0 2018-04-20
## reshape2 1.4.3 2017-12-11
## rlang 0.2.1 2018-05-30
## rmarkdown 1.10 2018-06-11
## rprojroot 1.3-2 2018-01-03
## RSQLite 2.1.1 2018-05-06
## rstudioapi 0.7 2017-09-07
## rvest 0.3.2 2016-06-17
## S4Vectors 0.18.2 2018-05-16
## scales 0.5.0 2017-08-24
## stats * 3.5.0 2018-04-24
## stats4 3.5.0 2018-04-24
## stringi 1.2.2 2018-05-02
## stringr * 1.3.1 2018-05-10
## tibble * 1.4.2 2018-01-22
## tidyr * 0.8.1 2018-05-18
## tidyselect 0.2.4 2018-02-26
## tidyverse * 1.2.1 2017-11-14
## tools 3.5.0 2018-04-24
## utf8 1.1.4 2018-05-24
## utils * 3.5.0 2018-04-24
## withr 2.1.2 2018-03-15
## xfun 0.3 2018-07-06
## XML 3.98-1.11 2018-04-16
## xml2 1.2.0 2018-01-24
## yaml 2.1.19 2018-05-01
## source
## Github (stephenturner/annotables@958545a)
## Bioconductor
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## Bioconductor
## Bioconductor
## Bioconductor
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## Github (rstudio/blogdown@d54c39a)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## cran (@0.7.6)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## cran (@3.0.0)
## CRAN (R 3.5.0)
## local
## local
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## Bioconductor
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## cran (@1.10)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## Bioconductor
## CRAN (R 3.5.0)
## local
## local
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## local
## CRAN (R 3.5.0)
## cran (@0.3)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)
## CRAN (R 3.5.0)