Quick Guide to Gene Name Conversion

Background

There are several popular naming systems for (human) genes:

  1. RefSeq (NM_000350)
  2. Ensembl (ENSG00000198691)
  3. HGNC Symbol (ABCA4)
  4. Entrez (24)

Given enough time in #bioinformatics, you will have to do every possible combination of conversions.

This post will very briefly explain the most expedient way to automatically convert between these formats with R.

Ensembl <-> HGNC <-> Entrez

Stephen Turner has built a small set of data frames (well, tibbles) with core information, including transcript <-> gene info. You just install the library, run library(annotables) and you have tibbles for the info. Super easy.

https://github.com/stephenturner/annotables

## install steps, run once
# install.packages("devtools")
# devtools::install_github("stephenturner/annotables")

library(annotables)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
grch38 %>% head()
## # A tibble: 6 x 9
##   ensgene  entrez symbol chr    start    end strand biotype description   
##   <chr>     <int> <chr>  <chr>  <int>  <int>  <int> <chr>   <chr>         
## 1 ENSG000…   7105 TSPAN6 X     1.01e8 1.01e8     -1 protei… tetraspanin 6…
## 2 ENSG000…  64102 TNMD   X     1.01e8 1.01e8      1 protei… tenomodulin […
## 3 ENSG000…   8813 DPM1   20    5.09e7 5.10e7     -1 protei… dolichyl-phos…
## 4 ENSG000…  57147 SCYL3  1     1.70e8 1.70e8     -1 protei… SCY1 like pse…
## 5 ENSG000…  55732 C1orf… 1     1.70e8 1.70e8      1 protei… chromosome 1 …
## 6 ENSG000…   2268 FGR    1     2.76e7 2.76e7     -1 protei… FGR proto-onc…
# or grch37, grcm38, rnor6, galgal5, wbcel235, bdgp6, mmul801
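
For a one-off lookup you don't even need dplyr. Here is a minimal sketch of a lookup helper, assuming only that the table has the ensgene, entrez and symbol columns shown above. convert_ids() is a hypothetical name of mine, not part of annotables, and the small lookup data frame is a stand-in built from grch37 rows printed further down in this post, so the snippet runs without the package. In practice you would pass annotables::grch38 (or grch37) as the table.

```r
# Stand-in for an annotables table: four grch37 rows copied from the
# left_join output later in this post. Swap in annotables::grch38 for real use.
lookup <- data.frame(
  ensgene = c("ENSG00000161570", "ENSG00000102970",
              "ENSG00000102962", "ENSG00000115009"),
  entrez  = c(6352L, 6361L, 6367L, 6364L),
  symbol  = c("CCL5", "CCL17", "CCL22", "CCL20"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map `ids` from column `from` to column `to`.
# match() returns the first hit only, and NA where an ID is not in the table.
convert_ids <- function(ids, from, to, table = lookup) {
  table[[to]][match(ids, table[[from]])]
}

convert_ids(c("CCL22", "CCL5"), from = "symbol", to = "ensgene")
convert_ids(6361L, from = "entrez", to = "symbol")
convert_ids("ENSG00000115009", from = "ensgene", to = "entrez")
convert_ids("NOT_A_GENE", from = "symbol", to = "entrez")  # NA
```

Because match() keeps only the first hit, one-to-many mappings are silently collapsed. Use a join instead when you need every match.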

But he did not include RefSeq IDs. So if you need to convert RefSeq IDs to or from one of the others, you’ll need another step.

biomaRt (RefSeq <-> (Ensembl <-> HGNC <-> Entrez))

Ensembl’s biomaRt tool is super powerful. It is also very annoying to use, at least for me, as I find the syntax impossible to remember. On top of that, loading it masks dplyr’s select function with its own select. You’ll notice here I do not load biomaRt.

But it can convert just about anything to anything.

## install steps, run once
# install.packages("BiocManager") # biocLite() has since been retired in favor of BiocManager
# BiocManager::install("biomaRt")
# library(biomaRt) # <- don't load!, just use the :: 
mart <- biomaRt::useMart(biomart = 'ensembl', dataset = 'hsapiens_gene_ensembl')
# mapping example
refseq_ids <- c("NM_006573", "NM_002985", "NM_032965", "NM_002987", "NM_006274", "NM_004591", "NM_002990")

refseq_mapping <- biomaRt::getBM(attributes = c("refseq_mrna", "hgnc_symbol"),
                                 filters = "refseq_mrna", # swap this filter for whatever your input ID type is
                                 values = refseq_ids,     # vector of your NM_ IDs
                                 mart = mart)

refseq_mapping 
##   refseq_mrna hgnc_symbol
## 1   NM_002985        CCL5
## 2   NM_002987       CCL17
## 3   NM_002990       CCL22
## 4   NM_004591       CCL20
## 5   NM_006274       CCL19
## 6   NM_006573    TNFSF13B
## 7   NM_032965       CCL15
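
Note that the rows come back in a different order than the input vector. As far as I know, getBM() also simply leaves out IDs it cannot match rather than returning an NA row, so a quick sanity check is worth doing. Here is a minimal sketch in base R. The mapped data frame is a stand-in retyped from the output above so the snippet runs without querying Ensembl, and NM_000000 is a made-up ID added only to show what a miss looks like.

```r
# Input IDs from above, plus one made-up ID (NM_000000) to simulate a miss
query_ids <- c("NM_006573", "NM_002985", "NM_032965", "NM_002987",
               "NM_006274", "NM_004591", "NM_002990", "NM_000000")

# Stand-in for the getBM() result printed above
mapped <- data.frame(
  refseq_mrna = c("NM_002985", "NM_002987", "NM_002990", "NM_004591",
                  "NM_006274", "NM_006573", "NM_032965"),
  hgnc_symbol = c("CCL5", "CCL17", "CCL22", "CCL20",
                  "CCL19", "TNFSF13B", "CCL15"),
  stringsAsFactors = FALSE
)

# IDs that were asked for but did not come back
unmapped <- setdiff(query_ids, mapped$refseq_mrna)
unmapped

# Put the symbols back in input order; unmapped IDs become NA
in_input_order <- mapped$hgnc_symbol[match(query_ids, mapped$refseq_mrna)]
names(in_input_order) <- query_ids
in_input_order
```

With a real query you would replace mapped with your own getBM() result and query_ids with the vector you passed as values.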

If you want to get the rest of the info in annotables matched up with the RefSeq NM IDs, just do a left_join:

left_join(refseq_mapping %>% select(refseq_mrna, symbol = hgnc_symbol), grch37)
## Joining, by = "symbol"
##   refseq_mrna   symbol         ensgene entrez         chr     start
## 1   NM_002985     CCL5 ENSG00000161570   6352          17  34198495
## 2   NM_002985     CCL5 ENSG00000271503   6352 HG385_PATCH  34198510
## 3   NM_002987    CCL17 ENSG00000102970   6361          16  57438679
## 4   NM_002990    CCL22 ENSG00000102962   6367          16  57392684
## 5   NM_004591    CCL20 ENSG00000115009   6364           2 228678558
## 6   NM_006274    CCL19 ENSG00000172724   6363           9  34689564
## 7   NM_006573 TNFSF13B ENSG00000102524  10673          13 108903588
## 8   NM_032965    CCL15 ENSG00000267596   6359          17  34323476
##         end strand        biotype
## 1  34207797     -1 protein_coding
## 2  34207812     -1 protein_coding
## 3  57449974      1 protein_coding
## 4  57400102      1 protein_coding
## 5 228682272      1 protein_coding
## 6  34691274     -1 protein_coding
## 7 108960832      1 protein_coding
## 8  34329084     -1 protein_coding
##                                                                             description
## 1                         chemokine (C-C motif) ligand 5 [Source:HGNC Symbol;Acc:10632]
## 2                         chemokine (C-C motif) ligand 5 [Source:HGNC Symbol;Acc:10632]
## 3                        chemokine (C-C motif) ligand 17 [Source:HGNC Symbol;Acc:10615]
## 4                        chemokine (C-C motif) ligand 22 [Source:HGNC Symbol;Acc:10621]
## 5                        chemokine (C-C motif) ligand 20 [Source:HGNC Symbol;Acc:10619]
## 6                        chemokine (C-C motif) ligand 19 [Source:HGNC Symbol;Acc:10617]
## 7 tumor necrosis factor (ligand) superfamily, member 13b [Source:HGNC Symbol;Acc:11929]
## 8                        chemokine (C-C motif) ligand 15 [Source:HGNC Symbol;Acc:10613]
# we have 8 rows now because CCL5 has two matching ensgenes mapped to different locations
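
If you only want the copy on the primary assembly, one option is to drop rows whose chr is a patch or alternate scaffold. Here is a minimal sketch under that assumption, again in base R. The joined data frame is a stand-in retyped from the first three rows of the output above (a subset of columns), and primary_chr is my own name for the list of standard human chromosomes, not something annotables provides.

```r
# Stand-in for the first three rows of the joined table above
joined <- data.frame(
  refseq_mrna = c("NM_002985", "NM_002985", "NM_002987"),
  symbol      = c("CCL5", "CCL5", "CCL17"),
  ensgene     = c("ENSG00000161570", "ENSG00000271503", "ENSG00000102970"),
  chr         = c("17", "HG385_PATCH", "16"),
  stringsAsFactors = FALSE
)

# RefSeq IDs that matched more than one Ensembl gene
multi <- unique(joined$refseq_mrna[duplicated(joined$refseq_mrna)])
multi

# Keep only rows on the standard chromosomes (assumption: that is what you want)
primary_chr <- c(as.character(1:22), "X", "Y", "MT")
deduped <- joined[joined$chr %in% primary_chr, ]
deduped$ensgene
```

On the real joined table the same idea in dplyr would be filter(chr %in% primary_chr). Check multi first, though, since a duplicate is not always a patch scaffold.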

sessionInfo

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.5.0 (2018-04-23)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2018-07-16
## Packages -----------------------------------------------------------------
##  package       * version   date      
##  annotables    * 0.1.91    2018-06-18
##  AnnotationDbi   1.42.1    2018-05-08
##  assertthat      0.2.0     2017-04-11
##  backports       1.1.2     2017-12-13
##  base          * 3.5.0     2018-04-24
##  bindr           0.1.1     2018-03-13
##  bindrcpp        0.2.2     2018-03-29
##  Biobase         2.40.0    2018-05-01
##  BiocGenerics    0.26.0    2018-05-01
##  biomaRt         2.36.1    2018-05-24
##  bit             1.1-13    2018-05-15
##  bit64           0.9-7     2017-05-08
##  bitops          1.0-6     2013-08-17
##  blob            1.1.1     2018-03-25
##  blogdown        0.8.1     2018-07-16
##  bookdown        0.7       2018-02-18
##  broom           0.4.4     2018-03-29
##  cellranger      1.1.0     2016-07-27
##  cli             1.0.0     2017-11-05
##  colorspace      1.3-2     2016-12-14
##  compiler        3.5.0     2018-04-24
##  crayon          1.3.4     2017-09-16
##  curl            3.2       2018-03-28
##  datasets      * 3.5.0     2018-04-24
##  DBI             1.0.0     2018-05-02
##  devtools        1.13.5    2018-02-18
##  digest          0.6.15    2018-01-28
##  dplyr         * 0.7.6     2018-06-29
##  evaluate        0.10.1    2017-06-24
##  forcats       * 0.3.0     2018-02-19
##  foreign         0.8-70    2017-11-28
##  ggplot2       * 3.0.0     2018-07-03
##  glue            1.2.0     2017-10-29
##  graphics      * 3.5.0     2018-04-24
##  grDevices     * 3.5.0     2018-04-24
##  grid            3.5.0     2018-04-24
##  gtable          0.2.0     2016-02-26
##  haven           1.1.1     2018-01-18
##  hms             0.4.2     2018-03-10
##  htmltools       0.3.6     2017-04-28
##  httr            1.3.1     2017-08-20
##  IRanges         2.14.10   2018-05-16
##  jsonlite        1.5       2017-06-01
##  knitr           1.20      2018-02-20
##  lattice         0.20-35   2017-03-25
##  lazyeval        0.2.1     2017-10-29
##  lubridate       1.7.4     2018-04-11
##  magrittr        1.5       2014-11-22
##  memoise         1.1.0     2017-04-21
##  methods       * 3.5.0     2018-04-24
##  mnormt          1.5-5     2016-10-15
##  modelr          0.1.2     2018-05-11
##  munsell         0.4.3     2016-02-13
##  nlme            3.1-137   2018-04-07
##  parallel        3.5.0     2018-04-24
##  pillar          1.2.3     2018-05-25
##  pkgconfig       2.0.1     2017-03-21
##  plyr            1.8.4     2016-06-08
##  prettyunits     1.0.2     2015-07-13
##  progress        1.1.2     2016-12-14
##  psych           1.8.4     2018-05-06
##  purrr         * 0.2.4     2017-10-18
##  R6              2.2.2     2017-06-17
##  Rcpp            0.12.17   2018-05-18
##  RCurl           1.95-4.10 2018-01-04
##  readr         * 1.1.1     2017-05-16
##  readxl          1.1.0     2018-04-20
##  reshape2        1.4.3     2017-12-11
##  rlang           0.2.1     2018-05-30
##  rmarkdown       1.10      2018-06-11
##  rprojroot       1.3-2     2018-01-03
##  RSQLite         2.1.1     2018-05-06
##  rstudioapi      0.7       2017-09-07
##  rvest           0.3.2     2016-06-17
##  S4Vectors       0.18.2    2018-05-16
##  scales          0.5.0     2017-08-24
##  stats         * 3.5.0     2018-04-24
##  stats4          3.5.0     2018-04-24
##  stringi         1.2.2     2018-05-02
##  stringr       * 1.3.1     2018-05-10
##  tibble        * 1.4.2     2018-01-22
##  tidyr         * 0.8.1     2018-05-18
##  tidyselect      0.2.4     2018-02-26
##  tidyverse     * 1.2.1     2017-11-14
##  tools           3.5.0     2018-04-24
##  utf8            1.1.4     2018-05-24
##  utils         * 3.5.0     2018-04-24
##  withr           2.1.2     2018-03-15
##  xfun            0.3       2018-07-06
##  XML             3.98-1.11 2018-04-16
##  xml2            1.2.0     2018-01-24
##  yaml            2.1.19    2018-05-01
##  source                                   
##  Github (stephenturner/annotables@958545a)
##  Bioconductor                             
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  local                                    
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  Bioconductor                             
##  Bioconductor                             
##  Bioconductor                             
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  Github (rstudio/blogdown@d54c39a)        
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  local                                    
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  local                                    
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  cran (@0.7.6)                            
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  cran (@3.0.0)                            
##  CRAN (R 3.5.0)                           
##  local                                    
##  local                                    
##  local                                    
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  Bioconductor                             
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  local                                    
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  local                                    
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  cran (@1.10)                             
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  Bioconductor                             
##  CRAN (R 3.5.0)                           
##  local                                    
##  local                                    
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  local                                    
##  CRAN (R 3.5.0)                           
##  local                                    
##  CRAN (R 3.5.0)                           
##  cran (@0.3)                              
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)                           
##  CRAN (R 3.5.0)
