Annotables: R data package for annotating/converting Gene IDs

Stephen Turner

6 years ago

[This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I work with gene lists on a nearly daily basis. Lists of genes near ChIP-seq peaks, lists of genes closest to a GWAS hit, lists of differentially expressed genes or transcripts from an RNA-seq experiment, lists of genes involved in certain pathways, etc. And lots of times I’ll need to convert these gene IDs from one identifier to another. There’s no shortage of tools to do this. I use Ensembl Biomart. But I do this so often that I got tired of hammering Ensembl’s servers whenever I wanted to convert from Ensembl to Entrez gene IDs for pathway mapping, get the chromosomal location for some BEDTools-y kinds of genomic arithmetic, or get the gene symbol and full description for reporting. So I used Biomart to retrieve the data that I use most often, cleaned up the column names, and saved this data as an R data package called annotables.

This package has basic annotation information from Ensembl release 82 for:

Human (grch38)
Mouse (grcm38)
Rat (rnor6)
Chicken (galgal4)
Worm (wbcel235)
Fly (bdgp6)

Where each table contains:

ensgene: Ensembl gene ID
entrez: Entrez gene ID
symbol: Gene symbol
chr: Chromosome
start: Start
end: End
strand: Strand
biotype: Protein coding, pseudogene, mitochondrial tRNA, etc.
description: Full gene name/description.

Additionally, there are tables for human and mouse (grch38_gt and grcm38_gt, respectively) that link ensembl gene IDs to ensembl transcript IDs.

Usage

The package isn’t on CRAN, so you’ll need devtools to install it.

# If you haven't already installed devtools...
install.packages("devtools")

# Use devtools to install the package
devtools::install_github("stephenturner/annotables")

It isn’t necessary to load dplyr, but the tables are tbl_df and will print nicely if you have dplyr loaded.

library(dplyr)
library(annotables)

Look at the human genes table (note the description column gets cut off because the table becomes too wide to print nicely):

grch38

## Source: local data frame [66,531 x 9]
## 
##            ensgene entrez  symbol   chr start   end strand        biotype
##              (chr)  (int)   (chr) (chr) (int) (int)  (int)          (chr)
## 1  ENSG00000210049     NA   MT-TF    MT   577   647      1        Mt_tRNA
## 2  ENSG00000211459     NA MT-RNR1    MT   648  1601      1        Mt_rRNA
## 3  ENSG00000210077     NA   MT-TV    MT  1602  1670      1        Mt_tRNA
## 4  ENSG00000210082     NA MT-RNR2    MT  1671  3229      1        Mt_rRNA
## 5  ENSG00000209082     NA  MT-TL1    MT  3230  3304      1        Mt_tRNA
## 6  ENSG00000198888   4535  MT-ND1    MT  3307  4262      1 protein_coding
## 7  ENSG00000210100     NA   MT-TI    MT  4263  4331      1        Mt_tRNA
## 8  ENSG00000210107     NA   MT-TQ    MT  4329  4400     -1        Mt_tRNA
## 9  ENSG00000210112     NA   MT-TM    MT  4402  4469      1        Mt_tRNA
## 10 ENSG00000198763   4536  MT-ND2    MT  4470  5511      1 protein_coding
## ..             ...    ...     ...   ...   ...   ...    ...            ...
## Variables not shown: description (chr)

Look at the human genes-to-transcripts table:

grch38_gt

## Source: local data frame [216,133 x 2]
## 
##            ensgene          enstxp
##              (chr)           (chr)
## 1  ENSG00000210049 ENST00000387314
## 2  ENSG00000211459 ENST00000389680
## 3  ENSG00000210077 ENST00000387342
## 4  ENSG00000210082 ENST00000387347
## 5  ENSG00000209082 ENST00000386347
## 6  ENSG00000198888 ENST00000361390
## 7  ENSG00000210100 ENST00000387365
## 8  ENSG00000210107 ENST00000387372
## 9  ENSG00000210112 ENST00000387377
## 10 ENSG00000198763 ENST00000361453
## ..             ...             ...

Tables are tbl_df, pipe-able with dplyr:

grch38 %>% 
  filter(biotype=="protein_coding" & chr=="1") %>% 
  select(ensgene, symbol, chr, start, end, description) %>% 
  head %>% 
  pander::pandoc.table(split.table=100, justify="llllll", style="rmarkdown")

ensgene	symbol	chr	start	end
ENSG00000158014	SLC30A2	1	26037252	26046133
ENSG00000173673	HES3	1	6244192	6245578
ENSG00000243749	ZMYM6NB	1	34981535	34985353
ENSG00000189410	SH2D5	1	20719732	20732837
ENSG00000116863	ADPRHL2	1	36088875	36093932
ENSG00000188643	S100A16	1	153606886	153613145

Table: Table continues below

description
solute carrier family 30 (zinc transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:11013]
hes family bHLH transcription factor 3 [Source:HGNC Symbol;Acc:HGNC:26226]
ZMYM6 neighbor [Source:HGNC Symbol;Acc:HGNC:40021]
SH2 domain containing 5 [Source:HGNC Symbol;Acc:HGNC:28819]
ADP-ribosylhydrolase like 2 [Source:HGNC Symbol;Acc:HGNC:21304]
S100 calcium binding protein A16 [Source:HGNC Symbol;Acc:HGNC:20441]

Example with RNA-seq data

Here’s an example with RNA-seq data. Specifically, DESeq2 results from the airway package, made tidy with biobroom:

# Load libraries (install with Bioconductor if you don't have them)
library(DESeq2)
library(airway)

# Load the data and do the RNA-seq data analysis
data(airway)
airway = DESeqDataSet(airway, design = ~cell + dex)
airway = DESeq(airway)
res = results(airway)

# tidy results with biobroom
library(biobroom)
res_tidy = tidy.DESeqResults(res)
head(res_tidy)

## Source: local data frame [6 x 7]
## 
##              gene    baseMean    estimate   stderror  statistic
##             (chr)       (dbl)       (dbl)      (dbl)      (dbl)
## 1 ENSG00000000003 708.6021697  0.37424998 0.09873107  3.7906000
## 2 ENSG00000000005   0.0000000          NA         NA         NA
## 3 ENSG00000000419 520.2979006 -0.20215550 0.10929899 -1.8495642
## 4 ENSG00000000457 237.1630368 -0.03624826 0.13684258 -0.2648902
## 5 ENSG00000000460  57.9326331  0.08523370 0.24654400  0.3457140
## 6 ENSG00000000938   0.3180984  0.11555962 0.14630523  0.7898530
## Variables not shown: p.value (dbl), p.adjusted (dbl)

Now, make a table with the results (unfortunately, it’ll be split in this display, but you can write this to file to see all the columns in a single row):

res_tidy %>% 
  arrange(p.adjusted) %>% 
  head(20) %>% 
  inner_join(grch38, by=c("gene"="ensgene")) %>% 
  select(gene, estimate, p.adjusted, symbol, description) %>% 
  pander::pandoc.table(split.table=100, justify="lrrll", style="rmarkdown")

gene	estimate	p.adjusted	symbol
ENSG00000152583	-4.316	4.753e-134	SPARCL1
ENSG00000165995	-3.189	1.44e-133	CACNB2
ENSG00000101347	-3.618	6.619e-125	SAMHD1
ENSG00000120129	-2.871	6.619e-125	DUSP1
ENSG00000189221	-3.231	9.468e-119	MAOA
ENSG00000211445	-3.553	3.94e-107	GPX3
ENSG00000157214	-1.949	8.74e-102	STEAP2
ENSG00000162614	-2.003	3.052e-98	NEXN
ENSG00000125148	-2.167	1.783e-92	MT2A
ENSG00000154734	-2.286	4.522e-86	ADAMTS1
ENSG00000139132	-2.181	2.501e-83	FGD4
ENSG00000162493	-1.858	4.215e-83	PDPN
ENSG00000162692	3.453	3.563e-82	VCAM1
ENSG00000179094	-3.044	1.199e-81	PER1
ENSG00000134243	-2.149	2.73e-81	SORT1
ENSG00000163884	-4.079	1.073e-80	KLF15
ENSG00000178695	2.446	6.275e-75	KCTD12
ENSG00000146250	2.64	1.143e-69	PRSS35
ENSG00000198624	-2.784	1.707e-69	CCDC69
ENSG00000148848	1.783	1.762e-69	ADAM12

Table: Table continues below

description
SPARC-like 1 (hevin) [Source:HGNC Symbol;Acc:HGNC:11220]
calcium channel, voltage-dependent, beta 2 subunit [Source:HGNC Symbol;Acc:HGNC:1402]
SAM domain and HD domain 1 [Source:HGNC Symbol;Acc:HGNC:15925]
dual specificity phosphatase 1 [Source:HGNC Symbol;Acc:HGNC:3064]
monoamine oxidase A [Source:HGNC Symbol;Acc:HGNC:6833]
glutathione peroxidase 3 [Source:HGNC Symbol;Acc:HGNC:4555]
STEAP family member 2, metalloreductase [Source:HGNC Symbol;Acc:HGNC:17885]
nexilin (F actin binding protein) [Source:HGNC Symbol;Acc:HGNC:29557]
metallothionein 2A [Source:HGNC Symbol;Acc:HGNC:7406]
ADAM metallopeptidase with thrombospondin type 1 motif, 1 [Source:HGNC Symbol;Acc:HGNC:217]
FYVE, RhoGEF and PH domain containing 4 [Source:HGNC Symbol;Acc:HGNC:19125]
podoplanin [Source:HGNC Symbol;Acc:HGNC:29602]
vascular cell adhesion molecule 1 [Source:HGNC Symbol;Acc:HGNC:12663]
period circadian clock 1 [Source:HGNC Symbol;Acc:HGNC:8845]
sortilin 1 [Source:HGNC Symbol;Acc:HGNC:11186]
Kruppel-like factor 15 [Source:HGNC Symbol;Acc:HGNC:14536]
potassium channel tetramerization domain containing 12 [Source:HGNC Symbol;Acc:HGNC:14678]
protease, serine, 35 [Source:HGNC Symbol;Acc:HGNC:21387]
coiled-coil domain containing 69 [Source:HGNC Symbol;Acc:HGNC:24487]
ADAM metallopeptidase domain 12 [Source:HGNC Symbol;Acc:HGNC:190]

Explore!

This data can also be used for toying around with dplyr verbs and generally getting a sense of what’s in here. First, tet some help.

ls("package:annotables")
?grch38

Let’s join the transcript table to the gene table.

gt = grch38_gt %>% 
  inner_join(grch38, by="ensgene")

Now, let’s filter to get only protein-coding genes, group by the ensembl gene ID, summarize to count how many transcripts are in each gene, inner join that result back to the original gene list, so we can select out only the gene, number of transcripts, symbol, and description, mutate the description column so that it isn’t so wide that it’ll break the display, arrange the returned data descending by the number of transcripts per gene, head to get the top 10 results, and optionally, pipe that to further utilities to output a nice HTML table.

gt %>% 
  filter(biotype=="protein_coding") %>% 
  group_by(ensgene) %>% 
  summarize(ntxps=n_distinct(enstxp)) %>% 
  inner_join(grch38, by="ensgene") %>% 
  select(ensgene, ntxps, symbol, description) %>% 
  mutate(description=substr(description, 1, 20)) %>% 
  arrange(desc(ntxps)) %>% 
  head(10) %>% 
  pander::pandoc.table(split.table=100, justify="lrll", style="rmarkdown")

ensgene	ntxps	symbol	description
ENSG00000165795	77	NDRG2	NDRG family member 2
ENSG00000205336	77	ADGRG1	adhesion G protein-c
ENSG00000196628	75	TCF4	transcription factor
ENSG00000161249	68	DMKN	dermokine [Source:HG
ENSG00000154556	64	SORBS2	sorbin and SH3 domai
ENSG00000166444	62	ST5	suppression of tumor
ENSG00000204580	58	DDR1	discoidin domain rec
ENSG00000087460	57	GNAS	GNAS complex locus [
ENSG00000169398	57	PTK2	protein tyrosine kin
ENSG00000104529	56	EEF1D	eukaryotic translati

Let’s look up DMKN (dermkine) in Ensembl. Search Ensembl for ENSG00000161249, or use this direct link. You can browse the table or graphic to see the splicing complexity in this gene.

Or, let’s do something different. Let’s group the data by what type of gene it is (e.g., protein coding, pseudogene, etc), get the number of genes in each category, and plot the top 20.

library(ggplot2)
grch38 %>% 
  group_by(biotype) %>% 
  summarize(n=n_distinct(ensgene)) %>% 
  arrange(desc(n)) %>% 
  head(20) %>% 
  ggplot(aes(reorder(biotype, n), n)) + 
  geom_bar(stat="identity") + 
  xlab("Type") + 
  theme_bw() + 
  coord_flip()

Annotables: R data package for annotating/converting Gene IDs

Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.