BioMart (and biomaRt)

nsaunders

12 years ago

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers].

I’ve been vaguely aware of BioMart for a few years. Inexplicably, I’ve only recently started to use it. It’s one of the most useful applications I’ve ever used.

The concept is simple. You have a set of identifiers that describe biological objects, such as genes. The type of identifier you search by is called a filter, and the identifiers themselves are its values: for example, the filter might be "HGNC symbol" and the values a list of gene symbols. You want to retrieve other identifiers, called attributes, for your objects.
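
To make the terminology concrete, here is a minimal sketch in base R that mimics the idea on a small hand-made table. It does not contact BioMart: the data frame `toy_mart` and the helper `toy_query()` are hypothetical names invented for illustration, and the rows are copied from the sample results shown later in Example 1.

```r
# A hand-made stand-in for a BioMart dataset (NOT a real query)
toy_mart <- data.frame(
  ensembl_gene_id = c("ENSG00000082397", "ENSG00000168461", "ENSG00000176014"),
  hgnc_symbol     = c("EPB41L3", "RAB31", "TUBB6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where the filter column matches one of the
# values, then return only the requested attribute columns
toy_query <- function(table, attributes, filters, values) {
  keep <- table[[filters]] %in% values
  table[keep, attributes, drop = FALSE]
}

res <- toy_query(toy_mart,
                 attributes = c("ensembl_gene_id", "hgnc_symbol"),
                 filters    = "hgnc_symbol",
                 values     = c("RAB31", "TUBB6"))
print(res)
```

The argument names deliberately mirror those of getBM() in the examples that follow, so the real calls should read the same way.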

You can use BioMart as a web application called MartView. However, R users should check out the biomaRt package, part of the Bioconductor suite. Here are a couple of examples.

Example 1: fetch Ensembl gene identifiers given HGNC symbols
Let’s start with a simple example. You have a CSV file in which one of the fields is an HGNC symbol (with the column header “hgnc”) and you want to obtain Ensembl gene IDs.

library(biomaRt)
# define biomart object
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
# read in the file
genes <- read.csv("myfile.csv")
# query biomart
results <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "hgnc_symbol", values = genes$hgnc, mart = mart)
# sample results
  ensembl_gene_id hgnc_symbol
1 ENSG00000082397     EPB41L3
2 ENSG00000168461       RAB31
3 ENSG00000176014       TUBB6
4 ENSG00000154734     ADAMTS1
5 ENSG00000197766         CFD
6 ENSG00000156284       CLDN8

You do need to know in advance that “ensembl_gene_id” and “hgnc_symbol” are valid attributes. You can get a list of all attributes for the current biomart object using “listAttributes(mart)”.
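
The attribute list for a dataset can be long, so it may be easier to search it than to read it. The sketch below assumes that listAttributes() returns a data frame with “name” and “description” columns. The helper `find_attributes()` is a hypothetical name, and the small `attrs` table is hand-made (its descriptions are invented placeholders) so that the example runs without a network connection.

```r
# Hypothetical helper: search an attribute table by regular expression.
# 'attrs' is expected to look like the output of listAttributes(mart),
# i.e. a data frame with columns 'name' and 'description' (an assumption).
find_attributes <- function(attrs, pattern) {
  hit <- grepl(pattern, attrs$name, ignore.case = TRUE) |
         grepl(pattern, attrs$description, ignore.case = TRUE)
  attrs[hit, , drop = FALSE]
}

# Hand-made stand-in so the example runs offline; with a live connection
# you would instead use: attrs <- listAttributes(mart)
attrs <- data.frame(
  name        = c("ensembl_gene_id", "hgnc_symbol", "affy_huex_1_0_st_v2"),
  description = c("Ensembl gene ID", "HGNC symbol", "Affymetrix exon array probeset"),
  stringsAsFactors = FALSE
)

hgnc_attrs <- find_attributes(attrs, "hgnc")
print(hgnc_attrs)
```

The same approach should work for filters, using the output of listFilters(mart) in place of listAttributes(mart).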

Example 2: fetch genes for microarray probesets
In this example, I assume that you have normalised some microarray samples using, for example, RMA in the affy package and used a method such as exprs() to generate a matrix of RMA values, where rows = probeset IDs and columns = sample names. We’d like to get the gene names for those probesets.

library(simpleaffy)
library(biomaRt)
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
# assume that we are using the human exon array from Affymetrix
# read in .CEL files and RMA normalise
data <- read.affy()
data@cdfName <- "exon.pmcdf"
data.rma <- rma(data)
data.ex <- as.data.frame(exprs(data.rma))
# The attribute for exon array probesets is named "affy_huex_1_0_st_v2"
affy <- "affy_huex_1_0_st_v2"
# Next line would take a very long time for all exon probesets!
# We would probably select a subset of data.ex first
genes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", affy), filters = affy, values=c(rownames(data.ex)), mart = mart)
# Now match the array data probesets with the genes data frame
m <- match(rownames(data.ex), genes$affy_huex_1_0_st_v2)
# And append e.g. the HGNC symbol to the array data frame
data.ex$hgnc <- genes[m, "hgnc_symbol"]
# sample result
            Con1     Con2   Treat1   Treat2   hgnc
2315603 7.164521 7.107470 7.827158 7.307056 TTLL10
2315610 6.135751 6.259306 6.691880 6.532974 TTLL10
2315614 3.017279 4.602484 5.058326 5.349798 TTLL10
2315647 5.740181 5.373581 5.885912 5.756925   <NA>
2315691 6.389818 5.562760 6.853058 6.430730 SCNN1D
2315713 5.494848 6.243931 6.550043 6.336244 SCNN1D
2315720 6.422661 6.213908 6.447777 6.591330 SCNN1D
2315736 5.882034 6.250097 6.292414 6.311813   <NA>
2315741 5.314087 5.471424 5.762590 5.896435  PUSL1
2315768 2.278067 1.652001 2.430359 2.310668   <NA>
2315787 2.308838 1.912613 2.660703 2.377608 TAS1R3
2315793 4.339545 4.505362 4.974307 4.959468 TAS1R3
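
One thing to keep in mind about the match() step: it returns the position of the first match only, and NA where there is none, which is one way the <NA> entries above can arise. The sketch below shows this on a hand-made `genes` table rather than a real getBM() result. The probeset IDs are taken from the sample result, but the second symbol for 2315603 is invented (labelled FAKE1) purely to show what happens when a probeset maps to more than one gene.

```r
# Toy probeset IDs (taken from the sample result above)
probesets <- c("2315603", "2315647", "2315691")

# Hand-made stand-in for the getBM() result. "FAKE1" is an invented symbol,
# added only to illustrate a probeset that maps to two genes.
genes <- data.frame(
  affy_huex_1_0_st_v2 = c("2315603", "2315603", "2315691"),
  hgnc_symbol         = c("TTLL10", "FAKE1", "SCNN1D"),
  stringsAsFactors = FALSE
)

# match() gives the index of the FIRST hit per probeset, NA if none
m <- match(probesets, genes$affy_huex_1_0_st_v2)
first_symbol <- genes[m, "hgnc_symbol"]
print(first_symbol)

# Alternative: collapse all symbols per probeset so nothing is silently dropped
all_symbols <- sapply(split(genes$hgnc_symbol, genes$affy_huex_1_0_st_v2),
                      paste, collapse = ";")
print(all_symbols[probesets])
```

Whether taking the first match is acceptable depends on the analysis. Collapsing the symbols, as in the second half, is just one possible way to keep multi-gene probesets visible.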

Summary
That’s your basic usage of biomaRt. In the next post: how to combine biomaRt with GenomeGraphs, to generate attractive plots of features and quantitative data in genomic context.