I’ve been vaguely aware of BioMart for a few years. Inexplicably, I’ve only recently started to use it. It’s one of the most useful applications I’ve ever used.
The concept is simple. You have a set of identifiers that describe a biological object, such as a gene. The type of identifier you search on is called a filter (for example, HGNC symbol) and the identifiers themselves are its values. You want to retrieve other identifiers, called attributes, for the same objects.
You can use BioMart as a web application called MartView. However, R users should check out the biomaRt package, part of the Bioconductor suite. Here are a couple of examples.
Example 1: fetch Ensembl gene identifiers given HGNC symbols
Let’s start with a simple example. You have a CSV file in which one of the fields is an HGNC symbol (with the column header “hgnc”) and you want to obtain Ensembl gene IDs.
library(biomaRt)

# define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# read in the file
genes <- read.csv("myfile.csv")

# query biomart
results <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                 filters = "hgnc_symbol",
                 values = genes$hgnc,
                 mart = mart)

# sample results
#   ensembl_gene_id hgnc_symbol
# 1 ENSG00000082397     EPB41L3
# 2 ENSG00000168461       RAB31
# 3 ENSG00000176014       TUBB6
# 4 ENSG00000154734     ADAMTS1
# 5 ENSG00000197766         CFD
# 6 ENSG00000156284       CLDN8
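You do need to know in advance that “ensembl_gene_id” and “hgnc_symbol” are valid attributes, and that “hgnc_symbol” is also a valid filter. You can get a list of all attributes for the current biomart object using “listAttributes(mart)”, and a list of all filters using “listFilters(mart)”.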
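Those tables are long, so it can help to search them rather than read them top to bottom. Here is a minimal sketch, assuming the tables have “name” and “description” columns. The helper find_fields is my own hypothetical function, not part of biomaRt, and the small data frame is a hand-made stand-in so that the code runs without a network connection.

```r
# Hypothetical helper (not part of biomaRt): search a table shaped like
# the output of listAttributes() or listFilters() for a pattern.
find_fields <- function(tbl, pattern) {
  hit <- grepl(pattern, tbl$name, ignore.case = TRUE) |
         grepl(pattern, tbl$description, ignore.case = TRUE)
  tbl[hit, , drop = FALSE]
}

# Hand-made stand-in for listAttributes(mart); the descriptions are
# illustrative and not copied from a real mart.
attrs <- data.frame(
  name        = c("ensembl_gene_id", "hgnc_symbol", "affy_huex_1_0_st_v2"),
  description = c("Ensembl gene ID", "HGNC symbol", "Affymetrix exon array probeset"),
  stringsAsFactors = FALSE
)

hgnc_hits <- find_fields(attrs, "hgnc")
affy_hits <- find_fields(attrs, "affy")

# With a live connection the same helper could be applied as:
# find_fields(listAttributes(mart), "hgnc")
# find_fields(listFilters(mart), "affy")
```

Searching both columns is deliberate: attribute names are terse, and the description is often where the recognisable platform or database name appears.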
Example 2: fetch genes for microarray probesets
In this example, I assume that you have normalised some microarray samples using, for example, RMA in the affy package and used a method such as exprs() to generate a matrix of RMA values, where rows = probeset IDs and columns = sample names. We’d like to get the gene names for those probesets.
library(simpleaffy)
library(biomaRt)

mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# assume that we are using the human exon array from Affymetrix
# read in .CEL files and RMA normalise
data <- read.affy()
data@cdfName <- "exon.pmcdf"
data.rma <- rma(data)
data.ex <- as.data.frame(exprs(data.rma))

# The attribute for exon array probesets is named "affy_huex_1_0_st_v2"
affy <- "affy_huex_1_0_st_v2"

# Next line would take a very long time for all exon probesets!
# We would probably select a subset of data.ex first
genes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", affy),
               filters = affy,
               values = rownames(data.ex),
               mart = mart)

# Now match the array data probesets with the genes data frame
m <- match(rownames(data.ex), genes$affy_huex_1_0_st_v2)

# And append e.g. the HGNC symbol to the array data frame
data.ex$hgnc <- genes[m, "hgnc_symbol"]

# sample result
#             Con1     Con2   Treat1   Treat2   hgnc
# 2315603 7.164521 7.107470 7.827158 7.307056 TTLL10
# 2315610 6.135751 6.259306 6.691880 6.532974 TTLL10
# 2315614 3.017279 4.602484 5.058326 5.349798 TTLL10
# 2315647 5.740181 5.373581 5.885912 5.756925   <NA>
# 2315691 6.389818 5.562760 6.853058 6.430730 SCNN1D
# 2315713 5.494848 6.243931 6.550043 6.336244 SCNN1D
# 2315720 6.422661 6.213908 6.447777 6.591330 SCNN1D
# 2315736 5.882034 6.250097 6.292414 6.311813   <NA>
# 2315741 5.314087 5.471424 5.762590 5.896435  PUSL1
# 2315768 2.278067 1.652001 2.430359 2.310668   <NA>
# 2315787 2.308838 1.912613 2.660703 2.377608 TAS1R3
# 2315793 4.339545 4.505362 4.974307 4.959468 TAS1R3
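One thing worth knowing about the match() step: it returns only the first matching row for each probeset and NA where there is no match, which is where the <NA> entries above come from. If a query returns more than one row for a probeset, the extra rows are silently ignored. The toy example below is a sketch with made-up values. The column name “probeset” and the symbol “GENE_B” are placeholders of mine, standing in for a real getBM() result.

```r
# Toy expression data: rows are probeset IDs, values are made up.
data.ex <- data.frame(
  Con1   = c(7.1, 5.7, 6.4),
  Treat1 = c(7.8, 5.9, 6.9),
  row.names = c("2315603", "2315647", "2315691")
)

# Toy stand-in for a getBM() result. Probeset 2315691 appears twice
# ("GENE_B" is a placeholder symbol) and 2315647 has no annotation.
genes <- data.frame(
  probeset    = c("2315603", "2315691", "2315691"),
  hgnc_symbol = c("TTLL10", "SCNN1D", "GENE_B"),
  stringsAsFactors = FALSE
)

# match() gives the position of the first hit, or NA when there is none
m <- match(rownames(data.ex), genes$probeset)
data.ex$hgnc <- genes[m, "hgnc_symbol"]

# Probesets that map to more than one annotation row
multi <- unique(genes$probeset[duplicated(genes$probeset)])
```

Checking something like multi before appending a column makes it obvious when first-match behaviour is discarding annotations, in which case merge() or collapsing the symbols per probeset may be a better fit.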
Summary
That’s your basic usage of biomaRt. In the next post: how to combine biomaRt with GenomeGraphs, to generate attractive plots of features and quantitative data in genomic context.