Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
One of the commonest bioinformatics questions, at Biostars and elsewhere, takes the form: “I have a list of identifiers (X); I want to relate them to a second set of identifiers (Y)”. HGNC gene symbols to Ensembl Gene IDs, for example.
When this occurs I have been known to tweet “the answer is BioMart” (there are often other solutions too) and I’ve written a couple of blog posts about the R package biomaRt in the past. However, I’ve realised that we need to take a step back and ask some basic questions that new users might have. How do I find what marts and datasets are available? How do I know what attributes and filters to use? How do I specify different genome build versions?
1. What marts and datasets are available?
The function listMarts()
returns a data frame with mart names and descriptions.
library(biomaRt) marts <- listMarts() head(marts) biomart version 1 ensembl ENSEMBL GENES 79 (SANGER UK) 2 snp ENSEMBL VARIATION 79 (SANGER UK) 3 regulation ENSEMBL REGULATION 79 (SANGER UK) 4 vega VEGA 59 (SANGER UK) 5 fungi_mart_26 ENSEMBL FUNGI 26 (EBI UK) 6 fungi_variations_26 ENSEMBL FUNGI VARIATION 26 (EBI UK)
The first 4 of those are what you would see if you visited Ensembl BioMart on the Web and clicked the “choose database” drop-down box.
Given a mart object, listDatasets()
returns the available datasets. We get a mart object using useMart()
.
datasets <- listDatasets(useMart("ensembl")) head(datasets) dataset description version 1 oanatinus_gene_ensembl Ornithorhynchus anatinus genes (OANA5) OANA5 2 cporcellus_gene_ensembl Cavia porcellus genes (cavPor3) cavPor3 3 gaculeatus_gene_ensembl Gasterosteus aculeatus genes (BROADS1) BROADS1 4 lafricana_gene_ensembl Loxodonta africana genes (loxAfr3) loxAfr3 5 itridecemlineatus_gene_ensembl Ictidomys tridecemlineatus genes (spetri2) spetri2 6 choffmanni_gene_ensembl Choloepus hoffmanni genes (choHof1) choHof1
2. How to specify a different genome build version?
Assuming that you want human genes in the most recent Ensembl build, you supply mart and dataset information from the previous step to useMart()
like this:
mart.hs <- useMart("ensembl", "hsapiens_gene_ensembl")
What if you want an older genome build? First, visit the Ensembl Archives page for more information. Next, use the URL information there to supply a host = argument. For example, to see what marts are available for Ensembl version 72 (genome build GRCh37.p11):
listMarts(host = "jun2013.archive.ensembl.org") biomart version 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 72 2 ENSEMBL_MART_SNP Ensembl Variation 72 3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 72 4 ENSEMBL_MART_VEGA Vega 52 5 pride PRIDE (EBI UK)
Available datasets can be found as before with the addition of the host = argument to useMart()
. To create the version 72 mart object:
mart72.hs <- useMart("ENSEMBL_MART_ENSEMBL", "hsapiens_gene_ensembl", host = "jun2013.archive.ensembl.org")
3. What filters and attributes can I use?
Let’s review the BioMart terminology.
Attributes are the identifiers that you want to retrieve. For example HGNC gene ID, chromosome name, Ensembl transcript ID.
Filters are the identifiers that you supply in a query. Some but not all of the filter names may be the same as the attribute names.
Values are the filter identifiers themselves. For example the values of the filter “HGNC symbol” could be 3 genes “TP53″, “SRY” and “KIAA1199″.
But how do you know what attributes and filters are available? If you guessed listAttributes()
and listFilters()
, you’d be right.
attributes <- listAttributes(mart.hs) head(attributes) name description 1 ensembl_gene_id Ensembl Gene ID 2 ensembl_transcript_id Ensembl Transcript ID 3 ensembl_peptide_id Ensembl Protein ID 4 ensembl_exon_id Ensembl Exon ID 5 description Description 6 chromosome_name Chromosome Name filters <- listFilters(mart.hs) head(filters) name description 1 chromosome_name Chromosome name 2 start Gene Start (bp) 3 end Gene End (bp) 4 band_start Band Start 5 band_end Band End 6 marker_start Marker Start
You can search for specific attributes by running grep()
on the name. For example, if you’re looking for Affymetrix microarray probeset IDs:
attributes[grep("affy", attributes$name),] name description 102 affy_hc_g110 Affy HC G110 probeset 103 affy_hg_focus Affy HG FOCUS probeset 104 affy_hg_u133_plus_2 Affy HG U133-PLUS-2 probeset 105 affy_hg_u133a_2 Affy HG U133A_2 probeset 106 affy_hg_u133a Affy HG U133A probeset 107 affy_hg_u133b Affy HG U133B probeset 108 affy_hg_u95av2 Affy HG U95AV2 probeset 109 affy_hg_u95b Affy HG U95B probeset 110 affy_hg_u95c Affy HG U95C probeset 111 affy_hg_u95d Affy HG U95D probeset 112 affy_hg_u95e Affy HG U95E probeset 113 affy_hg_u95a Affy HG U95A probeset 114 affy_hugenefl Affy HuGene FL probeset 115 affy_primeview Affy primeview 116 affy_u133_x3p Affy U133 X3P probeset
4. A worked example
As an example, let’s find SNPs on the Y chromosome which have starts that are located within exons. First, we create two mart objects for the genes and SNP datasets. We query the first mart for exons on chromosome Y and the second for SNPs on the same chromosome. Finally we’ll employ the incredibly-useful findOverlaps()
function from the GenomicRanges package to get the exonic SNPs.
There are various ways to run BioMart queries; I find the simplest is to use getBM()
.
library(biomaRt) library(GenomicRanges) mart.hs <- useMart("ensembl", "hsapiens_gene_ensembl") mart.hs_snp <- useMart("snp", "hsapiens_snp") exons <- getBM(attributes = c("ensembl_gene_id", "ensembl_exon_id", "chromosome_name", "exon_chrom_start", "exon_chrom_end", "strand"), filters = "chromosome_name", values = "Y", mart = mart.hs) snps <- getBM(attributes = c("refsnp_id", "chr_name", "chrom_start"), filters = "chr_name", values = "Y", mart = mart.hs_snp) gr.exons <- GRanges(seqnames = "chrY", ranges = IRanges(start = exons$exon_chrom_start, end = exons$exon_chrom_end, names = exons$ensembl_exon_id)) gr.snps <- GRanges(seqnames = "chrY", ranges = IRanges(start = snps$chrom_start, end = snps$chrom_start, names = snps$refsnp_id)) ov <- findOverlaps(gr.snps, gr.exons, type = "within") # ov is a matrix with 2 columns # col 1 = row numbers for query hits (snps); col 2 = row numbers for subject hits (exons) exons_snps <- cbind(exons[ov@subjectHits, ], snps[ov@queryHits, ]) head(exons_snps) ensembl_gene_id ensembl_exon_id chromosome_name exon_chrom_start exon_chrom_end strand refsnp_id chr_name chrom_start 1967 ENSG00000231341 ENSE00001732601 Y 5207215 5208069 -1 rs4913 Y 5207285 1792 ENSG00000226092 ENSE00001743398 Y 21927521 21927819 -1 rs3802 Y 21927601 588 ENSG00000224657 ENSE00001783563 Y 22658581 22658878 1 rs3802 Y 22658799 2231 ENSG00000215540 ENSE00001693911 Y 23537308 23537606 -1 rs3802 Y 23537388 74 ENSG00000215507 ENSE00001733694 Y 26133010 26133308 1 rs3802 Y 26133228 275 ENSG00000215414 ENSE00001542224 Y 13286638 13287378 1 rs15434 Y 13287334
Filed under: bioinformatics, programming, R, statistics Tagged: biomart, how to, tutorial
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.