Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The biomaRt package allows to query the gigantic ensembl data base directly from R. Since it happens only once in a while, I find myself reading the biomaRt vignette every time I’m using it. Here are the crucial points for the most common application.
ensembl <- useMart( "ensembl", dataset="hsapiens_gene_ensembl" ) f <- listFilters( ensembl ) a <- listAttributes( ensembl )
The data frames f and a hold the available filters and attributes, respectively. To retrieve information, we need to specify by what key we are searching (filters), what keys do we want to retrieve from the database (attributes), what are the values of our search keys (values) and from which mart we want to retrieve the information.
g <- c( "CCL2", "DCN", "LIF", "PLAU", "IL6" ) d <- getBM( attributes= c( "entrezgene", "description", "uniprot_genename" ), filters= "hgnc_symbol", values= g, mart= ensembl )
This returns the following:
entrezgene description uniprot_genename 1 6347 chemokine (C-C motif) ligand 2 [Source:HGNC Symbol;Acc:10618] CCL2 2 1634 decorin [Source:HGNC Symbol;Acc:2705] DCN 3 3569 interleukin 6 (interferon, beta 2) [Source:HGNC Symbol;Acc:6018] IL6 4 3976 leukemia inhibitory factor [Source:HGNC Symbol;Acc:6596] LIF 5 5328 plasminogen activator, urokinase [Source:HGNC Symbol;Acc:9052] PLAU
OK, one problem that presents itself is that if there are multiple matching values for a given key, you are getting multiple lines for that key. It may be useful to aggregate this information.
g <- c( "BTC" ) d <- getBM( attributes= c( "uniprot_genename", "description", "ensembl_gene_id" ), filters= "hgnc_symbol", values= g, mart= ensembl )
will return two lines:
uniprot_genename description ensembl_gene_id 1 BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530 2 BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000174808
We can use aggregate to collapse the ensembl IDs:
ff <- function( x ) { x <- x[ !is.na( x ) & x != "" ] ; paste( unique( x ), collapse= ";" ) } d <- aggregate( d, by= list( d$uniprot_genename ), FUN=ff )[,-1]
The ff function ignores empty strings and NAs. Also, we remove the first column as it is added by the aggregate function. The result:
Group.1 uniprot_genename description ensembl_gene_id 1 BTC BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530;ENSG00000174808
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.