Annotating limma Results with Gene Names for Affy Microarrays
[This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Lately I’ve been using the limma package often for analyzing microarray data. When I read in Affy CEL files using ReadAffy(), the resulting ExpressionSet won’t contain any featureData annotation. Consequentially, when I run topTable to get a list of differentially expressed genes, there’s no annotation information other than the Affymetrix probeset IDs or transcript cluster IDs. There are other ways of annotating these results (INNER JOIN to a MySQL database, biomaRt, etc), but I would like to have the output from topTable already annotated with gene information. Ideally, I could annotate each probeset ID with a gene symbol, gene name, Ensembl ID, and have that Ensembl ID hyperlink out to the Ensembl genome browser. With some help from Gordon Smyth on the Bioconductor Mailing list, I found that annotating the ExpressionSet object results in the output from topTable also being annotated.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The results from topTable are pretty uninformative without annotation:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
ID logFC AveExpr t P.Value adj.P.Val B | |
7902702 -6.8 7.8 -46 1.0e-09 3.3e-05 11.0 | |
8117594 -4.3 10.0 -33 9.6e-09 1.5e-04 10.0 | |
8168794 -3.6 6.5 -30 1.7e-08 1.9e-04 9.7 | |
8103736 -3.4 7.7 -28 3.1e-08 2.5e-04 9.3 | |
7897426 -4.0 7.2 -26 4.7e-08 3.1e-04 9.0 | |
8094278 -3.7 7.0 -24 7.4e-08 3.7e-04 8.7 |
After annotation:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
ID Symbol Name Ensembl logFC AveExpr t P.Value adj.P.Val B | |
7902702 CLCA2 chloride channel accessory 2 <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000137975'>ENSG00000137975</a> -6.832901 7.793344 -46.35621 1.015175e-09 3.281148e-05 11.177075 | |
8117594 HIST1H2BM histone cluster 1, H2bm <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000196374'>ENSG00000196374</a> -4.255756 10.096873 -33.21938 9.579122e-09 1.548034e-04 10.063255 | |
8168794 CENPI centromere protein I <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000102384'>ENSG00000102384</a> -3.612035 6.486830 -30.41465 1.733246e-08 1.867342e-04 9.698515 | |
8103736 SCRG1 stimulator of chondrogenesis 1 <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000164106'>ENSG00000164106</a> -3.433737 7.690753 -27.94797 3.058621e-08 2.471442e-04 9.321687 | |
7897426 CA6 carbonic anhydrase VI <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000131686'>ENSG00000131686</a> -4.048937 7.173867 -26.18729 4.732396e-08 3.059115e-04 9.014277 | |
8094278 NCAPG non-SMC condensin I complex, subunit G <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000109805'>ENSG00000109805</a> -3.658660 7.008561 -24.48718 7.419854e-08 3.743457e-04 8.681846 |
You can generate an HTML file with clickable links to the Ensembl Genome Browser for each gene:
Here’s the R code to do it:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Install the necessary bioconductor packages. Use the appropriate annotation.db file | |
# for your organism/platform. See http://www.bioconductor.org/packages/release/data/annotation/ | |
source("http://www.bioconductor.org/biocLite.R") | |
biocLite("affy") | |
biocLite("limma") | |
biocLite("annotate") | |
biocLite("hugene10sttranscriptcluster.db") | |
install.packages("R2HTML") | |
# Load required packages | |
library(affy) | |
library(limma) | |
library(annotate) | |
library(hugene10sttranscriptcluster.db) | |
library(R2HTML) | |
# RMA to convert Affybatch to ExpressionSet using robust multichip average | |
eset <- rma(myaffybatch) | |
class(eset) | |
# Which platform? | |
eset@annotation | |
# Which objects are in this annotation package? | |
ls("package:hugene10sttranscriptcluster.db") | |
# Get the transcript cluster IDs from the expressionset | |
ID <- featureNames(eset) | |
# Look up the Gene Symbol, name, and Ensembl Gene ID for each of those IDs | |
Symbol <- getSYMBOL(ID, "hugene10sttranscriptcluster.db") | |
Name <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME")) | |
Ensembl <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "ENSEMBL")) | |
# Note: looking up Ensembl gene IDs by Affy probe IDs can sometimes result in multiple | |
# gene IDs being returned. This is why the default output from lookUp is a list. | |
# I'm coercing this list into a character vector so it's the same size as my list of IDs, | |
# But this is discarding information where one probe ID maps to multiple gene IDs. | |
# If you're aware of a better solution, please leave it in the comments! Thanks. | |
# Wherever you have a missing Ensembl ID, make hyperlink missing also. | |
# If you have an Ensembl ID, create a hyperlink that goes to the | |
# Ensembl genome browser. Change the URL if you're not using human. | |
Ensembl <- ifelse(Ensembl=="NA", NA, | |
paste("<a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=", | |
Ensembl, "'>", Ensembl, "</a>", sep="")) | |
# Make a temporary data frame with all those identifiers | |
tmp <- data.frame(ID=ID, Symbol=Symbol, Name=Name, Ensembl=Ensembl, stringsAsFactors=F) | |
# The stringsAsFactors makes "NA" characters. This fixes that problem. | |
tmp[tmp=="NA"] <- NA | |
# set the featureData for your expressionset using the data frame you created above. | |
fData(eset) <- tmp | |
# Clean up | |
rm(ID, Symbol, Name, Ensembl, tmp) | |
# Fit the model | |
fit <- lmFit(eset,design) | |
fit <- eBayes(fit) | |
tt <- topTable(fit, coef="tissue") | |
# Write out an HTML file with clickable links to the Ensembl Genome Browser. | |
HTML(tt, "out.html", append=F) |
To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.