Annotating limma Results with Gene Names for Affy Microarrays

Posted on January 17, 2012 by Stephen Turner in Uncategorized | 0 Comments

[This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Lately I’ve been using the limma package often for analyzing microarray data. When I read in Affy CEL files using ReadAffy(), the resulting ExpressionSet won’t contain any featureData annotation. Consequentially, when I run topTable to get a list of differentially expressed genes, there’s no annotation information other than the Affymetrix probeset IDs or transcript cluster IDs. There are other ways of annotating these results (INNER JOIN to a MySQL database, biomaRt, etc), but I would like to have the output from topTable already annotated with gene information. Ideally, I could annotate each probeset ID with a gene symbol, gene name, Ensembl ID, and have that Ensembl ID hyperlink out to the Ensembl genome browser. With some help from Gordon Smyth on the Bioconductor Mailing list, I found that annotating the ExpressionSet object results in the output from topTable also being annotated.

The results from topTable are pretty uninformative without annotation:

	ID logFC AveExpr t P.Value adj.P.Val B
	7902702 -6.8 7.8 -46 1.0e-09 3.3e-05 11.0
	8117594 -4.3 10.0 -33 9.6e-09 1.5e-04 10.0
	8168794 -3.6 6.5 -30 1.7e-08 1.9e-04 9.7
	8103736 -3.4 7.7 -28 3.1e-08 2.5e-04 9.3
	7897426 -4.0 7.2 -26 4.7e-08 3.1e-04 9.0
	8094278 -3.7 7.0 -24 7.4e-08 3.7e-04 8.7

view raw noanno.txt hosted with ❤ by GitHub

After annotation:

	ID Symbol Name Ensembl logFC AveExpr t P.Value adj.P.Val B
	7902702 CLCA2 chloride channel accessory 2 <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000137975'>ENSG00000137975</a> -6.832901 7.793344 -46.35621 1.015175e-09 3.281148e-05 11.177075
	8117594 HIST1H2BM histone cluster 1, H2bm <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000196374'>ENSG00000196374</a> -4.255756 10.096873 -33.21938 9.579122e-09 1.548034e-04 10.063255
	8168794 CENPI centromere protein I <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000102384'>ENSG00000102384</a> -3.612035 6.486830 -30.41465 1.733246e-08 1.867342e-04 9.698515
	8103736 SCRG1 stimulator of chondrogenesis 1 <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000164106'>ENSG00000164106</a> -3.433737 7.690753 -27.94797 3.058621e-08 2.471442e-04 9.321687
	7897426 CA6 carbonic anhydrase VI <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000131686'>ENSG00000131686</a> -4.048937 7.173867 -26.18729 4.732396e-08 3.059115e-04 9.014277
	8094278 NCAPG non-SMC condensin I complex, subunit G <a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000109805'>ENSG00000109805</a> -3.658660 7.008561 -24.48718 7.419854e-08 3.743457e-04 8.681846

view raw afteranno.txt hosted with ❤ by GitHub

You can generate an HTML file with clickable links to the Ensembl Genome Browser for each gene:

Here’s the R code to do it:

	# Install the necessary bioconductor packages. Use the appropriate annotation.db file
	# for your organism/platform. See http://www.bioconductor.org/packages/release/data/annotation/
	source("http://www.bioconductor.org/biocLite.R")
	biocLite("affy")
	biocLite("limma")
	biocLite("annotate")
	biocLite("hugene10sttranscriptcluster.db")
	install.packages("R2HTML")

	# Load required packages
	library(affy)
	library(limma)
	library(annotate)
	library(hugene10sttranscriptcluster.db)
	library(R2HTML)

	# RMA to convert Affybatch to ExpressionSet using robust multichip average
	eset <- rma(myaffybatch)
	class(eset)

	# Which platform?
	eset@annotation

	# Which objects are in this annotation package?
	ls("package:hugene10sttranscriptcluster.db")

	# Get the transcript cluster IDs from the expressionset
	ID <- featureNames(eset)

	# Look up the Gene Symbol, name, and Ensembl Gene ID for each of those IDs
	Symbol <- getSYMBOL(ID, "hugene10sttranscriptcluster.db")
	Name <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME"))
	Ensembl <- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "ENSEMBL"))
	# Note: looking up Ensembl gene IDs by Affy probe IDs can sometimes result in multiple
	# gene IDs being returned. This is why the default output from lookUp is a list.
	# I'm coercing this list into a character vector so it's the same size as my list of IDs,
	# But this is discarding information where one probe ID maps to multiple gene IDs.
	# If you're aware of a better solution, please leave it in the comments! Thanks.

	# Wherever you have a missing Ensembl ID, make hyperlink missing also.
	# If you have an Ensembl ID, create a hyperlink that goes to the
	# Ensembl genome browser. Change the URL if you're not using human.
	Ensembl <- ifelse(Ensembl=="NA", NA,
	paste("<a href='http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=",
	Ensembl, "'>", Ensembl, "</a>", sep=""))

	# Make a temporary data frame with all those identifiers
	tmp <- data.frame(ID=ID, Symbol=Symbol, Name=Name, Ensembl=Ensembl, stringsAsFactors=F)

	# The stringsAsFactors makes "NA" characters. This fixes that problem.
	tmp[tmp=="NA"] <- NA

	# set the featureData for your expressionset using the data frame you created above.
	fData(eset) <- tmp

	# Clean up
	rm(ID, Symbol, Name, Ensembl, tmp)

	# Fit the model
	fit <- lmFit(eset,design)
	fit <- eBayes(fit)
	tt <- topTable(fit, coef="tissue")

	# Write out an HTML file with clickable links to the Ensembl Genome Browser.
	HTML(tt, "out.html", append=F)

view raw annotatelimma.r hosted with ❤ by GitHub

To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Annotating limma Results with Gene Names for Affy Microarrays

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)