Creating Annotated Data Frames from GEO with the GEOquery package
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Mining gene expression data from publicly available databases is a great way to find evidence to support you working hypothesis that gene X is relevant in condition Y. You may also want to mine publicly available data to build on an existing hypothesis or simply to find additional support for your favorite gene in a different animal model or experimental condition. In this post, we will go over how to use the GEOquery package to download a data matrix (or eset object) directly into R and append specific probe annotation information to this matrix for it to be exported as a csv file for easy manipulation in Excel or spreadsheet tools. This is especially useful for sharing data with collaborators who are not familiar with R and would rather look up there favorite genes in a spreadsheet format.
First, let’s start by opening an R session and creating a function to return the eset (ExpressionSet) object or the original list object downloaded by the getGEO() function in R.
getGEOdataObjects <- function(x, getGSEobject=FALSE){ # Make sure the GEOquery package is installed require("GEOquery") # Use the getGEO() function to download the GEO data for the id stored in x GSEDATA <- getGEO(x, GSEMatrix=T, AnnotGPL=FALSE) # Inspect the object by printing a summary of the expression values for the first 2 columns print(summary(exprs(GSEDATA[[1]])[, 1:2])) # Get the eset object eset <- GSEDATA[[1]] # Save the objects generated for future use in the current working directory save(GSEDATA, eset, file=paste(x, ".RData", sep="")) # check whether we want to return the list object we downloaded on GEO or # just the eset object with the getGSEobject argument if(getGSEobject) return(GSEDATA) else return(eset) }
We can test this function on a GEO dataset such as GSE73835 as follows:
# Store the dataset ids in a vector GEO_DATASETS just in case you want to loop through several GEO ids GEO_DATASETS <- c("GSE73835") # Use the function we created to return the eset object eset <- getGEOdataObjects(GEO_DATASETS[1]) # Inspect the eset object to get the annotation GPL id eset
You will see the following output:
ExpressionSet (storageMode: lockedEnvironment)
assayData: 45281 features, 6 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM1904293 GSM1904294 … GSM1904298 (6 total)
varLabels: title geo_accession … data_row_count (35 total)
varMetadata: labelDescription
featureData
featureNames: ILMN_1212602 ILMN_1212603 … ILMN_3163582 (45281 total)
fvarLabels: ID Species … ORF (30 total)
fvarMetadata: Column Description labelDescription
experimentData: use ‘experimentData(object)’
Annotation: GPL6887
We will first need to download the annotation file for GPL6887. Then we can create a data frame with the probe annotation categories we are most interested in as follows:
# Get the annotation GPL id (see Annotation: GPL10558) gpl <- getGEO('GPL6887', destdir=".") Meta(gpl)$title # Inspect the table of the gpl annotation object colnames(Table(gpl)) # Get the gene symbol and entrez ids to be used for annotations Table(gpl)[1:10, c(1, 6, 9, 12)] dim(Table(gpl)) # Get the gene expression data for all the probes with a gene symbol geneProbes <- which(!is.na(Table(gpl)$Symbol)) probeids <- as.character(Table(gpl)$ID[geneProbes]) probes <- intersect(probeids, rownames(exprs(eset))) length(probes) geneMatrix <- exprs(eset)[probes, ] inds <- which(Table(gpl)$ID %in% probes) # Check you get the same probes head(probes) head(as.character(Table(gpl)$ID[inds])) # Create the expression matrix with gene ids geneMatTable <- cbind(geneMatrix, Table(gpl)[inds, c(1, 6, 9, 12)]) head(geneMatTable) # Save a copy of the expression matrix as a csv file write.csv(geneMatTable, paste(GEO_DATASETS[1], "_DataMatrix.csv", sep=""), row.names=T)
Let’s take a look at the first 6 lines of the data frame we just created with the head() function.
As you can see once we export this data frame as a csv file, it is much easier for others to open this file as a spreadsheet and get useful information such as the gene symbol or entrez id with the expression values across the samples.
Hope this helps and happy collaborations!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.