Site icon R-bloggers

Creating Annotated Data Frames from GEO with the GEOquery package

[This article was first published on Let's talk about science with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Mining gene expression data from publicly available databases is a great way to find evidence to support you working hypothesis that gene X is relevant in condition Y. You may also want to mine publicly available data to build on an existing hypothesis or simply to find additional support for your favorite gene in a different animal model or experimental condition. In this post, we will go over how to use the GEOquery package to download a data matrix (or eset object) directly into R and append specific probe annotation information to this matrix for it to be exported as a csv file for easy manipulation in Excel or spreadsheet tools. This is especially useful for sharing data with collaborators who are not familiar with R and would rather look up there favorite genes in a spreadsheet format.

First, let’s start by opening an R session and creating a function to return the eset (ExpressionSet) object or the original list object downloaded by the getGEO() function in R.


getGEOdataObjects <- function(x, getGSEobject=FALSE){
# Make sure the GEOquery package is installed
require("GEOquery")
# Use the getGEO() function to download the GEO data for the id stored in x
GSEDATA <- getGEO(x, GSEMatrix=T, AnnotGPL=FALSE)
# Inspect the object by printing a summary of the expression values for the first 2 columns
print(summary(exprs(GSEDATA[[1]])[, 1:2]))

# Get the eset object
eset <- GSEDATA[[1]]
# Save the objects generated for future use in the current working directory
save(GSEDATA, eset, file=paste(x, ".RData", sep=""))

# check whether we want to return the list object we downloaded on GEO or
# just the eset object with the getGSEobject argument
if(getGSEobject) return(GSEDATA) else return(eset)
}

We can test this function on a GEO dataset such as GSE73835 as follows:


# Store the dataset ids in a vector GEO_DATASETS just in case you want to loop through several GEO ids
GEO_DATASETS <- c("GSE73835")

# Use the function we created to return the eset object
eset <- getGEOdataObjects(GEO_DATASETS[1])
# Inspect the eset object to get the annotation GPL id
eset 

You will see the following output:

ExpressionSet (storageMode: lockedEnvironment)
assayData: 45281 features, 6 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM1904293 GSM1904294 … GSM1904298 (6 total)
varLabels: title geo_accession … data_row_count (35 total)
varMetadata: labelDescription
featureData
featureNames: ILMN_1212602 ILMN_1212603 … ILMN_3163582 (45281 total)
fvarLabels: ID Species … ORF (30 total)
fvarMetadata: Column Description labelDescription
experimentData: use ‘experimentData(object)’
Annotation: GPL6887

We will first need to download the annotation file for GPL6887. Then we can create a data frame with the probe annotation categories we are most interested in as follows:


# Get the annotation GPL id (see Annotation: GPL10558)
gpl <- getGEO('GPL6887', destdir=".")
Meta(gpl)$title

# Inspect the table of the gpl annotation object
colnames(Table(gpl))

# Get the gene symbol and entrez ids to be used for annotations
Table(gpl)[1:10, c(1, 6, 9, 12)]
dim(Table(gpl))

# Get the gene expression data for all the probes with a gene symbol
geneProbes <- which(!is.na(Table(gpl)$Symbol))
probeids <- as.character(Table(gpl)$ID[geneProbes])

probes <- intersect(probeids, rownames(exprs(eset)))
length(probes)

geneMatrix <- exprs(eset)[probes, ]

inds <- which(Table(gpl)$ID %in% probes)
# Check you get the same probes
head(probes)
head(as.character(Table(gpl)$ID[inds]))

# Create the expression matrix with gene ids
geneMatTable <- cbind(geneMatrix, Table(gpl)[inds, c(1, 6, 9, 12)])
head(geneMatTable)

# Save a copy of the expression matrix as a csv file
write.csv(geneMatTable, paste(GEO_DATASETS[1], "_DataMatrix.csv", sep=""), row.names=T)

Let’s take a look at the first 6 lines of the data frame we just created with the head() function.

As you can see once we export this data frame as a csv file, it is much easier for others to open this file as a spreadsheet and get useful information such as the gene symbol or entrez id with the expression values across the samples.

Hope this helps and happy collaborations!


To leave a comment for the author, please follow the link and comment on their blog: Let's talk about science with R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.