Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I want to get an NCBI GEO report showing the number of samples per series or data set. Short of downloading all of GEO, anyone know how to do this? Is there a table of just metadata hidden somewhere?
At work, we joke that GEO is the only database where data goes in, but it won’t come out. However, there is an alternative: the GEOmetadb package, available from Bioconductor.
The R code first, then some explanation:
    # install GEOmetadb
    source("http://bioconductor.org/biocLite.R")
    biocLite("GEOmetadb")
    library(GEOmetadb)

    # connect to database
    getSQLiteFile()
    con <- dbConnect(SQLite(), "GEOmetadb.sqlite")

    # count samples per GDS
    gds.count <- dbGetQuery(con, "select gds,sample_count from gds")
    gds.count[1:5,]  # first 5 results
    #     gds sample_count
    # 1  GDS5            5
    # 2  GDS6           29
    # 3 GDS10           28
    # 4 GDS12            8
    # 5 GDS15            6

    # count samples per GSE
    gse <- dbGetQuery(con, "select series_id from gsm")
    gse.count <- as.data.frame(table(gse$series_id))
    gse.count[1:10,]  # first 10 results
    #                Var1 Freq
    # 1              GSE1   38
    # 2             GSE10    4
    # 3            GSE100    4
    # 4          GSE10000   29
    # 5          GSE10001   12
    # 6          GSE10002    8
    # 7          GSE10003    4
    # 8 GSE10004,GSE10114    3
    # 9          GSE10005   48
    # 10         GSE10006   75
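We install and load GEOmetadb with biocLite() and library(), then download and unpack the SQLite database with getSQLiteFile(). This generates the file GEOmetadb.sqlite in the working directory (~/GEOmetadb.sqlite in my case), which is currently a little over 1 GB. On recent Bioconductor releases, biocLite() has been replaced by BiocManager::install().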
Next, we connect to the database via RSQLite, using dbConnect(). The gds table contains the GDS dataset accession and the sample count, so extracting that information takes a single dbGetQuery() call.
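If you would like to try the query pattern before committing to the 1 GB download, here is a minimal sketch against an in-memory SQLite database. The toy gds table is my own stand-in holding only the five rows printed above, not the full GEOmetadb schema, and the names toy.gds, tables and fields are just for illustration. dbListTables() and dbListFields() are standard DBI calls, so they should behave the same way on the real connection.

```r
library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver, the same one GEOmetadb relies on

# Assumption: an in-memory database standing in for GEOmetadb.sqlite
con <- dbConnect(SQLite(), ":memory:")

# Hypothetical miniature of the gds table (five rows from the output above)
toy.gds <- data.frame(
  gds          = c("GDS5", "GDS6", "GDS10", "GDS12", "GDS15"),
  sample_count = c(5L, 29L, 28L, 8L, 6L),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "gds", toy.gds)

# See which tables and fields exist before writing queries
tables <- dbListTables(con)
fields <- dbListFields(con, "gds")

# Same query as in the post, with the sorting pushed into SQL
gds.count <- dbGetQuery(con,
  "select gds, sample_count from gds order by sample_count desc")
print(gds.count)

# Close the connection when finished
dbDisconnect(con)
```

Against the real database, only the dbConnect() path changes; the toy table and dbWriteTable() step drop out.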
GSE series are a little different. The gsm table contains the GSM sample accession and the GSE series accession (in the series_id field). We can count up the samples per series by passing gse$series_id to table(). However, this generates some odd-looking results, such as:
                       Var1 Freq
    15    GSE10011,GSE10026   45
    14652  GSE9973,GSE10026    9
    14654  GSE9975,GSE10026   36
    14656  GSE9977,GSE10026   24
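Fear not. In this case, GSE10026 is a super-series composed of the series GSE10011 (45 samples), GSE9973 (9 samples), GSE9975 (36 samples) and GSE9977 (24 samples), for a total of 114 samples. Each of these samples lists both its own series and the super-series in series_id, separated by a comma.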
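To get one total per GSE accession, super-series included, the comma-separated series_id values can be split before summing. This is a minimal base R sketch that uses only the four rows shown above as toy input; on the real data you would start from gse.count instead. The names toy, long and per.series are mine.

```r
# Toy input: the four rows shown above, shaped like as.data.frame(table(...))
toy <- data.frame(
  Var1 = c("GSE10011,GSE10026", "GSE9973,GSE10026",
           "GSE9975,GSE10026", "GSE9977,GSE10026"),
  Freq = c(45L, 9L, 36L, 24L),
  stringsAsFactors = FALSE
)

# Split each comma-separated label into its individual GSE accessions
ids <- strsplit(as.character(toy$Var1), ",", fixed = TRUE)

# One row per (accession, count) pair: each count is repeated once
# for every accession that appeared on its original row
long <- data.frame(
  gse = unlist(ids),
  n   = rep(toy$Freq, lengths(ids)),
  stringsAsFactors = FALSE
)

# Sum the counts for each accession
per.series <- aggregate(n ~ gse, data = long, FUN = sum)
print(per.series)
```

Note that a sample is deliberately counted once for its own series and once for the super-series, so the per-series totals add up to more than the number of distinct samples.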
Posted in bioinformatics, computing, R, statistics Tagged: database, geo, microarray, ncbi