GEO database: curation lagging behind submission?
[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I was reading an old post that describes GEOmetadb, a downloadable database containing metadata from the GEO database. We had a brief discussion in the comments about the growth in GSE records (user-submitted) versus GDS records (curated datasets) over time. Below is some quick and dirty R code to examine the issue, using the Bioconductor GEOmetadb package and ggplot2; it plots the cumulative number of records of each type against their "last updated" date. The resulting image is shown at left.
Is the curation effort keeping up with user submissions? It is a little difficult to say, since GEOmetadb curation seems to have issues of its own: (1) why do GDS records stop in 2008? (2) why do GDS (curated) records begin earlier than GSE (submitted) records?
library(GEOmetadb)
library(ggplot2)

# update database if required using getSQLiteFile()
# connect to database; assumed to be in user $HOME
con <- dbConnect(SQLite(), "~/GEOmetadb.sqlite")

# fetch "last updated" dates for GDS and GSE
gds <- dbGetQuery(con, "select update_date from gds")
gse <- dbGetQuery(con, "select last_update_date from gse")

# cumulative sums by date; no factor variables
gds.count <- as.data.frame(cumsum(table(gds)), stringsAsFactors = FALSE)
gse.count <- as.data.frame(cumsum(table(gse)), stringsAsFactors = FALSE)

# make GDS and GSE data frames comparable
colnames(gds.count) <- "count"
colnames(gse.count) <- "count"

# row names (dates) to real dates
gds.count$date <- as.POSIXct(rownames(gds.count))
gse.count$date <- as.POSIXct(rownames(gse.count))

# add type for plotting
gds.count$type <- "gds"
gse.count$type <- "gse"

# combine GDS and GSE data frames
gds.gse <- rbind(gds.count, gse.count)

# and plot records over time by type
png(filename = "geometadb.png", width = 800, height = 600)
print(ggplot(gds.gse, aes(date, count)) + geom_line(aes(color = type)))
dev.off()
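One way to start on the two questions about date ranges is to look at the first and last date for each record type. What follows is only a minimal sketch, not part of the original analysis: the helper cumulative_by_date() is a hypothetical name, and the toy dates are invented so that the code runs without the GEOmetadb SQLite file. It repeats the cumsum(table()) step from the code above, then reports the date range per type. Bear in mind that the queries above fetch "last updated" dates rather than submission dates, which might account for some of the oddities.

```r
# Hypothetical helper (not from GEOmetadb): cumulative record counts by date,
# given a character vector of "YYYY-MM-DD" strings and a type label.
cumulative_by_date <- function(dates, type) {
  # drop missing or empty dates before converting
  dates <- as.Date(dates[!is.na(dates) & nzchar(dates)])
  counts <- table(dates)
  data.frame(
    date  = as.Date(names(counts)),
    count = as.vector(cumsum(counts)),
    type  = type,
    stringsAsFactors = FALSE
  )
}

# Toy input, invented for illustration only (NOT real GEO data)
toy_gds <- c("2001-05-01", "2001-05-01", "2003-02-10", NA)
toy_gse <- c("2002-01-15", "2004-07-30", "2004-07-30", "2009-11-02")

toy <- rbind(cumulative_by_date(toy_gds, "gds"),
             cumulative_by_date(toy_gse, "gse"))

# first and last date seen for each record type
date_ranges <- do.call(rbind, lapply(split(toy, toy$type), function(d) {
  data.frame(type = d$type[1], first = min(d$date), last = max(d$date))
}))

print(toy)
print(date_ranges)
```

With the real database, the same helper could presumably be called on gds$update_date and gse$last_update_date from the queries above.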
Filed under: bioinformatics, R, statistics, web resources Tagged: bioconductor, database, geo, geometadb, ggplot2, microarray