Reading data from the new version of Google Spreadsheets
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Spreadsheets remain an important way for people to share and work with data. Among other providers, Google has provided the ability to create online spreadsheets and other documents.
Back in 2009, David Smith posted a blog entry on how to use R, and specifically the XML package to import data from a Google Spreadsheet. Once you marked your Google sheet as exported, it took about two lines of code to import your data into a data frame.
But things have changed
More recently, it seems that Google changed and improved the Spreadsheet product. Google's own overview of changes lists some changes, but one change isn't on this list. In the previous version, it was possible to publish a sheet as a csv file. In the new version it is still possible to publish a sheet, but the ability to do this as csv is no longer there.
On April 5, 2014 somebody asked a question on StackOverflow on how to deal with this.
Because I had the same need to import data from a spreadsheet shared in our team, I set out to find and answer.
Quick overview of publishing a sheet in the new version of Google docs
To publish a Google Docs spreadsheet is really as simple as following these three steps:
- Create a google sheet
- Publish to web
- Copy the document link
R code to read the Google data
Here is the code. You will need to load the XML package before using this.
The function
readGoogleSheet()
returns a list of data frames, one for each table found on the Google sheet:
library(XML) readGoogleSheet <- function(url, na.string="", header=TRUE){ stopifnot(require(XML)) # Suppress warnings because Google docs seems to have incomplete final line suppressWarnings({ doc <- paste(readLines(url), collapse=" ") }) if(nchar(doc) == 0) stop("No content found") htmlTable <- gsub("^.*?(<table.*</table).*$", "\\1>", doc) ret <- readHTMLTable(htmlTable, header=header, stringsAsFactors=FALSE, as.data.frame=TRUE) lapply(ret, function(x){ x[ x == na.string] <- NA; x}) }
Next, we need a function to clean the individual tables.
cleanGoogleTable()
removes empty lines inserted by Google, removes the row names (if they exist) and allows you to skip empty lines before the table starts:
cleanGoogleTable <- function(dat, table=1, skip=0, ncols=NA, nrows=-1, header=TRUE, dropFirstCol=NA){ if(!is.data.frame(dat)){ dat <- dat[[table]] } if(is.na(dropFirstCol)) { firstCol <- na.omit(dat[[1]]) if(all(firstCol == ".") || all(firstCol== as.character(seq_along(firstCol)))) { dat <- dat[, -1] } } else if(dropFirstCol) { dat <- dat[, -1] } if(skip > 0){ dat <- dat[-seq_len(skip), ] } if(nrow(dat) == 1) return(dat) if(nrow(dat) >= 2){ if(all(is.na(dat[2, ]))) dat <- dat[-2, ] } if(header && nrow(dat) > 1){ header <- as.character(dat[1, ]) names(dat) <- header dat <- dat[-1, ] } # Keep only desired columns if(!is.na(ncols)){ ncols <- min(ncols, ncol(dat)) dat <- dat[, seq_len(ncols)] } # Keep only desired rows if(nrows > 0){ nrows <- min(nrows, nrow(dat)) dat <- dat[seq_len(nrows), ] } # Rename rows rownames(dat) <- seq_len(nrow(dat)) dat }
See it in action
I created a Google Spreadsheet with data about the periodic table of elements. I then published this sheet. If you follow the link, you will notice that the document contains two sheets. Sheet1 contains the periodic table data.
To read the table, try the following three lines of code:
gdoc <- "https://docs.google.com/spreadsheets/d/1MQ50_tn76GqQAOpFigcHms4zFqkoM_JS4sOittv_vgA/pubhtml" elem <- readGoogleSheet(gdoc) m <- cleanGoogleTable(elem, table=1) head(m) Atomic no Name Symbol Group Period Block State at STP Occurrence 1 1 Hydrogen H 1 1 s Gas Primordial 2 2 Helium He 18 1 s Gas Primordial 3 3 Lithium Li 1 2 s Solid Primordial 4 4 Beryllium Be 2 2 s Solid Primordial 5 5 Boron B 13 2 p Solid Primordial 6 6 Carbon C 14 2 p Solid Primordial tail(m) Atomic no Name Symbol Group Period Block State at STP Occurrence 113 113 (Ununtrium) Uut 13 7 p Synthetic 114 114 (Ununquadium) Uuq 14 7 p Synthetic 115 115 (Ununpentium) Uup 15 7 p Synthetic 116 116 (Ununhexium) Uuh 16 7 p Synthetic 117 117 (Ununseptium) Uus 17 7 p Synthetic 118 118 (Ununoctium) Uuo 18 7 p Synthetic
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.