Site icon R-bloggers

Reading data from the new version of Google Spreadsheets

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Spreadsheets remain an important way for people to share and work with data. Among other providers, Google has provided the ability to create online spreadsheets and other documents.

Back in 2009, David Smith posted a blog entry on how to use R, and specifically the XML package to import data from a Google Spreadsheet. Once you marked your Google sheet as exported, it took about two lines of code to import your data into a data frame.

But things have changed

More recently, it seems that Google changed and improved the Spreadsheet product. Google's own overview of changes lists some changes, but one change isn't on this list. In the previous version, it was possible to publish a sheet as a csv file. In the new version it is still possible to publish a sheet, but the ability to do this as csv is no longer there.

On April 5, 2014 somebody asked a question on StackOverflow on how to deal with this.

Because I had the same need to import data from a spreadsheet shared in our team, I set out to find and answer.

Quick overview of publishing a sheet in the new version of Google docs

To publish a Google Docs spreadsheet is really as simple as following these three steps:

  • Create a google sheet
  • Publish to web
  • Copy the document link

R code to read the Google data

Here is the code. You will need to load the XML package before using this.

The function

readGoogleSheet()

returns a list of data frames, one for each table found on the Google sheet:

library(XML)

readGoogleSheet <- function(url, na.string="", header=TRUE){

  stopifnot(require(XML))

  # Suppress warnings because Google docs seems to have incomplete final line

  suppressWarnings({

    doc <- paste(readLines(url), collapse=" ")

  })

  if(nchar(doc) == 0) stop("No content found")

  htmlTable <- gsub("^.*?(<table.*</table).*$", "\\1>", doc)

  ret <- readHTMLTable(htmlTable, header=header, stringsAsFactors=FALSE, as.data.frame=TRUE)

  lapply(ret, function(x){ x[ x == na.string] <- NA; x})

}

Next, we need a function to clean the individual tables.

cleanGoogleTable()

removes empty lines inserted by Google, removes the row names (if they exist) and allows you to skip empty lines before the table starts:

cleanGoogleTable <- function(dat, table=1, skip=0, ncols=NA, nrows=-1, header=TRUE, dropFirstCol=NA){

  if(!is.data.frame(dat)){

    dat <- dat[[table]]

  }

  if(is.na(dropFirstCol)) {

    firstCol <- na.omit(dat[[1]])

    if(all(firstCol == ".") || all(firstCol== as.character(seq_along(firstCol)))) {

      dat <- dat[, -1]

    }

  } else if(dropFirstCol) {

    dat <- dat[, -1]

  }

  if(skip > 0){

    dat <- dat[-seq_len(skip), ]

  }

  if(nrow(dat) == 1) return(dat)

  if(nrow(dat) >= 2){

    if(all(is.na(dat[2, ]))) dat <- dat[-2, ]

  }

  if(header && nrow(dat) > 1){

    header <- as.character(dat[1, ])

    names(dat) <- header

    dat <- dat[-1, ]

  }

  # Keep only desired columns

  if(!is.na(ncols)){

    ncols <- min(ncols, ncol(dat))

    dat <- dat[, seq_len(ncols)]

  }

  # Keep only desired rows

  if(nrows > 0){

    nrows <- min(nrows, nrow(dat))

    dat <- dat[seq_len(nrows), ]

  }

  # Rename rows

  rownames(dat) <- seq_len(nrow(dat))

  dat

}

See it in action

I created a Google Spreadsheet with data about the periodic table of elements. I then published this sheet. If you follow the link, you will notice that the document contains two sheets. Sheet1 contains the periodic table data.

To read the table, try the following three lines of code:

gdoc <- "https://docs.google.com/spreadsheets/d/1MQ50_tn76GqQAOpFigcHms4zFqkoM_JS4sOittv_vgA/pubhtml"

elem <- readGoogleSheet(gdoc)

m <- cleanGoogleTable(elem, table=1)

head(m)

    Atomic no          Name Symbol Group Period Block State at STP Occurrence

  1         1      Hydrogen      H     1      1     s          Gas Primordial

  2         2        Helium     He    18      1     s          Gas Primordial

  3         3       Lithium     Li     1      2     s        Solid Primordial

  4         4     Beryllium     Be     2      2     s        Solid Primordial

  5         5         Boron      B    13      2     p        Solid Primordial

  6         6        Carbon      C    14      2     p        Solid Primordial

tail(m)

    Atomic no          Name Symbol Group Period Block State at STP Occurrence

113       113   (Ununtrium)    Uut    13      7     p               Synthetic

114       114 (Ununquadium)    Uuq    14      7     p               Synthetic

115       115 (Ununpentium)    Uup    15      7     p               Synthetic

116       116  (Ununhexium)    Uuh    16      7     p               Synthetic

117       117 (Ununseptium)    Uus    17      7     p               Synthetic

118       118  (Ununoctium)    Uuo    18      7     p               Synthetic

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.