
Datagrabbing Commonly Formatted Sheets from a Google Spreadsheet – Guardian 2014 University Guide Data


So it seems like it’s that time of year when the Guardian publish their university rankings data (Datablog: University guide 2014), which means another opportunity to have a tinker and see what I’ve learned since last year…

(Last year’s hack was Filtering Guardian University Data Every Which Way You Can…, where I had a quick go at creating a simple interactive dashboard viewer and charter over the Guardian tables, as published via Google spreadsheets.)

The data is published via a Google spreadsheet using a standard sheet layout (apart from the first two sheets, which have gid numbers 0 and 1 respectively). Sheets 2..47 all share the same format: a subject specific rank column first, followed by the institution names and a set of score columns.

The following R function provides the basis for downloading the data from a single sheet, given the spreadsheet key and the sheet gid, and puts the data into a dataframe.

library(RCurl)

#Query a sheet of a Google spreadsheet via the visualisation API, returning a dataframe
gsqAPI = function(key,query,gid=0){
  #Build the API URL, escaping the query and requesting the result as CSV;
  #ssl.verifypeer=FALSE sidesteps https certificate verification hassles
  tmp=getURL( paste( sep="",'https://spreadsheets.google.com/tq?', 'tqx=out:csv','&tq=', curlEscape(query), '&key=', key, '&gid=', gid), ssl.verifypeer = FALSE )
  return( read.csv( textConnection( tmp ),  stringsAsFactors=F ) )
}

The query argument is written using the SQL-like Google Visualisation API query language.
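For example, a minimal call against the sheet with gid=2 might look like the following (illustrative only: the column letters depend on the actual sheet layout, and key is the spreadsheet key given further down):

#Illustrative sketch: pull columns A and B for all non-empty rows of sheet gid=2
tmp=gsqAPI(key, "select A,B where B!=''", 2)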

The following routine will download the non-empty rows from a specific sheet, and rename the subject specific rank column to a generically titled column (“Subject.Rank”). An additional column is added to the table that denotes which subject the table is associated with.

#Grab the non-empty rows from sheet gid=i and normalise the rank column name
handler=function(key,i){
  tmp=gsqAPI(key,"select * where B!=''", i)
  #The first column is named "<Subject>.Rank", so pull the subject name out of it
  subject=sub(".Rank",'',colnames(tmp)[1],fixed=TRUE)
  colnames(tmp)[1]="Subject.Rank"
  #Record which subject table each row came from
  tmp$subject=subject
  tmp
}

We can now download the first subject specific sheet using the following call:

key='0Aq73qj3QslDedHFNdUsxOVZQZ1dmbXNrYlZGUWgwdHc'
gdata=handler(key,2)

This loads the data into a dataframe, with the subject specific rank column renamed to Subject.Rank and a subject column appended.
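A couple of quick base R checks show what came back:

#Peek at the first few rows and check the size of the dataframe
head(gdata)
dim(gdata)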

We can then add data to this dataframe from all the other sheets, to give us a single dataframe containing ranking data for all the subjects listed.

#Append the data from sheets 3 to 47 to the combined dataframe
for (i in 3:47){
  gdata=rbind(gdata,handler(key,i))
}

(This is probably not a very R idiomatic (iRonic?) way of doing things, but it seems to do the trick…)
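For what it’s worth, a more idiomatic sketch of the same thing uses lapply() to grab all the sheets and binds the resulting dataframes in a single step (this would replace both the initial handler() call and the loop above):

#Grab sheets 2 to 47 and rbind the resulting dataframes in one go
gdata=do.call(rbind, lapply(2:47, function(i) handler(key,i)))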

The result is a single dataframe containing all the subject specific data from the Guardian spreadsheets.

If we check the data types of the columns in the full data set, we can see that most of the columns that should be treated as numeric types are instead being treated as characters.
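For example, sapply() reports the class of each column:

#Check what type each column has been read in as
sapply(gdata, class)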

We can fix this with a one-liner that forces the type on the columns we select (the fourth to the eleventh columns):

#Coerce the score columns (columns 4 to 11) to numeric
gdata[, 4:11] <- sapply(gdata[, 4:11], as.numeric)
#Cast the university names and subjects to levelled factors
gdata$Name.of.Institution=as.factor(gdata$Name.of.Institution)
gdata$subject=as.factor(gdata$subject)
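As a quick check, levels() should now list the subject areas we pulled in:

#List the subject areas covered by the combined dataframe
levels(gdata$subject)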

One advantage of combining the data from the separate sheets into a monolithic data table is that we can see the ranking across all subject areas for a given university. For example:

oxfordbrookes.rankings = subset(gdata, Name.of.Institution=='Oxford Brookes', select=c('subject','Subject.Rank') )
#Let's also sort the results:
oxfordbrookes.rankings = oxfordbrookes.rankings[ order(oxfordbrookes.rankings['Subject.Rank']), ]
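Printing the first few rows then shows the subjects Oxford Brookes ranks most highly in:

#Show the top ranked subjects for Oxford Brookes
head(oxfordbrookes.rankings)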

We can also start to quiz the data in a graphical way. For example, can we get a feel for how courses are distributed within a university according to teaching and assessment satisfaction levels, whilst also paying heed to the value added score?

library(ggplot2)
oxfordbrookes.full = subset(gdata, Name.of.Institution=='Oxford Brookes' )
#Plot subjects by teaching vs assessment satisfaction, with label size mapped to the value added score
ggplot(oxfordbrookes.full) + geom_text(aes(x=X..Satisfied.with.Teaching, y=X..Satisfied.with.Assessment, label=subject,size=Value.added.score.10))

All told, it took maybe a couple of hours of faffing around trying to remember R syntax and cope with https hassles to get this far (including a chunk of time to write this post;-). But now we’re in a position to start hacking out quick queries and having a proper conversation with the data. The work can also be viewed as disposable tool building – it hasn’t required a project proposal, it hasn’t used lots of third party developer time etc etc.

As it is though, I’m not sure how useful getting this far is to anyone who isn’t willing to have a go at hacking the data for themselves…

Hmmm… maybe I should try to sketch out a quick Shiny app…..?

