Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
What is the best way to read data in JSON format into R? Though really common for almost all modern online applications, JSON is not every R user’s best friend. After seeing the slides for my Web Scraping course, in which I somewhat arbitrarily veered between using the packages rjson and RJSONIO, the creator of a third JSON package, Jeroen Ooms, urged me to reconsider my package selection process. So without further ado, is jsonlite any better? Does it get rid of the problem of seemingly infinitely nested lists?
As part of exploring digital data collection we used a range of sources that provide JSON data – from Wikipedia page views to social media sharing stats to YouTube Comments and real-time cricket scores. A persistent annoyance for students was navigating the JSON structure, typically translated into R as a list. Here is what my YouTube stats scraper looks like:
getStats <- function(id) { url = paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json") raw.data <- readLines(url, warn = "F") rd <- fromJSON(raw.data) dop <- as.character(rd$entry$published) term <- rd$entry$category[[2]]["term"] label <- rd$entry$category[[2]]["label"] title <- rd$entry$title author <- rd$entry$author[[1]]$name duration <- rd$entry$`media$group`$`media$content`[[1]]["duration"] favs <- rd$entry$`yt$statistics`["favoriteCount"] views <- rd$entry$`yt$statistics`["viewCount"] dislikes <- rd$entry$`yt$rating`["numDislikes"] likes <- rd$entry$`yt$rating`["numLikes"] return(list(id, dop, term, label, title, author, duration, favs, views, dislikes, likes)) } (getStats("Ya2elsR5s5s"))
[[1]]
[1] “Ya2elsR5s5s”
[[2]]
[1] “2013-12-17T19:01:44.000Z”
etc.
Now, this is all fine, except that, upon closer inspection, the scraper function burrows into lists to extract the correct field. We use special ticks to accommodate names with dollar-signs in them, to name but one challenge.
Is this any easier using jsonlite?
require(jsonlite) id = "Ya2elsR5s5s" url = paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json") raw.data <- readLines(url, warn = "F") rd <- fromJSON(raw.data) term <- rd$entry$category$term[2] label <- rd$entry$category$label[2] title <- rd$entry$title author <- rd$entry$author[1] duration <- rd$entry$`media$group`$`media$content`$duration[1]
is this any better? I’m not convinced there’s much in it: because of the JSON structure used by the YouTube API, jsonlite can only coerce a few elements into data.frames, and these are still buried deep in the list structure. The object ‘rd’ contains a mix of named entities and data.frames, and in this case we have to do similar excavation to get at interesting data.
What about social stats, e.g. facebook shares?
Here is my approach from the web scraping tutorials: first we construct the HTTP request, then we read the response using rjson
fqlQuery = "select share_count,like_count,comment_count from link_stat where url=\"" url = "http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky" queryUrl = paste0("http://graph.facebook.com/fql?q=", fqlQuery, url, "\"") #ignoring the callback part lookUp <- URLencode(queryUrl) #What do you think this does? lookUp ## [1] "http://graph.facebook.com/fql?q=select%20share_count,like_count,comment_count%20from%20link_stat%20where%20url=%22http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky%22" rd <- readLines(lookUp, warn = "F") require(rjson) dat <- fromJSON(rd) dat ## $data ## $data[[1]] ## $data[[1]]$share_count ## [1] 388 ## ## $data[[1]]$like_count ## [1] 430 ## ## $data[[1]]$comment_count ## [1] 231 dat$data[[1]]["share_count"] ## $share_count ## [1] 388
How does jsonlite compare?
require(jsonlite) dat <- fromJSON(rd) dat ## $data ## share_count like_count comment_count ## 1 388 430 231 dat$data$share_count ## [1] 388
Is that better? Yes, I think jsonlite in this case offers a significant improvement.
What about writing to JSON?
Not long ago I did a bit of work involving exporting data from R for use in d3 visualisations. This data had to be in a nested JSON format, which I approximated through a (to me) rather complex process using split and lapply. Can jsonlite simplify this at all?
Possibly. Though my gut reaction is that creating nested data.frames is not much simpler than manually creating creating nested lists. I repeatedly used the split function to chop up the data into a nested structure. Once this was done, however, toJSON wrote very nice output:
"9" : { "33" : { "74" : [ { "label" : "V155", "labs" : "Bird Flu and Epidemics" }, { "label" : "V415", "labs" : "Fowl and Meat Industry" } ], "75" : [ { "label" : "V166", "labs" : "Academics" }, { "label" : "V379", "labs" : "Places Of Study and Investigation" } ], "76" : [ { "label" : "V169", "labs" : "Space Exploration" }, { "label" : "V261", "labs" : "Cosmonauts" } ] }
My verdict: jsonlite makes saving a data.frame in JSON very easy indeed, and the fact we can turn a data.frame seamlessly into a ‘flat’ JSON file is excellent. In many real-world situations the reason for using JSON in the first place (rather than say csv) is that a columns/row structure is either inefficient or plain inappropriate. jsonlite is a welcome addition, though transporting data between R and javascript and applications is not seamless just yet. The bottom-line: great for simple cases; tricky structures remain tricky.
Seriously: does anyone know how to automatically created nested data frames or lists?
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.