Replicating the Apache Drill ‘Yelp’ Academic Dataset Analysis with sergeant
The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files.
If you have enough memory and want to work with “flattened” versions of the files in R, you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one, corpus::read_ndjson, is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check out the query profiles after the fact to see how much each worker component ended up using), and I’m only mentioning that “R can do this w/o Drill” to stave off some of those types of comments.
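To make “flattened” concrete, here is a minimal sketch using ndjson::stream_in() on a tiny, made-up file. The two records and their field names are invented for illustration and are not taken from the Yelp data; the only point is that nested fields come back as ordinary columns.

```r
library(ndjson)

# Hypothetical two-record ndjson file (one JSON object per line);
# the field names are made up for illustration, not Yelp's schema.
tmp <- tempfile(fileext = ".json")
writeLines(c(
  '{"name":"Cafe A","attributes":{"wifi":"free"}}',
  '{"name":"Cafe B","attributes":{"wifi":"no"}}'
), tmp)

# stream_in() reads the file and flattens nested objects into
# columns whose names join the nesting path with dots.
flat <- ndjson::stream_in(tmp)

names(flat)
```

On the real files, the same call would need enough RAM to hold the whole flattened result, which is the trade-off mentioned above.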
The main reasons for replicating their Yelp example were to have a more robust test suite for sergeant (it’s hitting CRAN soon now that dplyr 0.7.0 is out) and to show some Drill SQL to R conversions. Part of the latter reason is also to show how to use SQL calls to create a tbl that you can then manipulate with dplyr verbs.
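As a rough sketch of that SQL-to-tbl pattern (not the code from the full replication), the idea is to wrap a Drill SQL query in parentheses, hand it to tbl(), and then keep going with dplyr. This assumes a Drill instance listening on localhost; the file path is hypothetical, and the column names (state, city, review_count) are assumptions about the business file that may not match the current Yelp schema.

```r
library(sergeant)
library(dplyr)

# A Drill SQL query wrapped in parentheses so it can act as a table.
# ASSUMPTIONS: the path below is hypothetical, and the columns
# state, city and review_count may differ in the current data set.
yelp_sql <- "(SELECT state, city, review_count FROM dfs.`/data/yelp/business.json`)"

# Only run the query if a Drill server is actually up on localhost.
if (drill_active(drill_connection("localhost"))) {

  db <- src_drill("localhost")

  # The SQL query becomes a lazy tbl ...
  business <- tbl(db, yelp_sql)

  # ... which dplyr verbs translate back into Drill SQL.
  top_cities <- business %>%
    group_by(state, city) %>%
    summarise(total_reviews = sum(review_count)) %>%
    arrange(desc(total_reviews)) %>%
    head(10) %>%
    collect()

  print(top_cities)
}
```

Nothing is pulled into R until collect(), so the filtering and aggregation happen inside Drill rather than in R's memory.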
The full tutorial replication is at https://rud.is/rpubs/yelp.html.