High performance JSON streaming in R: Part 1
The jsonlite stream_in and stream_out functions implement line-by-line processing of JSON data over a connection, such as a socket, URL, file, or pipe. This lets us construct a data processing pipeline that can handle large (or unlimited) amounts of data with limited memory. This post walks through some examples from the help pages.
The JSON streaming format
Because parsing huge JSON strings is difficult and inefficient, JSON streaming is done using lines of minified JSON records. This is pretty standard: JSON databases such as MongoDB use the same format to import/export large datasets. Note that this means that the total stream combined is not valid JSON itself; only the individual lines are.
library(jsonlite)
x <- iris[1:3,]
stream_out(x, con = stdout())
# {"Sepal.Length":5.1,"Sepal.Width":3.5,"Petal.Length":1.4,"Petal.Width":0.2,"Species":"setosa"}
# {"Sepal.Length":4.9,"Sepal.Width":3,"Petal.Length":1.4,"Petal.Width":0.2,"Species":"setosa"}
# {"Sepal.Length":4.7,"Sepal.Width":3.2,"Petal.Length":1.3,"Petal.Width":0.2,"Species":"setosa"}
Also note that because line breaks are used as separators, prettified JSON is not permitted: each JSON line must be minified. In this respect the format differs from fromJSON and toJSON, where all lines are part of a single JSON structure with optional line breaks.
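To see the difference, compare the line-delimited output of stream_out above with toJSON, which encodes the same data frame as a single JSON value (an array of records) that may span many lines when prettified:

library(jsonlite)
x <- iris[1:3,]
# One JSON value: an array of records, optionally spread over multiple lines
toJSON(x, pretty = TRUE)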
Streaming to/from a file
The nycflights13 package contains a dataset with about 5 million values. To stream this to a file:
library(nycflights13)
stream_out(flights, con = file("~/flights.json"))
Running this code will open the file connection, write JSON to the connection in batches of 500 rows, and afterwards close the connection. Status messages are printed to the console while writing. The entire process should take a few seconds and generate a JSON file of about 7MB.
We use the same file to illustrate how to stream the JSON back into R. The following code stream-parses the JSON in batches of 500 lines. Afterwards we verify that the output is indeed identical to the original:
flights2 <- stream_in(file("~/flights.json"))
all.equal(flights2, as.data.frame(flights))
# [1] TRUE
Because the data is read in small batches, this requires much less memory than parsing a huge JSON blob all at once. The pagesize argument in stream_in and stream_out can be used to specify the number of rows that will be read or written per iteration.
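For example, to read the file back in batches of 1000 rows while suppressing the status messages (verbose is likewise an argument of stream_in and stream_out):

# Larger batches, no progress output
flights2 <- stream_in(file("~/flights.json"), pagesize = 1000, verbose = FALSE)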
Streaming from a URL
We can use the standard url function in R to stream from an HTTP connection.
diamonds2 <- stream_in(url("http://jeroenooms.github.io/data/diamonds.json"))
If the data source is gzipped, simply wrap the connection in gzcon.
flights3 <- stream_in(gzcon(url("http://jeroenooms.github.io/data/nycflights13.json.gz")))
all.equal(flights3, as.data.frame(flights))
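The same trick works in the other direction: to produce such a gzipped stream ourselves, we can write through a compressed connection. A minimal sketch using base R's gzfile (the output path is just an example):

# Write line-delimited JSON through a gzip-compressed file connection
stream_out(flights, con = gzfile("~/flights.json.gz"))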
Because R currently does not support SSL, we use a curl pipe to stream over HTTPS:
flights4 <- stream_in(gzcon(pipe("curl https://jeroenooms.github.io/data/nycflights13.json.gz")))
all.equal(flights4, as.data.frame(flights))
For this to work, the curl executable needs to be installed and available on the search path; on Windows this requires Cygwin. Unfortunately, the RCurl package does not seem to support binary streaming at this point.
Next up
These examples illustrate basic line-by-line JSON streaming of data frames to and from a connection, which allows for importing and exporting large JSON datasets.
In the next blog post we will take the step to full JSON I/O streaming by defining a custom handler function. This allows us to construct a JSON data processing pipeline in R that can handle an infinite data stream. Impatient readers can have a look at the examples in the stream_in help page, or at the small preview below.
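As a small preview (reusing the ~/flights.json file from above): when a handler function is supplied, stream_in applies it to each batch of parsed records instead of accumulating everything into one combined data frame:

# Print a running count per batch instead of collecting a data frame
stream_in(file("~/flights.json"), handler = function(df) {
  cat("processed", nrow(df), "rows\n")
})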