New package: curl. High performance http(s) streaming in R.
A bit ago I blogged about new streaming features in jsonlite:
library(jsonlite)
diamonds2 <- stream_in(url("http://jeroenooms.github.io/data/diamonds.json"))
In the same blog post it was also mentioned that R currently does not support https connections. The RCurl package does support https, but does not have a connection interface. This bothered me, so I decided to write one. The result is the new curl package.
Encryption, compression and more
From the package description:
The curl() function provides a drop-in replacement for base url() with better performance and support for http 2.0, ssl (https, ftps), gzip, deflate and other libcurl goodies. This interface is implemented using the RConnection API in order to support incremental processing of both binary and text streams.
What this means is that curl() should be able to do anything that url() does, but better. The same example as above, but now with https:
library(curl)
library(jsonlite)
diamonds2 <- stream_in(curl("https://jeroenooms.github.io/data/diamonds.json"))
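Because curl() returns an ordinary connection object, a stream can also be consumed in chunks instead of all at once. The following is a minimal sketch of that pattern in base R. The names process_in_chunks, handler and chunk_size are made up for illustration, and the demo uses a textConnection so that it runs without network access.

```r
# Read a text connection chunk by chunk and pass each chunk to `handler`.
# `process_in_chunks`, `handler` and `chunk_size` are hypothetical names,
# not part of the curl or jsonlite packages.
process_in_chunks <- function(con, handler, chunk_size = 500) {
  # Open the connection ourselves only if the caller has not done so,
  # and in that case also take care of closing it again.
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con), add = TRUE)
  }
  total <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break  # end of stream
    handler(lines)
    total <- total + length(lines)
  }
  total
}

# Offline demo: five lines, read two at a time.
sizes <- integer(0)
demo_con <- textConnection(c("a", "b", "c", "d", "e"))
n_read <- process_in_chunks(
  demo_con,
  handler = function(chunk) sizes <<- c(sizes, length(chunk)),
  chunk_size = 2
)
close(demo_con)
```

With the curl package installed, passing curl("https://jeroenooms.github.io/data/diamonds.json") as con should work the same way, assuming the URL is still reachable.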
That was easy. Switching to curl has other benefits as well. For example, it automatically recognizes and decompresses gzipped or deflated responses, as indicated by the Content-Encoding header:
readLines(curl("http://httpbin.org/gzip"), warn = FALSE)
readLines(curl("http://httpbin.org/deflate"), warn = FALSE)
Support for compression can make a huge difference when streaming large data. Text-based formats such as json are popular because they are human readable, but the main downside of plain text is inefficiency for storing numbers. However, when gzipped, json payloads are often comparable in size to binary formats, giving you the best of both worlds.
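A rough way to get a feel for the size difference without touching the network is to compress some JSON text in memory. This sketch uses only base R. The JSON is synthetic (random numbers under made-up carat and price keys), so the ratio it yields is illustrative and not a measurement of the diamonds data. Note that memCompress() with type = "gzip" applies deflate-based compression; it is used here as a stand-in for what a server would do, not as the exact framing sent over http.

```r
# Build a synthetic JSON array by hand so that only base R is needed.
# The field names and value ranges are invented for this illustration.
set.seed(1)
rows <- sprintf(
  '{"carat":%.2f,"price":%d}',
  runif(1000, min = 0.2, max = 3),
  sample(300:18000, 1000, replace = TRUE)
)
json <- paste0("[", paste(rows, collapse = ","), "]")

# Compress the raw bytes in memory and compare sizes.
raw_json <- charToRaw(json)
compressed <- memCompress(raw_json, type = "gzip")
ratio <- length(compressed) / length(raw_json)

# Decompressing gives back exactly the original text.
roundtrip <- memDecompress(compressed, type = "gzip", asChar = TRUE)
```

Here ratio is the compressed size as a fraction of the plain-text size. Real payloads will compress differently depending on how repetitive the keys and values are.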
Performance
One thing that did surprise me a bit is the difference in performance. In particular, the implementation of readLines for url connections seems to be inefficient in base R.
con2 <- curl("http://jeroenooms.github.io/data/diamonds.json")
system.time(readLines(con2))
#    user  system elapsed
#   0.238   0.096   0.334
con1 <- url("http://jeroenooms.github.io/data/diamonds.json")
system.time(readLines(con1))
#    user  system elapsed
#   0.236   0.113   3.858
I’m not quite sure why this is. Maybe the base R version does some additional character recoding that I am not aware of, although I have not observed such behavior. Also, measuring performance is tricky in this case because it depends on the connection bandwidth, caching settings, etc.
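Since a single timing over the network is noisy, one option is to repeat the read several times and look at the median elapsed time. The helper below is a sketch of that idea; time_read, make_con and times are hypothetical names, and the demo again uses a textConnection so it can run offline.

```r
# Time readLines() on a fresh connection several times and return the
# median elapsed time in seconds. `make_con` is a function that creates
# a new connection on every call. All names here are illustrative.
time_read <- function(make_con, times = 5) {
  elapsed <- vapply(seq_len(times), function(i) {
    con <- make_con()
    on.exit(close(con), add = TRUE)  # destroy the connection afterwards
    system.time(readLines(con, warn = FALSE))[["elapsed"]]
  }, numeric(1))
  median(elapsed)
}

# Offline demo with an in-memory connection.
demo_elapsed <- time_read(function() textConnection(c("x", "y", "z")), times = 3)
```

To compare the two implementations from the example above, one could call time_read(function() curl("http://jeroenooms.github.io/data/diamonds.json")) and time_read(function() url("http://jeroenooms.github.io/data/diamonds.json")), keeping in mind that caching between runs can still skew the numbers.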
Compare to RCurl, httr
The curl package is not intended as an alternative to RCurl or httr. The latter packages also use libcurl, but provide a more flexible client for performing http requests in R. The purpose of the curl package is mainly to reimplement functionality already found in base R, in a way that (in a parallel universe) would allow r-core to adopt these changes to start supporting https in url, download.file, etc.
Do note that this is an initial release and the RConnection API is a bit experimental, so there might be bugs 🙂 In fact, I’ve already made quite a few changes since the CRAN release. If you report a bug, please make sure to replicate it with the latest dev version from github:
library(devtools)
install_github("jeroenooms/curl")
For some more fun examples, see the curl manual page.