When you visit a site like the LA Times’ NH Primary Live Results page, you may wish you had the data they used to make the tables & visualizations on it. Sometimes getting that data is as simple as opening up your browser’s “Developer Tools” console and looking for XHR (XMLHttpRequest) calls. You can actually see a preview of those requests (usually JSON).
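If the endpoint turns out not to require any special headers or cookies, you can sometimes grab the JSON directly. Here’s a minimal sketch using the feed URL from this post (no guarantee it works without the headers the browser sent):

```r
library(jsonlite)

# try the XHR URL spotted in Developer Tools directly; some endpoints
# will serve the JSON without any of the extra headers the browser sent
nh <- fromJSON("http://graphics.latimes.com/election-2016-31146-feed.json")
str(nh, max.level = 1)
```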
While you could go through all the headers and cookies and transcribe them into `httr::GET` or `httr::POST` requests, that’s tedious, especially when most browsers present an option to “Copy as cURL”. cURL is a command-line tool (with a corresponding systems programming library) that you can use to grab data from URIs. The `RCurl` and `curl` packages in R are built with the underlying library. The cURL command line captures all of the information necessary to replicate the request the browser made for a resource. The cURL command line for the URL that gets the Republican data is:
```sh
curl 'http://graphics.latimes.com/election-2016-31146-feed.json' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'X-Requested-With: XMLHttpRequest' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache' -H 'If-None-Match: "7b341d7181cbb9b72f483ae28e464dd7"' -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' --compressed
```
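Transcribing that by hand into `httr` looks something like this (a partial sketch with only a couple of the headers, just to show the tedium; the full request has several more):

```r
library(httr)

# hand-transcribed version of (part of) the cURL command above;
# every header and cookie has to be copied over one by one
res <- GET(
  "http://graphics.latimes.com/election-2016-31146-feed.json",
  add_headers(
    `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36",
    Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"
  ),
  set_cookies(s_fid = "79D97B8B22CA721F-2DD12ACE392FF3B2", s_cc = "true")
)
```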
While that’s easier than manual copy/paste transcription, these requests are uniform enough that there Has To Be A Better Way. And, now there is, with `curlconverter`.

The `curlconverter` package has (for the moment) two main functions:

- `straighten()`: returns a `list` with all of the necessary parts to craft an `httr` `POST` or `GET` call
- `make_req()`: actually returns a working `httr` call, pre-filled with all of the necessary information
By default, either function reads from the clipboard (envision the workflow where you do the “Copy as cURL” then switch to R and type `make_req()` or `req_params <- straighten()`), but they can take in a vector of cURL command lines, too (NOTE: `make_req()` is currently limited to one while `straighten()` can handle as many as you want).
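A quick sketch of both styles (`rep_curl` and `dem_curl` here are hypothetical character strings holding “Copy as cURL” output):

```r
# clipboard workflow: do "Copy as cURL" in the browser, then just call
req_params <- curlconverter::straighten()

# or pass the command lines in explicitly; straighten() takes a vector
# (rep_curl & dem_curl are hypothetical strings of "Copy as cURL" output)
both <- curlconverter::straighten(c(rep_curl, dem_curl))
```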
Let’s show what happens using the election results cURL command line:
```r
REP <- "curl 'http://graphics.latimes.com/election-2016-31146-feed.json' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'X-Requested-With: XMLHttpRequest' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache' -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' --compressed"

resp <- curlconverter::straighten(REP)
jsonlite::toJSON(resp, pretty=TRUE)
## [
##   {
##     "url": ["http://graphics.latimes.com/election-2016-31146-feed.json"],
##     "method": ["get"],
##     "headers": {
##       "Pragma": ["no-cache"],
##       "DNT": ["1"],
##       "Accept-Encoding": ["gzip, deflate, sdch"],
##       "X-Requested-With": ["XMLHttpRequest"],
##       "Accept-Language": ["en-US,en;q=0.8"],
##       "User-Agent": ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36"],
##       "Accept": ["*/*"],
##       "Cache-Control": ["no-cache"],
##       "Connection": ["keep-alive"],
##       "If-Modified-Since": ["Wed, 10 Feb 2016 16:40:15 GMT"],
##       "Referer": ["http://graphics.latimes.com/election-2016-new-hampshire-results/"]
##     },
##     "cookies": {
##       "s_fid": ["79D97B8B22CA721F-2DD12ACE392FF3B2"],
##       "s_cc": ["true"]
##     },
##     "url_parts": {
##       "scheme": ["http"],
##       "hostname": ["graphics.latimes.com"],
##       "port": {},
##       "path": ["election-2016-31146-feed.json"],
##       "query": {},
##       "params": {},
##       "fragment": {},
##       "username": {},
##       "password": {}
##     }
##   }
## ]
```
You can then use the items in the returned list to make a `GET` request manually (but still tediously).
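Something along these lines (a hedged sketch; it assumes the list elements look like the JSON dump above):

```r
library(httr)

# pull the pieces out of the first (and only) request in the list
parts <- resp[[1]]

# rebuild the request by hand from those parts
res <- GET(
  parts$url,
  do.call(add_headers, as.list(parts$headers)),
  do.call(set_cookies, as.list(parts$cookies))
)
```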
`curlconverter`’s `make_req()` will try to do this conversion for you automagically, using `httr`’s little-used `VERB()` function. It’s easier to show than to tell:
```r
curlconverter::make_req(REP)
```
```r
VERB(verb = "GET", url = "http://graphics.latimes.com/election-2016-31146-feed.json",
     add_headers(Pragma = "no-cache",
                 DNT = "1",
                 `Accept-Encoding` = "gzip, deflate, sdch",
                 `X-Requested-With` = "XMLHttpRequest",
                 `Accept-Language` = "en-US,en;q=0.8",
                 `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36",
                 Accept = "*/*",
                 `Cache-Control` = "no-cache",
                 Connection = "keep-alive",
                 `If-Modified-Since` = "Wed, 10 Feb 2016 16:40:15 GMT",
                 Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"))
```
You probably don’t need all of those headers, but it’s much easier to delete the ones you don’t need than to build the call by trial and error. Try assigning the output of that function to a variable and inspecting what’s returned. I think you’ll find this is a big enhancement to your workflows (if you do a lot of this “scraping without scraping”).
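For instance (a minimal sketch, assuming the generated call is evaluated and hands back a standard `httr` response object):

```r
library(httr)
library(jsonlite)

# assumption: make_req() evaluates the VERB() call and returns an httr response
res <- curlconverter::make_req(REP)

# if so, the usual httr/jsonlite accessors should work on it
status_code(res)
results <- fromJSON(content(res, as = "text"))
```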
You can find the package on GitHub. It’s built with V8 and uses a modified version of the curlconverter Node module by Nick Carneiro.
It’s still in beta and could use some tyre kicking. Convos in the comments; issues or feature requests on GH (pls).