Package update: longurl 0.3.0 is hitting CRAN mirrors
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The longurl package has been updated to version 0.3.0 in response to a bug report noting that the URL expansion API it relied on had gone pay-for-use. Since this was the second time a short URL expansion service either went belly-up or introduced breaking changes, the package is now entirely client-side: a very thin, highly focused wrapper around the httr::HEAD() function.
Why longurl?
On the D&D alignment scale, short links are chaotic evil. [Full disclosure: I use shortened links all the time, so the pot is definitely kettle-calling here.] Ostensibly, they make it easier to show memorable links on tiny, glowing rectangles or in printed prose, but they are mostly used to track you directly and to mask other tracking parameters that the target site uses to keep tabs on you. Short URLs are also used by those with even more malicious intent than greedy startups or mega-corporations.
In retrospect, giving a third-party API service access to the URLs you are interested in expanding just exacerbated the tracking problem. Many of these third-party URL expansion services do cache results for a while, so they can be a bit faster than a non-caching package, but there’s nothing stopping you from putting caching code around this one if you are using it in a “production” capacity.
How does the updated package work without a URL expansion API?
By default, httr “verb” requests use the curl package, which is a wrapper for libcurl. The httr verb calls set the “please follow all HTTP status 3xx redirects that are found in responses” option (the libcurl CURLOPT_FOLLOWLOCATION option). Other options can be set to configure the minutiae of how redirect following works. So, just by calling httr::HEAD(some_url) you get built-in short URL expansion (if what you passed in was a short URL or a URL with a redirect).
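As a minimal sketch of that idea (not the package's actual source), the helper below issues a HEAD request and reads the final URL and status code off the response. The names `resolve_url` and `head_fn` are made up for illustration; `head_fn` exists only so the function can be tried with a stand-in when there is no network access, and `maxredirs` is the libcurl redirect limit passed through httr::config().

```r
# Resolve one URL by letting libcurl follow redirects on a HEAD request.
# `resolve_url` and `head_fn` are hypothetical names; `head_fn` defaults to
# httr::HEAD() and can be replaced by a stub for offline experiments.
resolve_url <- function(url, head_fn = httr::HEAD, ...) {
  resp <- head_fn(url, ...)
  list(
    orig_url     = url,
    expanded_url = resp$url,         # URL of the last request in the chain
    status_code  = resp$status_code  # HTTP status of that last request
  )
}

# Example call (needs network access and the httr package):
# resolve_url("http://lnk.direct/zFu", httr::config(maxredirs = 10L))
```

Anything passed in `...` is handed straight to the request function, so timeouts or other libcurl options can be supplied the same way.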
Take, for example, this innocent link: http://lnk.direct/zFu. We can see what goes on under the covers by passing the verbose() option to an httr::HEAD() call:
httr::HEAD("http://lnk.direct/zFu", httr::verbose())
## -> HEAD /zFu HTTP/1.1
## -> Host: lnk.direct
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Cookie: shorturl=4e0aql3p49rat1c8kqcrmv4gn2
## -> Accept: application/json, text/xml, application/xml, */*
## ->
## <- HTTP/1.1 301 Moved Permanently
## <- Server: nginx/1.0.15
## <- Date: Sun, 18 Dec 2016 19:03:48 GMT
## <- Content-Type: text/html; charset=UTF-8
## <- Connection: keep-alive
## <- X-Powered-By: PHP/5.6.20
## <- Expires: Thu, 19 Nov 1981 08:52:00 GMT
## <- Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
## <- Pragma: no-cache
## <- Location: http://ow.ly/Ko70307eKmI
## <-
## -> HEAD /Ko70307eKmI HTTP/1.1
## -> Host: ow.ly
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Accept: application/json, text/xml, application/xml, */*
## ->
## <- HTTP/1.1 301 Moved Permanently
## <- Content-Length: 0
## <- Location: http://bit.ly/2gZq7qG
## <- Connection: close
## <-
## -> HEAD /2gZq7qG HTTP/1.1
## -> Host: bit.ly
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Accept: application/json, text/xml, application/xml, */*
## ->
## <- HTTP/1.1 301 Moved Permanently
## <- Server: nginx
## <- Date: Sun, 18 Dec 2016 19:04:36 GMT
## <- Content-Type: text/html; charset=utf-8
## <- Content-Length: 127
## <- Connection: keep-alive
## <- Cache-Control: private, max-age=90
## <- Location: http://example.com/IT_IS_A_SURPRISE
## <-
## -> HEAD /IT_IS_A_SURPRISE HTTP/1.1
## -> Host: example.com
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Cookie: _csrf/link=g3iBgezgD_OYN0vOh8yI930E1O9ZAKLr4uHmVioxwwQ; mc=null; dmvk=5856d9e39e747; ts=475630; v1st=03AE3C5AD67E224DEA304AEB56361C9F
## -> Accept: application/json, text/xml, application/xml, */*
## ->
## <- HTTP/1.1 200 OK
## ...
## <-
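Cutting through the clutter, we can see that the request follows multiple redirects across multiple URL shorteners: lnk.direct hands off to ow.ly, which hands off to bit.ly, which finally lands on example.com.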
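One way to get that reduced view programmatically is to walk the per-hop headers that an httr response keeps in its `all_headers` element. The sketch below assumes that structure (one entry per request, each with `status` and `headers`); `redirect_chain` is a made-up helper name and `fake_resp` is hand-built stand-in data mirroring the hops in the log above, not a real response.

```r
# Summarise each hop recorded in a response: its status and where it pointed.
# `redirect_chain` is a hypothetical helper; it assumes `resp$all_headers` is
# a list with one element per request, each holding `status` and `headers`.
redirect_chain <- function(resp) {
  hops <- lapply(resp$all_headers, function(h) {
    hdrs <- unclass(h$headers)
    names(hdrs) <- tolower(names(hdrs))  # header names are case-insensitive
    loc <- hdrs[["location"]]
    data.frame(
      status   = as.integer(h$status),
      location = if (is.null(loc)) NA_character_ else loc,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, hops)
}

# Stand-in for a real response, built by hand from the verbose log above.
fake_resp <- list(
  all_headers = list(
    list(status = 301L, headers = list(Location = "http://ow.ly/Ko70307eKmI")),
    list(status = 301L, headers = list(Location = "http://bit.ly/2gZq7qG")),
    list(status = 301L, headers = list(Location = "http://example.com/IT_IS_A_SURPRISE")),
    list(status = 200L, headers = list())
  )
)

chain <- redirect_chain(fake_resp)
```

With a live connection, the same function should accept the object returned by httr::HEAD() directly.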
Here’s what a call to longurl::expand_urls() returns:
longurl::expand_urls("http://lnk.direct/zFu")
## # A tibble: 1 × 3
##                orig_url                        expanded_url status_code
##                   <chr>                               <chr>       <int>
## 1 http://lnk.direct/zFu http://example.com/IT_IS_A_SURPRISE         200
NOTE: the link does actually go somewhere, and somewhere not malicious, political or preachy (a rarity in general in this post-POTUS-election world of ours).
What else is different?
The longurl::expand_urls() function returns a tbl_df and now includes the HTTP status code of the final, resolved link. You can also pass in a custom HTTP referrer, since many (many) sites change behavior depending on the referrer.
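To make the shape of that result and the referrer option concrete, here is a rough sketch of a wrapper in the same spirit. It is not the package's implementation: `expand_urls_sketch`, `referer` and `head_fn` are invented names, the result is a plain data frame instead of a tbl_df, and returning NA for failed requests is an assumption of this sketch.

```r
# Expand a vector of URLs with HEAD requests and return one row per URL.
# All names here are hypothetical; `head_fn` defaults to httr::HEAD() and can
# be swapped for a stub when working offline.
expand_urls_sketch <- function(urls, referer = NULL, head_fn = httr::HEAD) {
  rows <- lapply(urls, function(u) {
    resp <- tryCatch(
      if (is.null(referer)) {
        head_fn(u)
      } else {
        # Send a custom Referer header along with the request.
        head_fn(u, httr::add_headers(Referer = referer))
      },
      error = function(e) NULL  # e.g. unreachable host or timeout
    )
    data.frame(
      orig_url     = u,
      expanded_url = if (is.null(resp)) NA_character_ else resp$url,
      status_code  = if (is.null(resp)) NA_integer_ else as.integer(resp$status_code),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Example call (needs network access and the httr package):
# expand_urls_sketch("http://lnk.direct/zFu", referer = "http://example.com/")
```

Catching errors per URL keeps one dead link from aborting a whole batch, which matters once the input is more than a handful of URLs.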
What’s next?
This bug-fix release had to go out fairly quickly since the package was essentially broken. With the new foundation being built on client-side machinations, future enhancements will be to pull more features (in the machine learning sense) out of the curl or httr requests (I may switch directly to using curl if I need more granular data) and include some basic visualizations for both request trees (most likely using the DiagrammeR and ggplot2 packages). I may try to add a caching layer, but I believe that’s more of a situation-specific feature folks should add on their own, so I may just add a “check hook” capability that will add an extra function call to a cache checking function of your choosing.
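For anyone who wants to roll their own caching in the meantime, one simple approach is an in-memory wrapper keyed by URL. This is only a sketch under the assumption that the wrapped function takes a single URL; `with_cache` and `slow_expand` are made-up names, and nothing here is part of longurl.

```r
# Wrap any single-URL expander with an in-memory cache keyed by the URL.
# `with_cache` is a hypothetical helper, not part of longurl.
with_cache <- function(expander) {
  cache <- new.env(parent = emptyenv())
  function(url) {
    if (!exists(url, envir = cache, inherits = FALSE)) {
      # First time this URL is seen: do the real work and remember the answer.
      assign(url, expander(url), envir = cache)
    }
    get(url, envir = cache, inherits = FALSE)
  }
}

# Offline demonstration with a stand-in expander that counts its calls.
calls <- 0L
slow_expand <- function(url) {
  calls <<- calls + 1L
  paste0(url, "/expanded")
}
cached_expand <- with_cache(slow_expand)
first  <- cached_expand("http://sho.rt/x")
second <- cached_expand("http://sho.rt/x")  # served from the cache

# With the package installed, the same wrapper could be applied as:
# cached_expand <- with_cache(function(u) longurl::expand_urls(u))
```

The cache lives only for the current R session and never expires entries, so a long-running job would want some eviction or time-to-live logic on top.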
If you have a feature request, please add it to the GitHub repo.