Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
A friend asked me how to find the full final address of a URL that had been shortened via a shortening service (e.g., Twitter’s t.co, Google’s goo.gl, Facebook’s fb.me, dft.ba, bit.ly, TinyURL, tr.im, Ow.ly, etc.). I replied that I had no idea and suggested he look on StackOverflow.com or, possibly, the R-help list and, if that didn’t turn up anything, try an online unshortening service like http://unshort.me.
Two minutes later he came back with this solution from Stack Overflow which, surprisingly to me, contained an answer I had provided about 1.5 years ago!
This has always been my problem with programming: I learn something useful and then completely forget it. I’m hoping that keeping this blog will help me remember these sorts of things.
The Objective
I want to decode a shortened URL to reveal its full final web address.
The Solution
The basic idea is to use the getURL function from the RCurl package, telling it to retrieve only the header of the page it connects to, without following the redirect, and then extract the URL from the Location field of that header.
decode_short_url <- function(url, ...) {
  # PACKAGES #
  require(RCurl)

  # LOCAL FUNCTIONS #
  decode <- function(u) {
    # pause briefly between requests
    Sys.sleep(0.5)
    # fetch the header only (nobody = TRUE) and do not follow the redirect
    x <- try(
      getURL(u, header = TRUE, nobody = TRUE, followlocation = FALSE,
             cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
    )
    # if the request failed or there is no Location field, return the input as-is
    if (inherits(x, "try-error") || length(grep(".*Location: (\\S+).*", x)) < 1) {
      return(u)
    } else {
      return(gsub(".*Location: (\\S+).*", "\\1", x))
    }
  }

  # MAIN #
  gc()
  # return decoded URLs, named by the URLs that were passed in
  urls <- c(url, ...)
  l <- lapply(urls, decode)
  names(l) <- urls
  return(l)
}
And here’s how we use it:
# EXAMPLE #
decode_short_url("http://tinyurl.com/adcd", "http://www.google.com")
# $`http://tinyurl.com/adcd`
# [1] "https://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"
You can always find the latest version of this function here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/decode_shortened_url/decode_shortened_url.R
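To make the header-parsing step easier to see in isolation, here is a minimal offline sketch that pulls the redirect target out of a header string without touching the network. It is not the function above: extract_location and example_header are hypothetical names made up for illustration, the header text is invented, and the sketch works line by line rather than with a single gsub call. It also matches the field name case-insensitively, since HTTP header names are not case-sensitive.

```r
# Hypothetical helper (not part of the original function): return the value of
# the Location field from a raw HTTP header string, or NA if there is none.
extract_location <- function(header_text) {
  # split the header block into lines, accepting both CRLF and LF endings
  lines <- strsplit(header_text, "\r?\n")[[1]]
  # find lines that start with "Location:", ignoring case
  hit <- grep("^location:", lines, ignore.case = TRUE, value = TRUE)
  if (length(hit) < 1) return(NA_character_)
  # strip the field name and surrounding whitespace from the first match
  trimws(sub("^location:", "", hit[1], ignore.case = TRUE))
}

# An invented header, shaped like what a shortening service might send back
example_header <- paste(
  "HTTP/1.1 301 Moved Permanently",
  "Location: https://www.r-project.org/",
  "Content-Type: text/html",
  "",
  sep = "\r\n"
)

extract_location(example_header)
# [1] "https://www.r-project.org/"
```

Returning NA when no Location field is present is a design choice for this sketch only; the function above returns the original URL in that case.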
Limitations
A comment on the R-bloggers Facebook page for this blog post made me realise that this doesn’t work with every shortened URL, for example when you need to be logged in to use the service. In such cases the function simply returns the URL it was given, e.g.,
decode_short_url("http://tinyurl.com/adcd", "http://www.google.com", "http://1.cloudst.at/myeg")
# $`http://tinyurl.com/adcd`
# [1] "https://www.r-project.org/"
#
# $`http://www.google.com`
# [1] "http://www.google.co.uk/"
#
# $`http://1.cloudst.at/myeg`
# [1] "http://1.cloudst.at/myeg"
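Because a URL that could not be decoded comes back unchanged, one rough way to spot such cases is to compare each result with its name. The sketch below assumes a list shaped like the output above; unresolved_urls and res are hypothetical names, and res is typed in by hand from that output rather than fetched. Note that "unchanged" only means the request failed or no Location field was found, so a URL that was never shortened in the first place could be flagged as well.

```r
# Hypothetical helper: return the input URLs whose decoded value is identical
# to the input, i.e. the ones the function handed back untouched.
unresolved_urls <- function(decoded) {
  same <- vapply(
    seq_along(decoded),
    function(i) identical(decoded[[i]], names(decoded)[i]),
    logical(1)
  )
  names(decoded)[same]
}

# Hand-typed copy of the result shown above (no network access needed)
res <- list(
  "http://tinyurl.com/adcd"  = "https://www.r-project.org/",
  "http://www.google.com"    = "http://www.google.co.uk/",
  "http://1.cloudst.at/myeg" = "http://1.cloudst.at/myeg"
)

unresolved_urls(res)
# [1] "http://1.cloudst.at/myeg"
```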
I still don’t know why this might be a useful thing to do, but hopefully it’s useful to someone out there.