adaR: An accurate, fast and WHATWG-compliant URL parser
The other week, I found an interesting-looking library on GitHub: ada-url, a WHATWG-compliant and fast URL parser written in modern C++. Since we need such a thing at work to analyze webtracking data, and I had recently wrapped my first C++ library into an R package, I thought I could do the same with ada-url. Little did I know that wrapping the library would be the least tricky part of this endeavor.
Installation
You can install the development version of adaR from GitHub with:
# install.packages("devtools")
devtools::install_github("schochastics/adaR")
The version on CRAN can be installed with:
install.packages("adaR")
Parsing URLs with adaR
I have never dealt with anything that had as many corner cases to consider as parsing URLs. Here are a few that drove me crazy along the way.
readLines("https://raw.githubusercontent.com/schochastics/adaR/main/data-raw/corner.txt")
##  [1] "https://example.com:8080"
##  [2] "http://user:[email protected]"
##  [3] "http://[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:8080"
##  [4] "https://example.com/path/to/resource?query=value&another=thing#fragment"
##  [5] "http://sub.sub.example.com"
##  [6] "ftp://files.example.com:2121/download/file.txt"
##  [7] "http://example.com/path with spaces/and&special=characters?"
##  [8] "https://user:pa%[email protected]/path"
##  [9] "http://example.com/..//a/b/../c/./d.html"
## [10] "https://example.com:8080/over/under?query=param#and-a-fragment"
## [11] "http://192.168.0.1/path/to/resource"
## [12] "http://3com.com/path/to/resource"
## [13] "http://example.com/%7Eusername/"
## [14] "https://example.com/a?query=value&query=value2"
## [15] "https://example.com/a/b/c/.."
## [16] "ws://websocket.example.com:9000/chat"
## [17] "https://example.com:65535/edge-case-port"
## [18] "file:///home/user/file.txt"
## [19] "http://example.com/a/b/c/%2F%2F"
## [20] "http://example.com/a/../a/../a/../a/"
## [21] "https://example.com/./././a/"
## [22] "http://example.com:8080/a;b?c=d#e"
## [23] "http://@example.com"
## [24] "http://example.com/@test"
## [25] "http://example.com/@@@/a/b"
## [26] "https://example.com:0/"
## [27] "http://example.com/%25path%20with%20encoded%20chars"
## [28] "https://example.com/path?query=%26%3D%3F%23"
## [29] "http://example.com:8080/?query=value#fragment#fragment2"
## [30] "https://example.xn--80akhbyknj4f/path/to/resource"
## [31] "https://example.co.uk/path/to/resource"
## [32] "http://username:pass%[email protected]"
## [33] "ftp://downloads.example.edu:3030/files/archive.zip"
## [34] "https://example.com:8080/this/is/a/deeply/nested/path/to/a/resource"
## [35] "http://another-example.com/..//test/./demo.html"
## [36] "https://sub2.sub1.example.org:5000/login?user=test#section2"
## [37] "ws://chat.example.biz:5050/livechat"
## [38] "http://192.168.1.100/a/b/c/d"
## [39] "https://secure.example.shop/cart?item=123&quantity=5"
## [40] "http://example.travel/%60%21%40%23%24%25%5E%26*()"
## [41] "https://example.museum/path/to/artifact?search=ancient"
## [42] "ftp://secure-files.example.co:4040/files/document.docx"
## [43] "https://test.example.aero/booking?flight=abc123"
## [44] "http://example.asia/%E2%82%AC%E2%82%AC/path"
## [45] "http://subdomain.example.tel/contact?name=john"
## [46] "ws://game-server.example.jobs:2020/match?id=xyz"
## [47] "http://example.mobi/path/with/mobile/content"
## [48] "https://example.name/family/tree?name=smith"
## [49] "http://192.168.2.2/path?query1=value1&query2=value2"
## [50] "http://example.pro/professional/services"
## [51] "https://example.info/information/page"
## [52] "http://example.int/internal/systems/login"
## [53] "https://example.post/postal/services"
## [54] "http://example.xxx/age/verification"
## [55] "https://example.xxx/another/edge/case/path?with=query#and-fragment"
The corner case that actually got me interested in URL parsing was something like http://example.com/@test, because the “@” trips up the established parser urltools.
urltools::url_parse("http://example.com/@test")
##   scheme domain port path parameter fragment
## 1   http   test <NA> <NA>      <NA>     <NA>
Unfortunately, “@” is quite common in URLs these days thanks to social media, and thus appears frequently in webtracking data.
adaR is able to handle these types of URLs.
adaR::ada_url_parse("http://example.com/@test")
##                       href protocol username password        host    hostname
## 1 http://example.com/@test    http:                   example.com example.com
##   port pathname search hash
## 1        /@test
What you can see is that adaR follows a different naming scheme and returns more components than urltools. These terms and a more general introduction to URL parsing can be found in the introductory vignette via vignette("adaR").
Here is one complete example of a URL that contains all components.
adaR::ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")
##                                                      href protocol username
## 1 https://user_1:password_1@example.org:8080/api?q=1#frag   https:   user_1
##     password             host    hostname port pathname search  hash
## 1 password_1 example.org:8080 example.org 8080     /api   ?q=1 #frag
ada_url_parse() is the workhorse of adaR and always returns all components of a URL. An important difference to urltools is that adaR only returns parsed components if the input is a valid URL; otherwise every component is NA. urltools parses any type of input.
urltools::url_parse("I am not a URL")
##   scheme         domain port path parameter fragment
## 1   <NA> i am not a url <NA> <NA>      <NA>     <NA>
adaR::ada_url_parse("I am not a URL")
##             href protocol username password host hostname port pathname search
## 1 I am not a URL     <NA>     <NA>     <NA> <NA>     <NA> <NA>     <NA>   <NA>
##   hash
## 1 <NA>
A downside of this strict rule is that URLs without a protocol are not parsed.
adaR::ada_url_parse("domain.de/path/to/file")
##                     href protocol username password host hostname port pathname
## 1 domain.de/path/to/file     <NA>     <NA>     <NA> <NA>     <NA> <NA>     <NA>
##   search hash
## 1   <NA> <NA>
One can argue whether this is a bug or a feature, but for the time being we stay consistent with the underlying C++ library in this case.
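If scheme-less inputs are common in your data, one possible workaround is to prepend a default scheme before parsing. The following is only a minimal sketch in base R: the helper add_default_scheme() is hypothetical and not part of adaR, and the choice of "https" as the default is an assumption that may not fit every data set.

```r
# Hypothetical helper (not part of adaR): prepend a default scheme to inputs
# that do not already start with "<scheme>://".
add_default_scheme <- function(urls, scheme = "https") {
  # A scheme is a letter followed by letters, digits, "+", "." or "-".
  has_scheme <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", urls)
  # Leave missing values and inputs that already carry a scheme untouched.
  ifelse(is.na(urls) | has_scheme, urls, paste0(scheme, "://", urls))
}

fixed <- add_default_scheme(c("domain.de/path/to/file", "http://example.com/@test"))
print(fixed)
```

The result can then be passed to adaR::ada_url_parse(). Note that this sketch treats anything without "://" as scheme-less, so inputs such as "mailto:" addresses would be rewritten as well.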
If you only need one specific component of a URL, you can use the specialized ada_get_*() functions.
To check if a component is present, use ada_has_*().
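As a short sketch of both function families, the snippet below calls two getters and two predicates on a URL that has credentials, a port, a query and a fragment. It assumes that adaR is installed and that ada_get_hostname(), ada_get_pathname(), ada_has_search() and ada_has_port() are exported as in the package documentation; the guard only keeps the sketch from erroring when the package is missing.

```r
url <- "https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"

# Only run when adaR is installed, so the sketch does not error otherwise.
if (requireNamespace("adaR", quietly = TRUE)) {
  # Getters return one component per input URL.
  hostname <- adaR::ada_get_hostname(url)
  pathname <- adaR::ada_get_pathname(url)

  # Predicates return one logical value per input URL.
  has_search <- adaR::ada_has_search(url)
  has_port <- adaR::ada_has_port(url)

  print(data.frame(hostname, pathname, has_search, has_port))
}
```

The values should agree with the corresponding columns of ada_url_parse() for the same URL.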
Benchmark
We conducted a series of benchmark tests with hard-to-parse URLs. The results can be found on GitHub. Here I will just summarize some of the runtime results.
bench::mark(
  urltools = urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
  ada = adaR::ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE),
  iterations = 1000, check = FALSE
)
## # A tibble: 2 × 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 urltools      344µs    371µs     2685.    2.49KB     16.2
## 2 ada           513µs    556µs     1778.    2.49KB     16.1

# crawl of the top visited 100 websites (98000 unique URLs)
top100 <- readLines("https://raw.githubusercontent.com/ada-url/url-various-datasets/main/top100/top100.txt")
bench::mark(
  urltools = urltools::url_parse(top100),
  ada = adaR::ada_url_parse(top100, decode = FALSE),
  iterations = 1, check = FALSE
)
## # A tibble: 2 × 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 urltools      182ms    182ms      5.49    8.08MB        0
## 2 ada           217ms    217ms      4.62    9.18MB        0
ada-url is a really fast parser, but bringing this performance to R was not that easy. While adaR is still slightly slower than urltools in these benchmarks, the added accuracy makes up for this (at least in our use case).
Public Suffix parsing
The package also implements a public suffix extractor, public_suffix(), based on a lookup of the Public Suffix List.
Note that from this list, we only include registry suffixes (e.g., com, co.uk), which are those controlled by a domain name registry and governed by ICANN.
We do not include “private” suffixes (e.g., blogspot.com) that allow people to register subdomains. Hence, we use the term domain in the sense of “top domain under a registry suffix”.
urls <- c(
  "https://subsub.sub.domain.co.uk",
  "https://domain.api.gov.uk",
  "https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
adaR::public_suffix(urls)
## [1] "co.uk"                            "gov.uk"
## [3] "butthisispartoftheps.kawasaki.jp"
If you are wondering about the last URL: the list also contains wildcard suffixes such as *.kawasaki.jp, which need to be matched.
(THIS specifically was one of the trickier things to implement…)
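To illustrate why wildcard rules complicate the lookup, here is a minimal base R sketch of longest-suffix matching with wildcard support. It is not adaR's implementation: the helper match_suffix() and the tiny rule set are made up for illustration, and the exception rules of the real Public Suffix List (entries prefixed with "!") are deliberately ignored.

```r
# Toy rule set for illustration only; the real Public Suffix List is far larger
# and also contains exception rules ("!...") that this sketch ignores.
rules <- c("com", "uk", "co.uk", "gov.uk", "jp", "*.kawasaki.jp")

# Hypothetical helper: return the longest suffix of `hostname` matching a rule.
match_suffix <- function(hostname, rules) {
  labels <- strsplit(tolower(hostname), ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Try candidates from longest to shortest, so the first hit is the longest.
  for (i in seq_len(n)) {
    candidate <- paste(labels[i:n], collapse = ".")
    if (candidate %in% rules) return(candidate)
    # Wildcard rule: "*.<parent>" matches any single label in front of <parent>.
    if (i < n) {
      parent <- paste(labels[(i + 1):n], collapse = ".")
      if (paste0("*.", parent) %in% rules) return(candidate)
    }
  }
  NA_character_
}

hosts <- c(
  "subsub.sub.domain.co.uk",
  "domain.api.gov.uk",
  "thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
suffixes <- vapply(hosts, match_suffix, character(1), rules = rules, USE.NAMES = FALSE)
print(suffixes)
```

An exact lookup alone would only find "jp" for the last host; the extra wildcard check is what extends the match by one label.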
As a benchmark, we compare adaR with urltools and additionally with psl, a wrapper for a C library that extracts public suffixes.
bench::mark(
  urltools = urltools::suffix_extract("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
  ada = adaR::public_suffix("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
  psl = psl::public_suffix("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
  iterations = 1000, check = FALSE
)
## # A tibble: 3 × 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 urltools    329.2µs 371.37µs     2616.   97.16KB     7.87
## 2 ada          18.8µs  19.93µs    49084.    5.17KB     0
## 3 psl           3.5µs   3.73µs   260571.   17.62KB     0
(This comparison is not fair to urltools, since the function suffix_extract does more than just extract the public suffix.)
psl is clearly the fastest, which is not surprising given that it is based on extremely efficient C code. Our implementation is quite similar to how urltools handles suffixes and is not too far behind psl.
So, while psl is clearly favored in terms of runtime, it comes with the drawback that it is only available via GitHub (which is not optimal if you want to depend on it) and has a system requirement that (according to GitHub) is not available on Windows. If those two things do not matter to you and you need to process an enormous number of URLs, then you should use psl.
Summary
I am not from the marketing department, so I will say it how it is: adaR does not bring much new to the table besides slightly more robust URL parsing. However, this accuracy can be important when dealing with webtracking data, which is a big deal for us at the moment.
Addendum
adaR is part of a series of R packages to analyse webtracking data:
- webtrackR: preprocess raw webtracking data
- domainator: classify domains
- adaR: parse urls
Huge thanks to my colleague Chung-hong Chan, who greatly improved the package and also taught me one or two things about C++ in R code. He also wrote a blog post about the dev process.
The logo is created from this portrait of Ada Lovelace, a very early pioneer in Computer Science.