Two new Apache Drill UDFs for Processing UR[IL]s and Internet Domain Names
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Continuing the blog’s UDF theme of late, there are two new UDF kids in town:
drill-url-tools
for slicing & dicing URI/URLs (just going to use ‘URL’ from now on in the post)drill-domain-tools
for slicing & dicing internet domain names (IDNs).
Now, if you’re an Apache Drill fanatic, you’re likely thinking “Hey hrbrmstr: don’t you know that Drill has a parse_url()
function already?” My answer is “Sure, but it’s based on java.net.URL
which is fundamentally broken.”
Slicing & dicing URLs and IDNs is a large part of the $DAYJOB
and they go together pretty well, hence the joint UDF release.
Rather than just use boring SQL for an example, we’ll start with some SQL and use R for a decent example of working with the two, new UDFs.
Counting Lying Lock Icons
SSL/TLS is all the craze these days, so let’s see how many distinct sites in the GDELT Global Front Page (GFG) data set use port 443 vs port 80 (a good indicator, plus it will help show how the URL tools pick up ports even when they’re not there).
If you go to the aforementioned URL it instructs us that the most current GFG dataset URL can be retrieved by inspecting the contents of this metadata URL
There are over a million records in that data set but — as we’ll see — not nearly as many distinct hosts.
Let’s get the data:
library(sergeant) library(tidyverse) read_delim( file = "http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt", delim = " ", col_names = FALSE, col_types = "ccc" ) -> gfg_update dl_path <- file.path("~/Data/gfg_links.tsv.gz") if (!file.exists(dl_path)) download.file(gfg_update$X3[1], dl_path)
Those operations have placed the GFG data set in a place where my local Drill instance can get to them. It's a tab separated file (TSV) which — while not a great data format — is workable with Drill.
Now we'll setup a SQL query that will parse the URLs and domains, giving us a nice rectangular structure for R & dbplyr
. We'll use the second column since a significant percentage of the URLs in column 6 are malformed:
db <- src_drill() tbl(db, "( SELECT b.host, port, b.rec.hostname AS hostname, b.rec.assigned AS assigned, b.rec.tld AS tld, b.rec.subdomain AS subdomain FROM (SELECT host, port, suffix_extract(host) AS rec -- break the hostname into components FROM (SELECT a.rec.host AS host, a.rec.port AS port FROM (SELECT columns[1] AS url, url_parse(columns[1]) AS rec -- break the URL into components FROM dfs.d.`/gfg_links.tsv.gz`) a WHERE a.rec.port IS NOT NULL -- filter out URL parsing failures ) ) b WHERE b.rec.tld IS NOT NULL -- filter out domain parsing failures )") -> gfg_df gfg_df ## # Database: DrillConnection ## hostname port host subdomain assigned tld ## <chr> <int> <chr> <chr> <chr> ## 1 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 2 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 3 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 4 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 5 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 6 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 7 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 8 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 9 www 80 www.eestikirik.ee NA eestikirik.ee ee ## 10 www 80 www.eestikirik.ee NA eestikirik.ee ee ## # ... with more rows
While we could have done it all in SQL, we saved some bits for R:
distinct(gfg_df, assigned, port) %>% count(port) %>% collect() -> port_counts port_counts # A tibble: 2 x 2 port n * <int> <int> 1 80 20648 2 443 22178
You'd think more news-oriented sites would be HTTPS by default given the current global political climate (though those lock icons are no safety panacea by any stretch of the imagination).
FIN
Now, R can do URL & IDN slicing, but Drill can operate at-scale. That is, R's urltools
package may be fine for single-node, in-memory ops, but Drill can process billions of URLs when part of a cluster.
I'm not 100% settled on the galimatias
library for URL parsing (I need to do some extended testing) and I may add some less-strict IDN slicing & dicing functions as well.
Kick the tyres & file issues & PRs as necessary.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.