Pirating Web Content Responsibly With R
International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back.
There will be no ‘rrrrrr’ abuse in this post, I’m afraid, but there will be plenty of R code.
We’re going to combine pirate day with “pirating” data, in the sense that I’m going to show one way to use the web scraping powers of R responsibly to collect and explore data on modern-day pirate encounters.
Scouring The Seas (Well, The Web) For Pirate Data
Interestingly enough, there are many sources for pirate data. I’ve blogged a few in the past, but I came across a new (to me) one by the International Chamber of Commerce. Their Commercial Crime Services division has something called the Live Piracy & Armed Robbery Report:
(site png snapshot taken with splashr)
I fiddled a bit with the URL and — sure enough — if you work a bit you can get data going back to late 2013, all in the same general format, so I jotted down base URLs and start+end record values and filed them away for future use:
library(V8)
library(stringi)
library(httr)
library(rvest)
library(robotstxt)
library(jwatr) # github/hrbrmstr/jwatr
library(hrbrthemes)
library(purrrlyr)
library(rprojroot)
library(tidyverse)

report_urls <- read.csv(stringsAsFactors=FALSE, header=TRUE, text="url,start,end
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/, 1345, 1459
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/151/, 1137, 1339
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/details/146/, 885, 1138
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/144/, 625, 884
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/133/, 337, 623")

by_row(report_urls, ~sprintf(.x$url %s+% "%s", .x$start:.x$end), .to="url_list") %>%
  pull(url_list) %>%
  flatten_chr() -> target_urls

head(target_urls)
## [1] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345"
## [2] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1346"
## [3] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1347"
## [4] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1348"
## [5] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1349"
## [6] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1350"
Time to pillage some details!
But…Can We Really Do It?
I poked around the site’s terms of service/terms and conditions and automated retrieval was not discouraged. Yet, those aren’t the only sea mines we have to look out for. Perhaps they use their robots.txt to stop pirates. Let’s take a look:
robotstxt::get_robotstxt("https://www.icc-ccs.org/")
## # If the Joomla site is installed within a folder such as at
## # e.g. www.example.com/joomla/ the robots.txt file MUST be
## # moved to the site root at e.g. www.example.com/robots.txt
## # AND the joomla folder name MUST be prefixed to the disallowed
## # path, e.g. the Disallow rule for the /administrator/ folder
## # MUST be changed to read Disallow: /joomla/administrator/
## #
## # For more information about the robots.txt standard, see:
## # http://www.robotstxt.org/orig.html
## #
## # For syntax checking, see:
## # http://www.sxw.org.uk/computing/robots/check.html
##
## User-agent: *
## Disallow: /administrator/
## Disallow: /cache/
## Disallow: /cli/
## Disallow: /components/
## Disallow: /images/
## Disallow: /includes/
## Disallow: /installation/
## Disallow: /language/
## Disallow: /libraries/
## Disallow: /logs/
## Disallow: /media/
## Disallow: /modules/
## Disallow: /plugins/
## Disallow: /templates/
## Disallow: /tmp/
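The robotstxt package can also answer the question for specific paths directly. Here’s a quick spot-check of one of the report URLs (my addition, not part of the original crawl; given the Disallow list above it should come back TRUE):

robotstxt::paths_allowed(
  paths  = "/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345",
  domain = "www.icc-ccs.org"
)
# expected TRUE: the report paths aren't covered by any of the Disallow rules above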
Ahoy! We’ve got a license to pillage!
But, we don’t have a license to abuse their site.
While I still haven’t had time to follow up on an earlier post about ‘crawl-delay’ settings across the internet, I have done enough work on it to know that a 5 or 10 second delay is the most common setting (when sites bother to have this directive in their robots.txt file). ICC’s site does not have this setting defined, but we’ll still crawl responsibly and use a 5 second delay between requests:
s_GET <- safely(GET)

pb <- progress_estimated(length(target_urls))
map(target_urls, ~{
  pb$tick()$print()
  Sys.sleep(5)
  s_GET(.x)
}) -> httr_raw_responses

write_rds(httr_raw_responses, "data/2017-icc-ccs-raw-httr-responses.rds")

good_responses <- keep(httr_raw_responses, ~!is.null(.x$result))

jwatr::response_list_to_warc_file(good_responses, "data/icc-good")
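As an aside, you can check for a published Crawl-delay directive programmatically rather than reading the file by eye. A minimal sketch, assuming the robotstxt package’s parser (for this site the result is empty, which is why the 5 seconds above is just a hand-picked polite default):

rt_txt <- robotstxt::get_robotstxt("https://www.icc-ccs.org/")
robotstxt::parse_robotstxt(rt_txt)$crawl_delay
# comes back empty here, matching the robots.txt shown earlier (no Crawl-delay directive)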
There are more “safety” measures you can use with httr::GET() but this one is usually sufficient. It just prevents the iteration from dying when there are hard retrieval errors.
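A couple of other guard rails can be bolted on as well. This is just a sketch with illustrative values (the user agent string, timeout and retry settings are mine, not what was used for this crawl):

# identify yourself and cap how long any single request can hang
polite_GET <- function(url) {
  GET(
    url,
    user_agent("piracy-data-research (your-email@example.com)"),
    timeout(30)
  )
}
s_GET <- safely(polite_GET)

# or let httr retry transient failures with exponential backoff
s_RETRY <- safely(function(url) RETRY("GET", url, times = 3, pause_base = 2))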
I also like to save off the crawl results so I can go back to the raw files (if needed) vs re-scrape the site (this crawl takes a while). I do it two ways here: first as raw httr response objects (including any “broken” ones), then by filtering out the “complete” responses and saving them in WARC format so the data is in a more common format for sharing with others who may not use R.
Digging For Treasure
Did I mention that while the site looks like it’s easy to scrape, it’s really not easy to scrape? That nice-looking table is a sea mirage, ready to trap unwary crawlers in a pit of despair. The UX is built dynamically from on-page javascript content, with the data we need tucked inside a <script> tag rather than the HTML table itself.
Now, you’re likely thinking: “Don’t we need to re-scrape the site with seleniumPipes or splashr?”
Fear not, stout yeoman! We can do this with the content we have if we don’t mind swabbing the decks first. Let’s put the map code up first and then dig into the details:
# make field names great again
mfga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  x
}

# I know the columns I want and this makes getting them into the types I want easier
cols(
  attack_number = col_character(),
  attack_posn_map = col_character(),
  date = col_datetime(format = ""),
  date_time = col_datetime(format = ""),
  id = col_integer(),
  location_detail = col_character(),
  narrations = col_character(),
  type_of_attack = col_character(),
  type_of_vessel = col_character()
) -> pirate_cols

# a javascript context for V8 to evaluate the scraped <script> content in
# (without this, the ctx$eval()/ctx$get() calls below have nothing to talk to)
ctx <- v8()

# iterate over the good responses with a progress bar
pb <- progress_estimated(length(good_responses))
map_df(good_responses, ~{

  pb$tick()$print()

  # `safely` hides the data under `result` so expose it
  doc <- content(.x$result)

  # target the `<script>` tag that has our data, carve out the target lines,
  # do some data massaging and evaluate the javascript with V8
  html_nodes(doc, xpath=".//script[contains(., 'requirejs')]") %>%
    html_text() %>%
    stri_split_lines() %>%
    .[[1]] %>%
    grep("narrations_ro", ., value=TRUE) %>%
    sprintf("var dat = %s;", .) %>%
    ctx$eval()

  p <- ctx$get("dat", flatten=TRUE)

  # now, process that data, turning the ugly returned list content into
  # something we can put in a data frame
  keep(p[[1]], is.list) %>%
    map_df(~{
      list(
        field = mfga(.x[[3]]$label),
        value = .x[[3]]$value
      )
    }) %>%
    filter(value != "") %>%
    distinct(field, .keep_all = TRUE) %>%
    spread(field, value)

}) %>%
  type_convert(col_types = pirate_cols) %>%
  filter(stri_detect_regex(attack_number, "^[[:digit:]]")) %>%
  filter(lubridate::year(date) > 2012) %>%
  mutate(
    attack_posn_map = stri_replace_last_regex(attack_posn_map, ":.*$", ""),
    attack_posn_map = stri_replace_all_regex(attack_posn_map, "[\\(\\) ]", "")
  ) %>%
  separate(attack_posn_map, sep=",", into=c("lat", "lng")) %>%
  mutate(lng = as.numeric(lng), lat = as.numeric(lat)) -> pirate_df

write_rds(pirate_df, "data/pirate_df.rds")
The first bit there is a function to “make field names great again”. We’re processing some ugly list data and it’s not all uniform across all years so this will help make the data wrangling idiom more generic.
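To make that concrete, here’s what mfga() does with the kind of messy, duplicated labels the site hands back (the sample labels are my own illustration, not actual site output):

mfga(c("Date:", "Date:", "Type of Attack"))
## [1] "date"           "date_1"         "type_of_attack"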
Next, I set up a cols object because we’re going to be extracting data from text as text, and I think it’s cleaner to type_convert at the end vs have a slew of as.numeric() (et al) statements in-code (for small munging). You’ll note at the end of the munging pipeline I still need to do some manual conversions.
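As a tiny illustration of that pattern (toy data, not the scraped fields):

readr::type_convert(
  tibble::tibble(id = c("1", "2"), date = c("2017-09-19 00:00:00", "2017-09-20 06:30:00")),
  col_types = readr::cols(id = readr::col_integer(), date = readr::col_datetime(format = ""))
)
# yields a tibble with `id` as an integer column and `date` as a datetime column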
Now we can iterate over the good (complete) responses.
The purrr::safely function shoves the real httr response in result, so we focus on that, then “surgically” extract the target data from the <script> tag. Once we have it, we get it into a form we can feed into the V8 javascript engine and then retrieve the data from said evaluation.
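If the V8 round trip feels opaque, here’s the same mechanic stripped to its bones with made-up data (the real structure the site returns is considerably messier):

library(V8)

ctx <- v8()                                                      # fresh javascript context
ctx$eval("var dat = [{label: 'Date:', value: '2017-09-19'}];")   # run the JS, binding `dat`
ctx$get("dat", flatten = TRUE)   # pull `dat` back into R (a small data frame with label/value columns)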
Because ICC used the same Joomla plugin over the years, the data is uniform, but it can also contain additional fields, so we extract the fields in a generic manner. During the course of data wrangling, I noticed there were often multiple Date: fields, so we throw in some logic to help avoid duplicate field names as well.
That whole process goes really quickly, but why not save off the clean data at the end for good measure?
Gotta Have A Pirate Map
Now we can begin to explore the data. I’ll leave most of that to you (since I’m providing the scraped data on GitHub), but here are a few views. First, just some simple counts per month:
mutate(pirate_df,
       year = lubridate::year(date),
       year_mon = as.Date(format(date, "%Y-%m-01"))) %>%
  count(year_mon) %>%
  ggplot(aes(year_mon, n)) +
  geom_segment(aes(xend=year_mon, yend=0)) +
  scale_y_comma() +
  labs(x=NULL, y=NULL,
       title="(Confirmed) Piracy Incidents per Month",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="Y")
And, finally, a map showing pirate encounters but colored by year:
world <- map_data("world")

mutate(pirate_df, year = lubridate::year(date)) %>%
  arrange(year) %>%
  mutate(year = factor(year)) -> plot_df

ggplot() +
  geom_map(data = world, map = world,
           aes(x=long, y=lat, map_id=region),
           fill="#b2b2b2") +
  geom_point(data = plot_df, aes(lng, lat, color=year), size=2, alpha=1/3) +
  ggalt::coord_proj("+proj=wintri") +
  viridis::scale_color_viridis(name=NULL, discrete=TRUE) +
  labs(x=NULL, y=NULL,
       title="Piracy Incidents per Month (Confirmed)",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="XY") +
  theme(legend.position = "bottom")
Taking Up The Mantle of the Dread Pirate Hrbrmstr
Hopefully this post shed some light on scraping responsibly and using different techniques to get to hidden data in web pages.
There’s some free-form text in the data and more than a few other ways to look at it. You can find the code and data on GitHub, and don’t hesitate to ask questions in the comments or file an issue. If you make something, blog it! Share your ideas and creations with the rest of the R (or other language) communities!