{catchpole} Redux and Hashing Files & Websites with {ssdeepr}
Über Tuesday has come and almost gone (some state results will take a while to coalesce) and I’m relieved to say that {catchpole} did indeed work, with the example code from before producing this on first run:
If we tweak the buffer space around the squares, I think the cartogram looks better:
But you should likely use a different palette (see this Twitter thread for examples).
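If you want to experiment with that, here’s a minimal sketch (the hex values are just placeholders, not the thread’s picks): build an alternate named palette and drop it into the values argument of the scale_fill_manual() call shown further down.

# placeholder palette (RColorBrewer "Dark2" hex values); swap in whatever
# palette you prefer and pass it to `values` in scale_fill_manual() below
alt_pal <- c(
  "Biden"     = "#1b9e77",
  "Sanders"   = "#d95f02",
  "Warren"    = "#7570b3",
  "Buttigieg" = "#e7298a",
  "Klobuchar" = "#66a61e",
  "Gabbard"   = "#e6ab02",
  "Bloomberg" = "#a6761d"
)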
I noted in the previous post that borders might be possible. While I haven’t solved that use-case for individual states, I did manage to come up with a method for making a light version of the cartogram usable:
library(sf)
library(hrbrthemes)
library(catchpole)
library(tidyverse)

delegates <- read_delegates()

candidates_expanded <- expand_candidates()

gsf <- left_join(delegates_map(), candidates_expanded, by = c("state", "idx"))

m <- delegates_map()

# split off each "area" on the map so we can make a border+background
list(
  setdiff(state.abb, c("HI", "AK")),
  "AK", "HI", "DC", "VI", "PR", "MP", "GU", "DA", "AS"
) %>%
  map(~{
    suppressWarnings(suppressMessages(st_buffer(
      x = st_union(m[m$state %in% .x, ]),
      dist = 0.0001,
      endCapStyle = "SQUARE"
    )))
  }) -> m_borders

# named palette, so we can also use it to limit the legend to candidates
# that actually appear in the data
delegates_pal <- c(
  "Biden" = "#f0027f",
  "Sanders" = "#7fc97f",
  "Warren" = "#beaed4",
  "Buttigieg" = "#fdc086",
  "Klobuchar" = "#ffff99",
  "Gabbard" = "#386cb0",
  "Bloomberg" = "#bf5b17"
)

# draw the borders/backgrounds first, then the delegate squares on top
gg <- ggplot()

for (mb in m_borders) {
  gg <- gg + geom_sf(data = mb, col = "#2b2b2b", size = 0.125)
}

gg +
  geom_sf(
    data = gsf,
    aes(fill = candidate),
    col = "white", shape = 22, size = 3, stroke = 0.125
  ) +
  scale_fill_manual(
    name = NULL,
    na.value = "#f0f0f0",
    values = delegates_pal,
    limits = intersect(unique(delegates$candidate), names(delegates_pal))
  ) +
  guides(fill = guide_legend(override.aes = list(size = 4))) +
  coord_sf(datum = NA) +
  theme_ipsum_es(grid = "") +
  theme(legend.position = "bottom")
{ssdeepr}
Researcher pals over at Binary Edge added web page hashing (pre- and post-javascript scraping) to their platform using ssdeep. This approach falls into the category of context triggered piecewise hashes (CTPH), also called locality sensitive hashing, and is similar to my R adaptation/packaging of Trend Micro’s tlsh.
Since I’ll be working with BE’s data off-and-on and the ssdeep project has a well-crafted library (plus we might add ssdeep support at $DAYJOB), I went ahead and packaged that up as well.
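To get a feel for how these fuzzy hashes behave before pointing them at real data, here’s a minimal, self-contained sketch (a toy example of mine, not from BE’s data): identical content scores 100, and a near-identical blob should score high but below that.

library(ssdeepr)

# ssdeep similarity scores run from 0 (no similarity) to 100 (identical);
# content needs to be reasonably large for the block size to be
# meaningful, hence the repetition below
base    <- paste(rep("The quick brown fox jumps over the lazy dog.", 500), collapse = " ")
tweaked <- sub("lazy dog", "sleepy dog", base) # one small edit

h_a <- hash_raw(charToRaw(base))
h_b <- hash_raw(charToRaw(tweaked))

hash_compare(h_a, h_a) # identical content    => 100
hash_compare(h_a, h_b) # near-identical blobs => high, but typically < 100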
I recommend using the hash_con() function if you need to read large blobs, since it doesn’t require you to read everything into memory first (hash_file() doesn’t either, but that’s a direct, low-level call to the underlying ssdeep library’s file reader and isn’t as flexible as R connections are).
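As a quick, hedged illustration of the two entry points (the file name here is hypothetical), both routes should produce comparable hashes for the same bytes.

library(ssdeepr)

# "big-dump.json" is a hypothetical local file, used only for illustration
h_file <- hash_file("big-dump.json")         # low-level ssdeep file reader
h_conn <- hash_con(file("big-dump.json"))    # any R connection: file(), url(), gzfile(), ...

hash_compare(h_file, h_conn) # same bytes either way, so expect 100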
These types of hashes are great for seeing whether something has changed on a website (or for seeing how similar two things are to each other). For instance, how closely do CRAN mirrors match the mothership?
library(ssdeepr) # see the links above for installation

cran1 <- hash_con(url("https://cran.r-project.org/web/packages/available_packages_by_date.html"))
cran2 <- hash_con(url("https://cran.biotools.fr/web/packages/available_packages_by_date.html"))
cran3 <- hash_con(url("https://cran.rstudio.org/web/packages/available_packages_by_date.html"))

hash_compare(cran1, cran2)
## [1] 0

hash_compare(cran1, cran3)
## [1] 94
I picked on cran.biotools.fr as I saw it was well behind CRAN proper on the monitoring page.
I noted that BE was doing pre- and post-javascript hashing as well. Why, you may ask? Well, websites behave differently with javascript running, and they can also behave differently when different user-agents are set. We’ll fetch the same page from Wikipedia a few different ways to show how the retrieved content is not alike at all, depending on the retrieval context. First, let’s grab some web content!
library(httr)
library(ssdeepr)
library(splashr)

# regular grab
h1 <- hash_con(url("https://en.wikipedia.org/wiki/Donald_Knuth"))

# you need Splash running for javascript-enabled scraping this way
sp <- splash(host = "mysplashhost", user = "splashuser", pass = "splashpass")

# js-enabled with one ua
sp %>%
  splash_user_agent(ua_macos_chrome) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js1

# js-enabled with another ua
sp %>%
  splash_user_agent(ua_ios_safari) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js2

h2 <- hash_raw(js1)
h3 <- hash_raw(js2)

# same way {rvest} does it
res <- httr::GET("https://en.wikipedia.org/wiki/Donald_Knuth")

h4 <- hash_raw(content(res, as = "raw"))
Now, let’s compare them:
hash_compare(h1, h4) # {ssdeepr} built-in vs httr::GET() => not surprising that they're equal
## [1] 100

# things look way different with js-enabled
hash_compare(h1, h2)
## [1] 0

hash_compare(h1, h3)
## [1] 0

# and with variations between user-agents
hash_compare(h2, h3)
## [1] 0

hash_compare(h2, h4)
## [1] 0

# only doing this for completeness
hash_compare(h3, h4)
## [1] 0
For this example, content size alone would (mostly) have been enough to tell the difference; note how h1 and h4 hash as identical even though slightly more characters come back with the {httr} method:
length(js1)
## [1] 432914

length(js2)
## [1] 270538

nchar(
  paste0(
    readLines(url("https://en.wikipedia.org/wiki/Donald_Knuth")),
    collapse = "\n"
  )
)
## [1] 373078

length(content(res, as = "raw"))
## [1] 374099
FIN
If you were in a U.S. state with a primary yesterday and were eligible to vote (and had something to vote for, either a (D) candidate or a state/local bit of business) I sure hope you did!
The ssdeep library works on Windows, so I’ll be figuring out how to get that going in {ssdeepr} fairly soon (mostly as an excuse to try out the Rtools 4.0 toolchain rather than out of any deliberate desire to support legacy platforms).
As usual, drop issues/PRs/feature requests where you’re comfortable for any of these or other packages.