New R Package For HTTP Headers Hashing

[This article was first published on rud.is"In God we trust. All others must bring data"R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

HTTP Headers Hashing (HHHash) is a technique developed by Alexandre Dulaunoy to generate a fingerprint of an HTTP server based on the headers it returns. It employs one-way hashing to generate a hash value from the list of header keys returned by the server. The HHHash value is calculated by concatenating the list of headers returned, ordered by sequence, with each header value separated by a colon. The SHA256 of this concatenated list is then taken to generate the HHHash value. HHHash incorporates a version identifier to enable updates to new hashing functions.

While effective, HHHash’s performance relies heavily on the characteristics of the HTTP requests, so correlations are typically only established using the same crawler parameters. Locality-sensitive hashing (LSH) could be used to calculate distances between sets of headers for more efficient comparisons. There are some limitations with some LSH algorithms (such as the need to pad content to a minimum byte length) that make the initial use of SHA256 hashes a bit more straightforward.

Alexandre made a Python library for it, and I cranked out an R package for it as well.

There are three functions exposed by {hhhash}:

  • build_hash_from_response: Build a hash from headers in a curl
    response object
  • build_hash_from_url: Build a hash from headers retrieved from a URL
  • hash_headers: Build a hash from a vector of HTTP header keys

The build_hash_from_url function relies on {curl} vs {httr} since {httr} uses curl::parse_headers() which (rightfully so) lowercases the header keys. We need to preserve both order and case for the hash to be useful.

Here is some sample usage:

remotes::install_github("hrbrmstr/hhhash")

library(hhhash)

build_hash_from_url("https://www.circl.lu/")
## [1] "hhh:1:78f7ef0651bac1a5ea42ed9d22242ed8725f07815091032a34ab4e30d3c3cefc"

res <- curl::curl_fetch_memory("https://www.circl.lu/", curl::new_handle())

build_hash_from_response(res)
## [1] "hhh:1:78f7ef0651bac1a5ea42ed9d22242ed8725f07815091032a34ab4e30d3c3cefc"

c(
  "Date", "Server", "Strict-Transport-Security",
  "Last-Modified", "ETag", "Accept-Ranges",
  "Content-Length", "Content-Security-Policy",
  "X-Content-Type-Options", "X-Frame-Options",
  "X-XSS-Protection", "Content-Type"
) -> keys

hash_headers(keys)
## [1] "hhh:1:78f7ef0651bac1a5ea42ed9d22242ed8725f07815091032a34ab4e30d3c3cefc"
To leave a comment for the author, please follow the link and comment on their blog: rud.is"In God we trust. All others must bring data"R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)