Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’ve blogged a bit about robots.txt, the rules file that documents a site’s “robots exclusion” standard and tells web crawlers what they can and cannot do (and how frequently they may do things when they are allowed to). This is a well-known and well-defined standard, but it is not mandatory and is often ignored by crawlers and content owners alike.
There’s an emerging IETF draft for a different type of site metadata that content owners should absolutely consider adopting. This one defines “web security policies” for a given site and has much in common with the robots exclusion standard, including the name (security.txt) and the format (policy directives are defined with a simple syntax; see Chapter 5 of the Debian Policy Manual).
One core difference is that this file is intended for humans. If you are a general user who visits a site and notices something “off” (security-wise), or if you are an honest, honorable security researcher who found a vulnerability or weakness on a site, this security.txt file should make it easier to contact the appropriate folks at the site to help them identify and resolve security issues. The abstract of the IETF draft summarizes this intent well.
A big change from robots.txt is where the security.txt file goes. The IETF standard is still in draft state, so the location may change, but the current thinking is to have it go into /.well-known/security.txt rather than the top-level root (i.e. it’s not supposed to be in /security.txt). If you aren’t familiar with the .well-known directory, give RFC 5785 a read.
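To make the location rule concrete, here is a minimal base R sketch that turns a site URL into its /.well-known/security.txt address. The helper name well_known_url is made up for illustration and is not part of any package; it also assumes https when no scheme is supplied, and the path itself may change while the standard is still a draft.

```r
# Hypothetical helper (illustration only): build the
# /.well-known/security.txt URL for one or more sites using base R.
well_known_url <- function(site) {
  # Assumption: default to https when the input has no scheme
  site <- ifelse(grepl("^https?://", site), site, paste0("https://", site))
  # Keep only scheme + host (and port, if any); drop path, query and fragment
  origin <- sub("^(https?://[^/?#]+).*$", "\\1", site)
  paste0(origin, "/.well-known/security.txt")
}

urls <- well_known_url(c("https://rud.is/b", "example.com/some/page"))
print(urls)
```

Note that the file hangs off the site origin, so any path in the input (such as /b) is discarded rather than extended.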
You can visit the general information site to find out more and to install a development version of a Chrome extension that makes it easier to pull up this info in your browser if you find an issue.
Here’s the security.txt for my site:
Contact: bob@rud.is
Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
Disclosure: Full
With that info, you know where to contact me, have the ability to encrypt your message and know that I’ll give you credit and will disclose the bugs openly.
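Because each directive is a simple "Key: value" line, a file like the one above can be pulled apart with a few lines of base R. The sketch below is only an illustration of the format, not how any package implements parsing; the function name parse_security_txt is hypothetical, and it assumes that blank lines and lines starting with "#" can be ignored.

```r
# Illustrative base R parser for "Key: value" directives.
# Hypothetical helper; not taken from any package.
parse_security_txt <- function(lines) {
  lines <- trimws(lines)
  # Assumption: blank lines and "#" comment lines carry no directives
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  # Keep only lines that look like "Key: value"
  lines <- lines[grepl("^[^:]+:", lines)]
  # Split on the first colon only, since values such as URLs contain colons
  key <- tolower(trimws(sub(":.*$", "", lines)))
  value <- trimws(sub("^[^:]*:", "", lines))
  data.frame(key = key, value = value, stringsAsFactors = FALSE)
}

txt <- c(
  "Contact: bob@rud.is",
  "Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399",
  "Disclosure: Full"
)
policy <- parse_security_txt(txt)
print(policy)
```

Splitting on the first colon only is the one detail worth getting right, since the Encryption value is itself a URL.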
So, Why the [R] tag?
Ah, yes. This post is in the R RSS category feed for a reason. I do at-scale analysis of the web for a living and will be tracking the adoption of security.txt across the internet (initially with the Umbrella Top 1m and a choice list of sites with more categorical data associated with them) over time. My esteemed colleague @jhartftw is handling the crawling part, but I needed a way to speedily read in these files for a broader analysis. So, I made an R package: securitytxt.
It’s pretty easy to use. Here’s how to install it and use one of the functions to generate a security.txt target URL for a site:
devtools::install_github("hrbrmstr/securitytxt")
library(securitytxt)
(xurl <- sectxt_url("https://rud.is/b"))
## [1] "https://rud.is/.well-known/security.txt"
This is how you read in and parse a security.txt file:
(x <- sectxt(url(xurl)))
## <Web Security Policies Object>
## Contact: bob@rud.is
## Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
## Disclosure: Full
And, this is how you turn that into a usable data frame:
sectxt_info(x)
##          key value
## 1    contact bob@rud.is
## 2 encryption https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
## 3 disclosure Full
There’s also a function to validate that the keys are within the current IETF standard. That will become more useful once the standard moves out of draft status.
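As a rough idea of what such a check involves, here is a base R sketch that compares the keys found in a file against an allowed set. It is not the package’s validation function: check_keys and allowed_keys are hypothetical names, and the allowed set is an assumption limited to the three keys shown in this post, so consult the current draft for the real list.

```r
# Hypothetical validator (illustration only).
# Assumption: the allowed set is just the keys seen in this post;
# the actual draft may define more, or rename these.
allowed_keys <- c("contact", "encryption", "disclosure")

check_keys <- function(keys, allowed = allowed_keys) {
  # Compare case-insensitively, ignoring stray whitespace
  keys <- tolower(trimws(keys))
  unknown <- setdiff(keys, allowed)
  list(valid = length(unknown) == 0, unknown = unknown)
}

ok <- check_keys(c("Contact", "Encryption", "Disclosure"))
bad <- check_keys(c("Contact", "Hiring"))
print(ok$valid)
print(bad$unknown)
```

Returning the unknown keys alongside the pass/fail flag makes it easier to see which directives a site is using outside the draft while the standard is still changing.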
FIN
So, definitely adopt the standard and feel invited to kick the tyres on the package. Don’t hesitate to jump on board if you have ideas for how you’d like to extend the package, and drop a note in the comments if you have questions on it or on adopting the standard for your site.