I am fairly new to web scraping in R using rvest, and one of the first questions is whether a site gives permission for scraping. This information is often contained in the robots.txt file on a website. So I’m briefly going to explore the rOpenSci robotstxt package by Peter Meissner, which provides easy access to a domain’s robots.txt file from R.
I’m slowly working on a new R data package for underwater geographic feature names as part of biospolar, a Norwegian Research Council funded project on innovation involving biodiversity in marine polar areas. One of the main data sources for the package is the General Bathymetric Chart of the Oceans (GEBCO) Gazetteer. I’m also going to be bringing in data from the InterRidge database of hydrothermal vents, so I wanted to understand whether I am free to just go ahead.
The robots.txt content is advisory, and, well, we could always choose to be Dr. Evil. If my wife would let me have a cat it would definitely be called Mr. Bigglesworth. But it strikes me that building a package around a data source that tries to prohibit scraping might not be a brilliant idea.
There are a bunch of functions in the robotstxt package, but I’m just going to use the main one, robotstxt(). Take a look at the vignette for more information. For a very quick check on whether scraping a path is allowed, try the paths_allowed() function. I’ll come back to that at the end.
The first place I am going to look is the main GEBCO domain.
library(robotstxt)

gebco <- robotstxt("https://www.gebco.net")
gebco
## $domain
## [1] "https://www.gebco.net"
##
## $text
## [1] "Sitemap: https://www.gebco.net/sitemap.xml \r\n\r\nUser-agent: *\r\nHost: www.gebco.net\r\nDisallow: /cgi-bin/\r\nDisallow: /perl/\r\nDisallow: /css/\r\nDisallow: /js/\r\nDisallow: /_mm/\r\nDisallow: /_notes/\r\n\n[... 36 lines omitted ...]"
##
## $bots
## [1] "*"                "Googlebot"        "Googlebot-Image"
## [4] "Googlebot-Mobile"
##
## $comments
## [1] line    comment
## <0 rows> (or 0-length row.names)
##
## $permissions
##      field useragent     value
## 1 Disallow         * /cgi-bin/
## 2 Disallow         *    /perl/
## 3 Disallow         *     /css/
## 4 Disallow         *      /js/
## 5 Disallow         *     /_mm/
## 6 Disallow         *  /_notes/
## 7
## 8 [... 31 items omitted ...]
##
## $crawl_delay
## [1] field     useragent value
## <0 rows> (or 0-length row.names)
##
## $host
##   field useragent         value
## 1  Host         * www.gebco.net
##
## $sitemap
##     field useragent                             value
## 1 Sitemap         * https://www.gebco.net/sitemap.xml
##
## $other
## [1] field     useragent value
## <0 rows> (or 0-length row.names)
##
## $robexclobj
## <Robots Exclusion Protocol Object>
## $check
## function (paths = "/", bot = "*")
## {
##     spiderbar::can_fetch(obj = self$robexclobj, path = paths,
##         user_agent = bot)
## }
## <bytecode: 0x7fc3af22a750>
## <environment: 0x7fc3af24bef8>
##
## attr(,"class")
## [1] "robotstxt"
This returns a list from the robots.txt file, and the main bit I am interested in is the data frame under gebco$permissions, reproduced as a table below.
field | useragent | value |
---|---|---|
Disallow | * | /cgi-bin/ |
Disallow | * | /perl/ |
Disallow | * | /css/ |
Disallow | * | /js/ |
Disallow | * | /_mm/ |
Disallow | * | /_notes/ |
Disallow | * | /_baks/ |
Disallow | * | /MMWIP/ |
Disallow | Googlebot | /cgi-bin/ |
Disallow | Googlebot | /perl/ |
Disallow | Googlebot | /css/ |
Disallow | Googlebot | /js/ |
Disallow | Googlebot | /_mm/ |
Disallow | Googlebot | /_notes/ |
Disallow | Googlebot | /_baks/ |
Disallow | Googlebot | /MMWIP/ |
Disallow | Googlebot | /*templates |
Disallow | Googlebot | */log.gif |
Disallow | Googlebot | /*_baks |
Disallow | Googlebot | /*_notes |
Disallow | Googlebot | /js |
Disallow | Googlebot | *.csi |
Disallow | Googlebot | *.vcf |
Disallow | Googlebot-Image | /cgi-bin/ |
Disallow | Googlebot-Image | /perl/ |
Disallow | Googlebot-Image | /css/ |
Disallow | Googlebot-Image | /js/ |
Disallow | Googlebot-Image | /_mm/ |
Disallow | Googlebot-Image | /_notes/ |
Disallow | Googlebot-Image | /_baks/ |
Disallow | Googlebot-Image | /MMWIP/ |
Disallow | Googlebot-Image | */log.gif |
Disallow | Googlebot-Mobile | /*templates |
Disallow | Googlebot-Mobile | */log.gif |
Disallow | Googlebot-Mobile | /*_baks |
Disallow | Googlebot-Mobile | /*_notes |
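Since gebco$permissions is just a data frame, it is easy to pull out the rules for a particular user agent directly in R. A minimal sketch in base R, using the column names shown in the output above:

# rules that apply to every user agent ("*")
general_rules <- subset(gebco$permissions, useragent == "*")
general_rules$value

# rules aimed specifically at Googlebot
googlebot_rules <- subset(gebco$permissions, useragent == "Googlebot")
googlebot_rules$value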
What is of interest here are the entries under value, which can be a bit difficult to interpret. With the help of the handy Wikipedia article on the Robots Exclusion Standard I read them as follows:

- Disallow with a value of / (or *) tells bots to stay out of the website altogether.
- Disallow with a value such as /xyz/ tells bots to stay out of that specific directory.
- Disallow entries under a named useragent such as Googlebot tell that bot to stay out of either the whole website or (as in this case) specific directories.

Note that Googlebot appears to be in the naughty seat: the site is more specific about what Googlebot should stay out of, while other bots would apparently be free to enter those paths.
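A quick way to sanity-check that reading is to ask the parsed object whether a given path may be fetched, using the check() method that appears at the bottom of the robotstxt output above. The paths here are just examples:

# the site root is not disallowed, but /js/ is disallowed for all user agents
gebco$check(paths = "/", bot = "*")
gebco$check(paths = "/js/", bot = "*")

# the same check for Googlebot, which has extra rules of its own
gebco$check(paths = "/js/", bot = "Googlebot")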
However, the GEBCO data files that I am interested in are not hosted on the gebco.net domain but on the NOAA National Centers for Environmental Information domain.
noaa <- robotstxt(domain = "https://www.ngdc.noaa.gov")
noaa
## $domain
## [1] "https://www.ngdc.noaa.gov"
##
## $text
## [1] "User-agent: *\nCrawl-delay: 60\nDisallow: /cgi-bin\nDisallow: /dmsp/cgi-bin\nDisallow: /dmsp/data\nDisallow: /dmsp/include\nDisallow: /dmsp/protected\nDisallow: /eog\nDisallow: /geomag/cdroms\nDisallow: /geomag/data\n\n[... 67 lines omitted ...]"
##
## $bots
## [1] "*"
## [2] "LinkChecker"
## [3] "siteimprove"
## [4] "Mozilla/5.0(compatible;MSIE10.0;WindowsNT6.1;Trident/6.0)LinkCheckbySiteimprove.com"
## [5] "Mozilla/5.0(compatible;MSIE10.0;WindowsNT6.1;Trident/6.0)SiteCheck-sitecrawlbySiteimprove.com"
## [6] "HTMLValidatorbysiteimprove.com/1.3"
##
## $comments
## [1] line    comment
## <0 rows> (or 0-length row.names)
##
## $permissions
##      field useragent           value
## 1 Disallow         *        /cgi-bin
## 2 Disallow         *   /dmsp/cgi-bin
## 3 Disallow         *      /dmsp/data
## 4 Disallow         *   /dmsp/include
## 5 Disallow         * /dmsp/protected
## 6 Disallow         *            /eog
## 7
## 8 [... 73 items omitted ...]
##
## $crawl_delay
##         field useragent value
## 1 Crawl-delay         *    60
##
## $host
## [1] field     useragent value
## <0 rows> (or 0-length row.names)
##
## $sitemap
## [1] field     useragent value
## <0 rows> (or 0-length row.names)
##
## $other
## [1] field     useragent value
## <0 rows> (or 0-length row.names)
##
## $robexclobj
## <Robots Exclusion Protocol Object>
## $check
## function (paths = "/", bot = "*")
## {
##     spiderbar::can_fetch(obj = self$robexclobj, path = paths,
##         user_agent = bot)
## }
## <bytecode: 0x7fc3af22a750>
## <environment: 0x7fc3aee6a4e0>
##
## attr(,"class")
## [1] "robotstxt"
The NOAA robots.txt provides some different information. For example, NOAA specifies a crawl delay of 60 seconds, which tells me to build a delay of at least 60 seconds between requests.
noaa$text
## User-agent: *
## Crawl-delay: 60
## Disallow: /cgi-bin
## Disallow: /dmsp/cgi-bin
## Disallow: /dmsp/data
## Disallow: /dmsp/include
## Disallow: /dmsp/protected
## Disallow: /eog
## Disallow: /geomag/cdroms
## Disallow: /geomag/data
## Disallow: /geomag/EMM/data
## Disallow: /geomag/pmag/datafiles
## Disallow: /geomag/WMM/data
## Disallow: /globe
## Disallow: /hazard/data
## Disallow: /hazard/img
## Disallow: /IAGA/cgi-bin
## Disallow: /idb
## Disallow: /ionosonde
## Disallow: /mgg/cgi-bin
## Disallow: /mgg/curator/data
## Disallow: /mgg/curator/userfiles
## Disallow: /mgg/dat
## Disallow: /mgg/ecs/data
## Disallow: /mgg/gdas/data
## Disallow: /mgg/geology/data
## Disallow: /mgg/geology/odp/data
## Disallow: /mgg/grids/data
## Disallow: /mgg/oracle
## Disallow: /mgg/tmp
## Disallow: /mgg/trk
## Disallow: /ngdc/cgi-bin
## Disallow: /ngdc/hn
## Disallow: /ngdc/Counter
## Disallow: /ngdc/NOAAServer/adm
## Disallow: /ngdc/NOAAServer/converters
## Disallow: /ngdc/NOAAServer/gif
## Disallow: /ngdc/NOAAServer/java
## Disallow: /ngdc/NOAAServer/lib
## Disallow: /ngdc/NOAAServer_N
## Disallow: /ngdc/Store
## Disallow: /nmmr
## Disallow: /nndc
## Disallow: /paleo
## Disallow: /riwebapp/rest
## Disallow: /seg/cgi-bin
## Disallow: /stp/bin
## Disallow: /stp/cgi-bin
## Disallow: /stp/drap/data
## Disallow: /stp/include
## Disallow: /stp/image
## Disallow: /stp/images
## Disallow: /stp/include
## Disallow: /stp/iono/drap
## Disallow: /stp/iono/ustec/products
## Disallow: /stp/satellite/poes/dataaccess.html
## Disallow: /stp/satellite/goes/dataaccess.html
## Disallow: /sxi/servlet/sxibrowse
## Disallow: /sxi/servlet/sximovie
## Disallow: /sxi/servlet/sxisearch
## Disallow: /stp/IONO/ionosonde
## Disallow: /thredds
## Disallow: /wdc/cgi-bin
##
##
## User-agent: LinkChecker
## Disallow:
##
## User-agent: siteimprove
## Disallow: /
## User-agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com
## Disallow: /
## User-agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) SiteCheck-sitecrawl by Siteimprove.com
## Disallow: /
## User-agent: HTML Validator by siteimprove.com/1.3
## Disallow: /
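As a minimal sketch, respecting that Crawl-delay: 60 when fetching several pages might look something like the following. The URL vector is just a placeholder for whatever pages the package ends up requesting:

library(rvest)  # for read_html()

# placeholder list of pages to fetch
urls <- rep("https://www.ngdc.noaa.gov/gazetteer/", 2)

# pull the delay out of the parsed robots.txt rather than hard-coding it
delay <- as.numeric(noaa$crawl_delay$value)

pages <- vector("list", length(urls))
for (i in seq_along(urls)) {
  pages[[i]] <- read_html(urls[i])
  Sys.sleep(delay)  # wait 60 seconds between requests, as the site asks
}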
The robots.txt above also lists the disallowed directories. The directory I am interested in for the package, https://www.ngdc.noaa.gov/gazetteer/, is not on the list, so I think I am free to go ahead… yay!
If I wanted to do this more quickly I would use the paths_allowed() function.
paths_allowed("https://www.ngdc.noaa.gov/gazetteer/") ## [1] TRUE
So, there we have it. If we prefer to be good web scraping citizens rather than the Dr. Evil of web scraping in R, then the robotstxt package will help us out. On the other hand we could just be evil and see what happens. I’m off to stroke Mr. Bigglesworth.