Access the Internet Archive Advanced Search/Scrape API with wayback (+ links to a new vignette & pkgdown site)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The wayback package has been updated to retrieve mementos more efficiently and now supports working with the Internet Archive’s advanced search+scrape API.
Search/Scrape
The search/scrape interface lets you examine the IA collections and programmatically download what you are after. The main function is ia_scrape(), but you can also paginate through results with the helper functions provided.
To demonstrate, let’s peruse the IA NASA collection and then grab one of the images. First we search the collection, then choose a target URL to retrieve, and finally download it. The identifier is the key element that lets us retrieve the information about a particular item.
library(wayback)

nasa <- ia_scrape("collection:nasa", count=100L)

tibble:::print.tbl_df(nasa)
## # A tibble: 100 x 3
##    identifier addeddate            title
##    <chr>      <chr>                <chr>
##  1 00-042-154 2009-08-26T16:30:09Z International Space Station exhibit
##  2 00-042-32  2009-08-26T16:30:12Z Swamp to Space historical exhibit
##  3 00-042-43  2009-08-26T16:30:16Z Naval Meteorology and Oceanography Command …
##  4 00-042-56  2009-08-26T16:30:19Z Test Control Center exhibit
##  5 00-042-71  2009-08-26T16:30:21Z Space Shuttle Cockpit exhibit
##  6 00-042-94  2009-08-26T16:30:24Z RocKeTeria restaurant
##  7 00-050D-01 2009-08-26T16:30:26Z Swamp to Space exhibit
##  8 00-057D-01 2009-08-26T16:30:29Z Astro Camp 2000 Rocketry Exercise
##  9 00-062D-03 2009-08-26T16:30:32Z Launch Pad Tour Stop
## 10 00-068D-01 2009-08-26T16:30:34Z Lunar Lander Exhibit
## # ... with 90 more rows

(item <- ia_retrieve(nasa$identifier[1]))
## # A tibble: 6 x 4
##   file                       link                                                               last_mod          size
## 1 00-042-154.jpg             https://archive.org/download/00-042-154/00-042-154.jpg             06-Nov-2000 15:34 1.2M
## 2 00-042-154_archive.torrent https://archive.org/download/00-042-154/00-042-154_archive.torrent 06-Jul-2018 11:14 1.8K
## 3 00-042-154_files.xml       https://archive.org/download/00-042-154/00-042-154_files.xml       06-Jul-2018 11:14 1.7K
## 4 00-042-154_meta.xml        https://archive.org/download/00-042-154/00-042-154_meta.xml        03-Jun-2016 02:06 1.4K
## 5 00-042-154_thumb.jpg       https://archive.org/download/00-042-154/00-042-154_thumb.jpg       26-Aug-2009 16:30 7.7K
## 6 __ia_thumb.jpg             https://archive.org/download/00-042-154/__ia_thumb.jpg             06-Jul-2018 11:14 26.6K

download.file(item$link[1], file.path("man/figures", item$file[1]))
I just happened to know this would take me to an image. You can add the media type to the result (along with a host of other fields) to help with programmatic filtering.
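If you'd rather not rely on knowing which row holds the image, the file listing itself can be filtered by name. What follows is only a rough sketch in base R: it uses a hand-built data frame copying the file and link columns from the listing above (so it runs without hitting the API), and the "JPEG that isn't a thumbnail" rule is my own assumption about what counts as the full-size image, not something the package does for you.

```r
# Hand-copied from the ia_retrieve() output shown earlier; not a live API call.
item <- data.frame(
  file = c("00-042-154.jpg", "00-042-154_archive.torrent",
           "00-042-154_files.xml", "00-042-154_meta.xml",
           "00-042-154_thumb.jpg", "__ia_thumb.jpg"),
  stringsAsFactors = FALSE
)
item$link <- paste0("https://archive.org/download/00-042-154/", item$file)

# Assumed rule: keep files ending in .jpg/.jpeg, then drop anything
# whose name contains "thumb".
is_jpeg  <- grepl("\\.jpe?g$", item$file, ignore.case = TRUE)
is_thumb <- grepl("thumb", item$file, fixed = TRUE)
images   <- item[is_jpeg & !is_thumb, ]

print(images$link)
```

With a real ia_retrieve() result you would apply the same two grepl() filters to item$file and pass the surviving link to download.file().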
The API is still not set in stone, so you're encouraged to submit questions/suggestions.
FIN
The vignette is embedded below and frame-busted here. It covers a very helpful and practical use-case identified recently by an OP on StackOverflow.
There's also a new pkgdown-gen'd site for the package.
Issues & PRs welcome at your community coding site of choice.