When rvest is not enough
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
An introduction to CasperJS for R users
Web scraping is a data mining technique that transforms unstructured data (widely available on the internet in the form of webpages) into structured datasets. It mostly exploits the fact that webpages have some sort of structure, given by a markup language such as XML. An XML file organizes content inside nested nodes (or tags, when talking about HTML), all of which have attributes and contents.
Rvest Limitations
Hadley Wickham’s rvest
is an excellent tool for scraping websites. All it takes is to provide the URL of the site, the nodes of interest, and which attributes to extract from those nodes. As an example, we can scrape all the links to my socials on the main page of my website with a few lines of code:
library(rvest)

'http://giovannikraushaar.ch' %>%
  read_html() %>%
  html_node('div#home-social') %>%
  html_nodes('a') %>%
  html_attr('href')
This workflow works wonderfully as long as the web page is static. Nevertheless, nowadays many pages are dynamic. This means that:
in the case of a server-side dynamic webpage, it might be necessary to attach some cookies to the request (for instance the identifying ones sent with the login);
in the case of a client-side dynamic webpage, the resulting HTML file has some JavaScript embedded in it (delimited by the
<script>...</script>
tags).
rvest
can handle the server-side issue to some degree through sessions, but it is completely unable to deal with JavaScript. An interesting package is V8
, which provides an R interface to Google's open-source JavaScript engine of the same name. Sadly, V8
is not the solution, as it does not parse the entire page and render every JS script in it. Instead, it requires the user to isolate a single script and feed it to the engine. This approach can ultimately be used when the script that outputs the piece of information needed is known, and the element to evaluate is not too complex. An example is the following code1:
library(V8)
engine <- v8()

'https://food.list.co.uk/place/22191-brewhemia-edinburgh/' %>%
  read_html() %>%
  html_nodes('li') %>%
  html_nodes('script') %>%
  html_text() %>%
  gsub('document.write', '', .) %>%
  engine$eval() %>%
  read_html() %>%
  html_text()
CasperJS
The following steps should run “as they are” on macOS, and with little modification on other Unix-like operating systems. On Windows, paths and procedure may change significantly, although the commands for scripting CasperJS should be the same.
What is really needed in these cases is a headless browser that is scriptable, capable of rendering JavaScript, and able to store and send cookies. PhantomJS does an excellent job providing all these features. Furthermore, CasperJS improves the user experience by providing some higher-level bindings to PhantomJS.
Installation
On Unix-like systems, simply download PhantomJS from the official download page and place it into a folder belonging to the system PATH
, for instance /usr/local/bin/
. On macOS, the shell code is the following:
curl -fsSL -o $TMPDIR/phantomjs-2.1.1-macosx.zip \
  https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-macosx.zip
unzip -qjoC $TMPDIR/phantomjs-2.1.1-macosx.zip */bin/phantomjs -d /usr/local/bin/
Next, to install CasperJS, download it into an installation folder of your choice (here I use $HOME/Library/
, although this will work only on macOS since Library
is a Darwin-specific directory) and then symlink the main executable to a folder belonging to the system PATH
.
curl -fsSL -o $TMPDIR/casperjs-1.1.4-2.zip \
  https://github.com/casperjs/casperjs/archive/1.1.4-2.zip
unzip -q $TMPDIR/casperjs-1.1.4-2.zip -d $HOME/Library/
ln -s $HOME/Library/casperjs-1.1.4-2/bin/casperjs /usr/local/bin/casperjs
To check that everything works smoothly, it is enough to run the following (it takes around 1–2 minutes):
$ casperjs selftest
Scripting
CasperJS is scripted via JavaScript. All it takes is to call the program from the terminal along with a file (script.js
in this example) containing a set of instructions.
$ casperjs script.js
(of course, assuming that script.js
is in the current working directory).
Basics
The instruction script must first of all import the casper
module. Then it can start surfing from the URL declared with .start()
. At the end of the script it is also mandatory to call the .run()
method.
// script.js
// This is the skeleton of a casperjs script.

var casper = require('casper').create(); // import module

casper.start(); // start browsing
casper.run();   // execute the script
After .start()
, additional operations to execute on the currently opened page are given as a function, either when calling .start()
itself or later via the .then()
method. Another page can be opened with .thenOpen()
without leaving the session.
Script 1 – Capture page’s screenshot
A screenshot of the webpage can be taken with this.capture()
, so that it is possible to see how CasperJS renders it. The script below will create a new screenshot.png
image in the working directory.
Also, with some websites it is better not to go completely headless (especially when taking screenshots), or they may load the mobile version.
// script.js
// Save a screenshot of the webpage

var casper = require('casper').create({
  viewportSize: {width: 1600, height: 1200}, // display size (a top-level option, not a page setting)
  pageSettings: {
    userAgent: 'Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/69.0'
  }
});

casper.start('https://epaper.20minuti.ch/?locale=it#read/650/Ticino/2019-11-04/1');
casper.wait(1500); // give the browser a few moments to load the epaper and all pictures (in ms)

casper.then(function(){
  this.capture('screenshot.png');
});

casper.run();

$ casperjs script.js
In most cases, CasperJS waits until the page is fully loaded before executing the next task in the script. However, two cases come to mind in which the only solution I found is to arbitrarily set a waiting time:
- Some functions run asynchronously, meaning that the next task will start right after the previous one, without waiting for its output.
- Some sites show a loading page before the fully loaded page, which may be deceiving for Casper. Comment out the
casper.wait()
line in the above script and re-run it, to see how the screenshot changes.
Script 2 – Export HTML after JavaScript rendering
Most of the time, when we scrape a webpage, we look for some information inside a node. CasperJS can print the final HTML page to a file or to stdout. The output can then be passed to R and analyzed using rvest
as if it were a static website. To be fair, I should say that CasperJS can extract attributes and text from nodes as well, but, since this guide is addressed to R users, I will try to use R whenever possible. It should also be noted that only scripts that would be rendered at opening time will be evaluated; others may require some additional clicks (for instance, if the searched element is behind a drop-down menu, or if some elements are divided into tabs).
The following script will send the HTML code to stdout.
// script.js
// Get HTML from rendered page

var casper = require('casper').create();

casper.start('https://daroczig.shinyapps.io/rinfinance_Berlinger-Daroczi-demo/');
casper.wait(5000); // the shiny server takes a while to load
// Here too it is necessary to wait manually, because casper considers the page
// fully loaded when it is downloaded, not when it is evaluated.
// Comment the line above out and take a screenshot to see the difference.

casper.then(function(){
  var html = this.getHTML(); // save HTML to a variable
  this.echo(html);           // print HTML to stdout
});

casper.run();
Try it in bash:
$ casperjs script.js
At this point the output can be captured by calling CasperJS from R with system()
. Collapsing with \n
as separator is necessary because system()
returns a vector in which each element is a line of the console output (similarly to what readLines()
does with a file).
library(xml2)

cmd  <- 'casperjs script.js'
html <- system(cmd, intern = TRUE)
html <- paste(html, collapse = '\n')
html <- xml2::read_html(html)
From here the html
object can be analyzed using rvest
as if it were a static site.
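For instance, once parsed, the rendered page can be queried exactly like a static one. Below is a minimal sketch of such a follow-up analysis; the HTML string (a div#prices node with two values) is invented for illustration and merely stands in for the CasperJS output parsed above:

```r
library(rvest)
library(xml2)

# Stand-in for the rendered page; in the real workflow 'html' comes from
# read_html() on the casperjs output, as shown above
html <- read_html('<div id="prices">
  <span class="price">10.5</span><span class="price">12.0</span>
</div>')

# The usual static-site rvest workflow applies unchanged
prices <- html %>%
  html_nodes('span.price') %>%
  html_text() %>%
  as.numeric()

prices
# [1] 10.5 12.0
```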
Script 3 - Auto customize the CasperJS script
When there are many URLs to scrape and the only change in the script is the target address, instead of making a new script for each site, it may be more reasonable to make a template script and then interpolate each link into it before running. If using stringr
for the interpolation2, the template could look like this:
// template.js
// THIS IS A TEMPLATE FILE!
// Placeholders must be replaced with working values before being usable.

var casper = require('casper').create();

casper.start('${url}', function(){
  this.echo(this.getHTML());
});

casper.run();
Back in R, when calling stringr::str_interp
, the character sequence ${url}
is replaced with the value of the object url
found in the closest parent environment or in the supplied one (a list works as well).
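As a minimal illustration of that substitution, with the value supplied through a list (the string below mimics a single line of the template):

```r
library(stringr)

# The sequence ${url} is looked up in the supplied list and substituted
line <- str_interp('casper.start("${url}");',
                   env = list(url = 'https://www.google.com/'))
line
# [1] "casper.start(\"https://www.google.com/\");"
```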
library(stringr)

f  <- file('template.js', 'r')
js <- readLines(f)
js <- paste0(js, collapse = '\n')
js <- str_interp(js, env = list(url = 'https://www.google.com/'))
cat(js, file = 'tmp_script.js') # save interpolated script to a temporary file called tmp_script.js
close(f)                        # close connection to the template file
Running the above code chunk generates tmp_script.js
, which looks like this:
var casper = require('casper').create();

casper.start('https://www.google.com/', function(){
  this.echo(this.getHTML());
});

casper.run();
This script can be run, the same way as before, with system('casperjs tmp_script.js', intern = TRUE)
.
Lastly, a function can be made to iterate more easily over a list of URLs.
library(magrittr) # provides the %>% pipe used below
library(stringr)

casper <- function(template_script, ...){
  # Interpolate a CasperJS script and run it.
  # Use the ellipsis (...) to pass the variables to interpolate.

  # Interpolate and save to a temporary file
  f  <- file(template_script, 'r')
  js <- readLines(f) %>%
    paste0(collapse = '\n') %>%
    str_interp(env = list(...))
  close(f)
  cat(js, file = 'tmp_script.js')

  # Run command
  cmd <- 'casperjs tmp_script.js'
  out <- system(cmd, intern = TRUE) %>% paste(collapse = '\n')

  # Remove temporary script
  system('rm tmp_script.js')

  return(out)
}

# Websites to render
urls <- c(
  'https://www.google.com/',
  'https://www.yahoo.com/'
)

htmls <- lapply(urls, function(x) casper(template_script = 'template.js', url = x))
Again, the loop could be performed directly in CasperJS, but I try to use R where possible. Moreover, if the URLs to scrape are generated from a query earlier in the workflow, this approach integrates better.
Script 4 - Login
It may happen that the target page is behind a login screen. In CasperJS it is possible to fill login forms, emulate the “submit” click, and then move to another page, always within the same browser session. What the program does is store all the cookies it gets and attach them to every request it sends during that session.
This is a section of the GitLab login page, more precisely the login form.
<form class="new_user gl-show-field-errors" id="new_user" ...>
  ...
  <input class="form-control top" ... name="user[login]" id="user_login">
  ...
  <input class="form-control bottom" ... type="password" name="user[password]" id="user_password">
  ...
  <input class="remember-me-checkbox" ... name="user[remember_me]" id="user_remember_me">
  ...
  <input type="submit" name="commit" value="Sign in" class="btn btn-success" ...>
</form>
When filling a form, the important nodes are form
and input
. These nodes can be identified by the attributes id
or name
(both are not always available). The syntax for referring to them is slightly different:
- <node>#<id> for id;
- <node>[name="<name>"] for name.
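The same two selector syntaxes work in rvest, which makes them easy to try out from R. A quick sketch, using a miniature version of the GitLab form above (stripped down to the attributes actually queried):

```r
library(rvest)
library(xml2)

# Miniature version of the gitlab login form shown above
form <- read_html('<form id="new_user">
  <input name="user[login]" id="user_login">
</form>')

# <node>#<id> syntax
form %>% html_node('form#new_user') %>% html_attr('id')
# [1] "new_user"

# <node>[name="<name>"] syntax
form %>% html_node('input[name="user[login]"]') %>% html_attr('name')
# [1] "user[login]"
```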
The code below includes some examples of both cases. It is used to log into gitlab.com. I am also taking a screenshot (logged.png
) to verify whether the login was successful.
// login.js

var casper = require('casper').create();

casper.start('https://gitlab.com/users/sign_in');

casper.then(function(){
  this.fillSelectors('form#new_user', {          // using the id attribute
    'input[name="user[login]"]'    : '*****',    // using the name attribute
    'input[name="user[password]"]' : '*****',    // using the name attribute
    'input#user_remember_me'       : true        // using the id attribute
  }, true); // 'true' submits the form once it has been filled in
});

// wait 1 second to allow authentication and redirection to the main page
casper.wait(1000);

casper.then(function(){
  this.capture('logged.png'); // look for it in the working directory
});

casper.run();
After a successful login, it is possible to browse as an authorized user with casper.thenOpen()
. However, once the session is closed (i.e. when the script finishes running), the cookies get deleted.
Script 5 - Get and send cookies
In case many scripts need to use the same credentials without re-doing the login in each one of them, it is possible to store and load cookies in and from a cookiejar. The perceived effect is that of staying inside the same session. This is not done inside the script, but by passing an arbitrary .txt
file when the casperjs
command is executed. This file is simultaneously read and written by casperjs, meaning that:
- if it does not exist, it gets created;
- if a website requests a cookie and there is a corresponding entry in the file, it is loaded from there;
- if the session gets a cookie, the corresponding entry in the file is replaced if it exists, or a new entry is created if it does not.
The last script above can now be split in two. Nevertheless, the resulting screenshot remains the same, as long as I pass the flag --cookies-file
to casperjs
(below I call the file simply gitlab_cookies.txt
).
// login.js

var casper = require('casper').create();

casper.start('https://gitlab.com/users/sign_in');

casper.then(function(){
  this.fillSelectors('form#new_user', {
    'input[name="user[login]"]'    : '*****',
    'input[name="user[password]"]' : '*****',
    'input#user_remember_me'       : true
  }, true);
});

casper.wait(1000);

casper.run();

// screenshot.js

var casper = require('casper').create();

casper.start('https://gitlab.com/', function(){
  this.capture('logged.png');
});

casper.run();

$ casperjs --cookies-file=gitlab_cookies.txt login.js
$ casperjs --cookies-file=gitlab_cookies.txt screenshot.js
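Staying with the R-centric approach used so far, the same two invocations can also be launched from R with system(), so the whole login-then-scrape sequence lives in one workflow. This is only a sketch: it assumes casperjs is installed and that login.js and screenshot.js from above sit in the working directory.

```r
# Populate the cookiejar with the login script first,
# then reuse it in the screenshot script to stay "inside" the session
system('casperjs --cookies-file=gitlab_cookies.txt login.js')
system('casperjs --cookies-file=gitlab_cookies.txt screenshot.js')
```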
Script 6 - Download files
CasperJS is by no means the most efficient way of downloading3 files. However, if access to the file requires sending some cookies and a cookiejar is already available, it can perform this task as well. The method for that is simply .download()
.
// download.js

var casper = require('casper').create();

casper.start('https://gitlab.com', function(){
  this.download('https://raw.githubusercontent.com/casperjs/casperjs/master/README.md',
                'casperjs-readme.md');
});

casper.run();

$ casperjs --cookies-file=my_cookies.txt download.js
Future development
Unfortunately, PhantomJS and CasperJS have been unmaintained since early 2018. However, they seem to cope well so far. Despite this, it is worth noting that CasperJS is only one possible headless browser. This repository on GitHub lists many of them, although not all are suitable for this kind of task. To me, Puppeteer and CEF Python (the latter can be integrated into an R workflow with reticulate) look like the most promising alternatives.
This was only a small example of how CasperJS can be scripted. It offers many other methods, which can be consulted on GitHub or on the official website.
1. Code from datascienceplus.com.
2. Alternatively, glue can also be used for this task, but in that case the argument tagging would change slightly.
3. Staying inside R, the curl package offers a valid alternative, and eventually a handle from an rvest session can be passed to it.