When rvest is not enough

r on Giovanni Kraushaar

2 years ago

[This article was first published on r on Giovanni Kraushaar, and kindly contributed to R-bloggers].

An introduction to CasperJS for R users

Web scraping is a data mining technique that turns unstructured data (widely available on the internet in the form of webpages) into structured datasets. This is mostly done by exploiting the fact that webpages have some sort of structure, in the form of an XML-like markup language. An XML file organizes content inside nested nodes (or tags, when talking about HTML), each of which can have attributes and contents.

Rvest Limitations

Hadley Wickham’s rvest is an excellent tool for scraping websites. All it takes is to provide the URL of the site, the nodes of interest, and the attributes to extract from those nodes. To give an example, we can scrape all the links to my social profiles on the main page of my website with a few lines of code:

library(rvest)

'http://giovannikraushaar.ch' %>% 
  read_html() %>%
  html_node('div#home-social') %>%
  html_nodes('a') %>%
  html_attr('href')

This workflow works wonderfully as long as the web page is static. Nevertheless, nowadays there are many pages that are dynamic. This means that:

in the case of a server-side dynamic webpage, it might be necessary to attach some cookies to the request (for instance the identifying ones sent with the login);
in the case of a client-side dynamic webpage, the resulting HTML file has some JavaScript embedded in it (delimited by the <script>...</script> tags).

rvest can handle the server-side issue to some degree through sessions, but it is completely unable to deal with JavaScript. An interesting package is V8, which provides an R interface to Google’s open source JavaScript engine of the same name. Sadly, V8 is not the solution, as it does not parse the entire page and run every script in it. Instead, it requires the user to isolate a single script and feed it to the engine. This approach can ultimately be used when the script that outputs the piece of information needed is known and the element to evaluate is not too complex. An example is the following code¹:

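library(rvest)
library(V8)
engine <- v8()

'https://food.list.co.uk/place/22191-brewhemia-edinburgh/' %>%
  read_html() %>% 
  html_nodes('li') %>% 
  html_nodes('script') %>% 
  html_text() %>%                                  # source text of the inline script
  gsub('document.write', '', ., fixed = TRUE) %>%  # keep only the argument of document.write()
  engine$eval() %>%                                # evaluate it to obtain the HTML string
  read_html() %>%
  html_text()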
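
To see why stripping document.write is enough, the sketch below reproduces the same substitution in plain JavaScript, runnable with Node.js and its built-in vm module rather than the V8 R package. The inline script text is a made-up example, not the actual content of the page scraped above.

```javascript
const vm = require('vm');

// Hypothetical inline script, similar in shape to what a page may embed
const scriptText =
  "document.write('<a href=\"mailto:info@example.com\">info@example.com</a>')";

// Same substitution as gsub('document.write', '', .) in the R code
const expression = scriptText.replace('document.write', '');

// What remains is a parenthesised string expression,
// so evaluating it simply returns the HTML fragment
const html = vm.runInNewContext(expression);

console.log(html);
```

Once document.write is gone there is no reference to the browser's document object left, which is why a bare JavaScript engine can evaluate the remainder. Scripts that touch the DOM in any other way would fail with this approach.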

CasperJS

The following steps should run “as they are” on macOS and with little modification on other UNIX-like operating systems. On Windows, paths and procedures may differ significantly, although the commands for scripting CasperJS should be the same.

What is really needed in these cases is a headless browser which is scriptable, capable of rendering JavaScript, and able to store and send cookies. PhantomJS does an excellent job of providing all these features. Furthermore, CasperJS improves the user experience by providing some higher-level bindings to PhantomJS.

Installation

On Unix-like systems, simply download PhantomJS from the official download page and place it in a folder belonging to the system PATH, for instance /usr/local/bin/. On macOS, the shell code is the following:

curl -fsSL -o $TMPDIR/phantomjs-2.1.1-macosx.zip \
https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-macosx.zip
unzip -qjoC $TMPDIR/phantomjs-2.1.1-macosx.zip */bin/phantomjs -d /usr/local/bin/

Next, to install CasperJS, download it into an installation folder of your choice (here I use $HOME/Library/, although this will work only on macOS since Library is a directory specific to Darwin) and then symlink the main executable to a folder belonging to the system PATH.

curl -fsSL -o $TMPDIR/casperjs-1.1.4-2.zip https://github.com/casperjs/casperjs/archive/1.1.4-2.zip
unzip -q $TMPDIR/casperjs-1.1.4-2.zip -d $HOME/Library/
ln -s $HOME/Library/casperjs-1.1.4-2/bin/casperjs /usr/local/bin/casperjs

To check that everything works smoothly it is enough to run (it takes around 1-2 minutes):

$ casperjs selftest

Scripting

CasperJS can be scripted via JavaScript. All it takes is to call the program from the terminal along with a file (script.js in this example) containing a set of instructions.

$ casperjs script.js

(of course assuming that script.js is in the current working directory).

Basics

The instruction script must first of all import the casper module. Then it can start browsing the URL declared with .start(). At the end of the script it is also mandatory to call the .run() method.

// script.js
// This is the skeleton of a casperjs script.

var casper = require('casper').create();  // import module

casper.start();  // start browsing

casper.run();  // execute the script

After .start(), additional operations to execute on the currently opened page are given as a function, either when calling .start() or later through the .then() method. Another page can be opened with .thenOpen() without leaving the session.

Script 1 – Capture page’s screenshot

A screenshot of the webpage can be taken with this.capture() so that it can be visually seen how CasperJS renders it. The script below will create a new screenshot.png image in the working directory.

Also, it is better not to go completely headless with some websites (especially when taking screenshots), that is, to declare a display size and a user agent, or they may load the mobile version.

// script.js
// Save a screenshot of the webpage

var casper = require('casper').create({
    viewportSize: {width: 1600, height: 1200},  // display size (a top-level option, not a page setting)
    pageSettings: {
        userAgent: 'Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/69.0'
    }
});

casper.start('https://epaper.20minuti.ch/?locale=it#read/650/Ticino/2019-11-04/1');

casper.wait(1500); // give the browser some moments to load the epaper and all pictures (in ms)

casper.then(function(){
  this.capture('screenshot.png');
});

casper.run();

$ casperjs script.js

In most cases, CasperJS waits until the page is fully loaded before executing the next task in the script. However, two cases come to my mind in which the only solution I found is to arbitrarily set a waiting time.

Some functions run asynchronously, meaning that the next task will start right after the previous one without waiting for its output.
Some sites show a loading page before showing the fully loaded page, which may be deceiving for Casper. Comment out the casper.wait() line in the above script and re-run it to see how the screenshot changes.

Script 2 – Export HTML after JavaScript rendering

Most of the time, when we scrape a webpage we look for some information inside a node. CasperJS can print the final HTML page to a file or to stdout. The output can then be passed to R and analyzed using rvest as if it were a static website. To be fair, I should say that CasperJS can extract attributes and text from nodes as well but, since this guide is addressed to R users, I will try to use R whenever that is possible. It should also be noted that only scripts that would be rendered at opening time will be evaluated; others may require some additional clicks (for instance if the searched element is behind a drop-down menu or if some elements are divided into tabs).

The following script will send the HTML code to stdout.

// script.js
// Get HTML from rendered page

var casper = require('casper').create();

casper.start('https://daroczig.shinyapps.io/rinfinance_Berlinger-Daroczi-demo/');

casper.wait(5000);  // shiny server takes a while to load
// Here, too, it is necessary to wait manually, because casper considers the page fully loaded once it has been downloaded, not once it has been evaluated.
// Comment the line above out and take a screenshot to see the difference.


casper.then(function(){
  var html = this.getHTML()  // save HTML to a variable
  this.echo(html)  // print HTML to stdout
});

casper.run();

Try it in bash:

$ casperjs script.js

At this point the output can be captured by calling CasperJS from R with system(). Collapsing using \n as separator is necessary as system() returns a vector in which each element is a line of the console output (similarly to what readLines() does with a file).

library(xml2)

cmd <- 'casperjs script.js'
html <- system(cmd, intern = TRUE)
html <- paste(html, collapse = '\n')
html <- xml2::read_html(html)

From here the html object can be analyzed using rvest as if it were a static site.

Script 3 – Auto customize the CasperJS script

When there are many URLs to scrape and the only change in the script is the target address, instead of making a new script for each site, it may be more reasonable to make a template script and then interpolate each link into the script before running it. If using stringr for the interpolation², the template could look like this:

// template.js
// THIS IS A TEMPLATE FILE! 
// Placeholders must be replaced with working values before being usable.

var casper = require('casper').create();

casper.start('${url}', function(){
    this.echo(this.getHTML())
});

casper.run();

Back in R, when calling stringr::str_interp, the character sequence ${url} is replaced with the value of the object url in the closest parent environment or in the supplied one (a list works as well).

library(stringr)

f <- file('template.js', 'r')
js <- readLines(f)
js <- paste0(js, collapse = '\n')

js <- str_interp(js, env = list(url = 'https://www.google.com/'))

cat(js, file = 'tmp_script.js')  # save interpolated script to a temporary file called tmp_script.js 

close(f)  # close connection to the template file

Running the above code chunk generates tmp_script.js which looks like this

var casper = require('casper').create();

casper.start('https://www.google.com/', function(){
    this.echo(this.getHTML())
});

casper.run();

This script can be run, the same way as before, with system('casperjs tmp_script.js', intern = TRUE).
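
Conceptually, the interpolation is a plain text substitution. The sketch below mimics it in JavaScript (runnable with Node.js) for the simple case of a bare name inside ${}. The interpolate function is a hypothetical helper written for illustration only; it covers just a fraction of what str_interp does and is not part of CasperJS or stringr.

```javascript
// Hypothetical helper: replace every ${name} with the matching value
function interpolate(template, values) {
  return template.replace(/\$\{(\w+)\}/g, function (match, name) {
    // leave the placeholder untouched when no value is supplied
    return Object.prototype.hasOwnProperty.call(values, name) ? values[name] : match;
  });
}

// Same content as template.js, kept in ordinary quoted strings
// so that ${url} stays a literal placeholder
var template = [
  "var casper = require('casper').create();",
  "",
  "casper.start('${url}', function(){",
  "    this.echo(this.getHTML())",
  "});",
  "",
  "casper.run();"
].join('\n');

var script = interpolate(template, { url: 'https://www.google.com/' });
console.log(script);
```

Because the template is treated as text, any URL containing a single quote would break the generated script, so the values should be checked or escaped before interpolating them.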

Finally, a function can be written to iterate more easily over a list of URLs.

library(stringr)

casper <- function(template_script, ...){
  # Interpolate a CasperJS script and run it.
  # Use the ellipsis (...) to pass the variables to interpolate
  
  # Interpolate and save to tmp file
  f <- file(template_script, 'r')
  
  js <- readLines(f) %>% 
    paste0(collapse = '\n') %>%
    str_interp(env = list(...))
  close(f)
  
  cat(js, file = 'tmp_script.js')
  
  
  # Run command
  cmd <- 'casperjs tmp_script.js'
  out <- system(cmd, intern = TRUE) %>% 
    paste(collapse = '\n')
  
  # Remove temporary script
  file.remove('tmp_script.js')  # portable alternative to calling rm through the shell
  
  return(out)
}


# Websites to render
urls <- c(
  'https://www.google.com/',
  'https://www.yahoo.com/'
)

htmls <- lapply(urls, function(x) casper(template_script = 'template.js', url = x))

Again, a for loop could be performed directly in CasperJS, but I try to use R where possible. Moreover, if the URLs to scrape are generated by a query earlier in the workflow, this approach integrates better.

Script 4 – Login

It may happen that the target page is behind a login screen. In CasperJS it is possible to fill in login forms, emulate the “submit” click, and then move to another page, always within the same browser session. What the program does is store all the cookies it gets and attach them to every request it sends during that session.

This is a section of the GitLab login page, more precisely the login form.

<form class="new_user gl-show-field-errors" id="new_user" ...>...
  ...
    <input class="form-control top" ... name="user[login]" id="user_login">
  ...
  ...
    <input class="form-control bottom" ... type="password" name="user[password]" id="user_password">
  ...
  ...
    <input class="remember-me-checkbox" ... name="user[remember_me]" id="user_remember_me">
  ...
 <input type="submit" name="commit" value="Sign in" class="btn btn-success" ...>
</form>

When filling in a form, the important nodes are form and input. These nodes can be identified by the attributes id or name (both are not always available). The syntax for referring to them is slightly different:

<node>#<id> for id
<node>[name="<name>"] for name
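
The two selector forms can be spelled out with a couple of helper functions. This is only an illustrative sketch in plain JavaScript (runnable with Node.js): byId and byName are hypothetical helpers, not part of CasperJS, and the attribute values are those of the form shown above.

```javascript
// Hypothetical helpers that build the two kinds of CSS selectors
function byId(node, id) {
  return node + '#' + id;
}

function byName(node, name) {
  return node + '[name="' + name + '"]';
}

var formSelector = byId('form', 'new_user');
var loginSelector = byName('input', 'user[login]');
var rememberSelector = byId('input', 'user_remember_me');

console.log(formSelector);      // form#new_user
console.log(loginSelector);     // input[name="user[login]"]
console.log(rememberSelector);  // input#user_remember_me
```

Note that the name values contain square brackets, which is why they are wrapped in double quotes inside the attribute selector.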

The code below includes some examples of both cases. It is used to log in to gitlab.com. I am also taking a screenshot (logged.png) to verify whether the login was successful or not.

// login.js

var casper = require('casper').create();

casper.start('https://gitlab.com/users/sign_in');

casper.then(function(){
	this.fillSelectors('form#new_user', {        // using id attr
    	'input[name="user[login]"]':'*****',     // using name attr
    	'input[name="user[password]"]':'*****',  // using name attr
    	'input#user_remember_me':true            // using id attr
    }, true);   // 'true' to submit the form once it's been filled up 
});

// wait 1 second to allow authentication and redirection to the main page
casper.wait(1000);

casper.then(function(){
	this.capture('logged.png') // look for it in the working directory
});

casper.run();

After a successful login, it is possible to browse as an authorized user with casper.thenOpen(). However, once the session is closed (i.e. when the script finishes running), the cookies get deleted.

Script 5 – Get and send cookies

In case many scripts need to use the same credentials, instead of redoing the login in each one of them, it is possible to store cookies in, and load them from, a cookie jar. The perceived effect is that of being inside the same session. This is not done inside the script, but by passing an arbitrary .txt file when the casperjs command is executed. This file is both read and written by casperjs, meaning that:

if it does not exist, it gets created;
if a cookie is requested by a website and there is a corresponding entry in the file, it is loaded from there;
if the session gets a cookie, the corresponding entry in the file is replaced if it exists, or a new entry is created if it does not.

The last script above can now be split in two. Nevertheless, the resulting screenshot remains the same, as long as I pass the flag --cookies-file to casperjs (below I call the file simply gitlab_cookies.txt).

// login.js

var casper = require('casper').create();

casper.start('https://gitlab.com/users/sign_in');

casper.then(function(){
	this.fillSelectors('form#new_user', {
    	'input[name="user[login]"]':'*****',
    	'input[name="user[password]"]':'*****',
    	'input#user_remember_me':true
    }, true);
});

casper.wait(1000)

casper.run();

// screenshot.js

var casper = require('casper').create();

casper.start('https://gitlab.com/', function(){
  this.capture('logged.png')
});

casper.run();

$ casperjs --cookies-file=gitlab_cookies.txt login.js
$ casperjs --cookies-file=gitlab_cookies.txt screenshot.js

Script 6 – Download files

CasperJS is by no means the most efficient way of downloading³ files. However, if access to the file requires sending some cookies and a cookie jar is already available, it can perform this task as well. The method for that is simply .download().

// download.js

var casper = require('casper').create();

casper.start('https://gitlab.com', function(){
  this.download('https://raw.githubusercontent.com/casperjs/casperjs/master/README.md', 'casperjs-readme.md');
});

casper.run();

$ casperjs --cookies-file=my_cookies.txt download.js

Future development

Unfortunately, PhantomJS and CasperJS have been unmaintained since early 2018. However, they seem to cope well so far. It is also worth noting that CasperJS is only one possible headless browser. This repository on GitHub lists many of them, although they are not all suitable for this kind of task. To me, Puppeteer and CEF Python (the latter can be integrated into an R workflow with reticulate) look like the most promising alternatives.

These are only a few examples of how CasperJS can be scripted. It offers many other methods, which can be consulted on GitHub or on the official website.

¹ Code from datascienceplus.com.
² Alternatively, glue can also be used for this task, but in that case the placeholder syntax would change slightly.
³ Staying inside R, the curl package offers a valid alternative, and a handle from an rvest session can also be passed to it.
