
Scraping Dynamic Websites with PhantomJS

[This article was first published on rstats on Robert Hickman, and kindly contributed to R-bloggers.]

For a recent blogpost, I required data on the Elo ratings of national football teams over time. Such a list exists online at eloratings.net, so in theory this was a simple task for rvest: read the html pages on that site and fish out the data I wanted. However, while this works for the static websites that make up the vast majority of sites containing tables of data, it struggles with websites that use JavaScript to generate pages dynamically.

Eloratings.net is one such site that rvest is unable to scrape. For example:

library(tidyverse)
library(rvest)

# url to data on Brazil's ELO rating over time
url <- "https://eloratings.net/Brazil"

read <- read_html(url) %>%
  # this is the CSS selector for the page title
  html_nodes("#mainheader")

read
## {xml_nodeset (1)}
## [1] <h1 id="mainheader" class="mainheader"></h1>

this does not capture the data displayed in the page’s mainheader (it ‘should’ return “World Football Elo Ratings: Brazil”, the title of that page).
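
Since the returned node has no children, extracting its text confirms there is nothing there to scrape in the static HTML:

# the header node exists in the static HTML, but its text is empty
read %>% html_text()
## [1] ""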

Instead, what we want to do is save a copy of the generated page as a .html file and then read that into R using read_html(). Luckily, a way exists to do just that, using the (now deprecated, but still working) PhantomJS headless browser. Much of the code I used to get going with this is adapted from a tutorial here.

First you want to install PhantomJS from the above website and run through its quick start guide. The guide is pretty thorough; I would say there are really only three steps between installation and getting going (a quick check from R is sketched after the list):

  1. Add phantomjs to the system PATH
  2. Open a text editor and save one of the tutorial scripts as filename.js
  3. run > phantomjs C:/Users/usr/path/to/file.js in a command line console
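
To confirm the PATH step worked, you can run a quick sanity check from R before going any further (base R only; Sys.which() returns an empty string if the executable can’t be found):

# full path to the phantomjs binary, or "" if it isn't on the PATH
Sys.which("phantomjs")

# or ask phantomjs to print its version
system2("phantomjs", args = "--version")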

The script we’re going to use to render the JS-generated pages and then save the html is below:

// scrapes a given url (for eloratings.net)

// create a webpage object and get access to command-line arguments
var page = require('webpage').create(),
  system = require('system');

// the url for each country is provided as an argument
var country = system.args[1];

// include the File System module for writing to files
var fs = require('fs');

// specify the path to the output file
// we'll just overwrite it iteratively with a page in the same directory
var path = 'elopage.html';

// open the page, dump the rendered html to file, and exit
page.open(country, function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});

(which, again, is stolen and adapted from here)

This is saved as scrape_ELO.js in the static directory of my blog folder.

To keep everything in R, we can use the system() family of functions, which provides access to the OS command line. The referenced tutorial uses system(), but only to scrape a single hard-coded page. To scrape every country iteratively, we’ll use system2() instead, passing an argument (country) that contains the link to that country’s page on eloratings.net.

For Brazil, for example, we will provide “https://www.eloratings.net/Brazil” as the country argument:

phantom_dir <- "C:/Users/path/to/scrape_ELO/"
country_url <- "https://www.eloratings.net/Brazil"

# use system2 to invoke phantomjs via its executable
system2("C:/Users/path/to/phantomjs-2.1.1-windows/bin/phantomjs.exe",
        # provide the path to the scraping script and the country url as arguments
        args = c(file.path(phantom_dir, "scrape_ELO.js"), country_url))
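
Because the output path in scrape_ELO.js is relative, the rendered page lands in the directory the process was launched from, i.e. the current R working directory. It’s worth checking the file actually arrived before trying to parse it:

# the phantomjs script writes elopage.html relative to the working directory
file.exists("elopage.html")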

We can then read in this saved html page using rvest as per usual and recover the information therein.

# read in the saved html file
page <- read_html("elopage.html")

# scrape with rvest as normal
country_name <- page %>%
  html_nodes("#mainheader") %>%
  html_text() %>%
  # strip everything up to and including "Elo Ratings: " from the title
  gsub(".*Elo Ratings: ", "", .)

country_name
## [1] "Brazil"

I’m not going to include my full script for scraping eloratings.net, as sites usually obscure their data in this way precisely to prevent what I’m doing here. Instead, I’ll give a skeleton of the function I use. If you have problems setting up PhantomJS to scrape pages, my contact details are listed on my blog homepage.

scrape_nation <- function(country) {
  # download the page
  url <- paste0("https://eloratings.net/", country)
  system2("C:/Users/path/to/phantomjs-2.1.1-windows/bin/phantomjs.exe", 
          args = c(file.path(phantom_dir, "scrape_ELO.js"), url))
  
  # read in downloaded page
  page <- read_html("elopage.html")
  
  # recover information
  country_name <- page %>%
    html_nodes("#mainheader") %>%
    html_text() %>%
    gsub(".*Elo Ratings: ", "", .)
  
  opposing <- page %>%
    html_nodes(".r1 a") %>%
    html_text()
  
  teams <- page %>%
    html_nodes(".r1")
  
  fixtures <- map2_df(teams, opposing, split_teams)

  ratings <- page %>%
    html_nodes(".r4") %>%
    html_text() %>%
    map_df(., split_ratings)
  
  rankings <- page %>%
    html_nodes(".r6") %>%
    map_df(., split_rankings)

  dates <- page %>%
    html_nodes(".r0") %>%
    html_text() %>%
    map_df(., convert_date)

  # bind the components into a single data frame
  df <- fixtures %>%
    cbind(., ratings) %>%
    cbind(., rankings) %>%
    cbind(., dates) %>%
    mutate(table_country = country_name)
  
  return(df)
}
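
The helper functions (split_teams, split_ratings, split_rankings, convert_date) just parse the text out of each set of nodes and are left out, as above. country_links is also undefined here; as a minimal, hypothetical stand-in, it only needs to be a character vector of eloratings.net page names, since scrape_nation() pastes each one onto the base url:

# hypothetical stand-in for country_links: one page name per nation,
# appended to https://eloratings.net/ inside scrape_nation()
country_links <- c("Brazil", "England", "Canada", "Hungary", "Nigeria", "Japan")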

elo_data <- map_df(country_links, scrape_nation)

Finally, we want to convert this to long format. We have two observations per country at any point in time: the rating and the ranking. For the blogpost that motivated this scraping I only needed the ranking data; here I’ll do the opposite and take only the rating data, to make a nice little plot of national teams’ ratings over time.

elo_data <- elo_data %>%
  mutate(date = as.Date(date)) %>%
  # rename and select variables
  select(
    date,
    home, away,
    rating_home = r1, rating_away = r2,
    ranking_home = ranking1, ranking_away = ranking2
  ) %>%
  # melt twice to convert to long format
  gather(
    "location", "nation",
    -rating_home, -rating_away, -ranking_home, -ranking_away, -date
  ) %>%
  gather("measure", "value", -nation, -date, -location) %>%
  # take only relevant information
  filter(
    (location == "home" & measure %in% c("rating_home", "ranking_home")) |
      (location == "away" & measure %in% c("rating_away", "ranking_away"))
  ) %>%
  separate(measure, into = c("measure", "location"), "_") %>%
  # keep unique observations; post-1950 ratings only
  filter(!duplicated(.)) %>%
  filter(date > "1950-01-01") %>%
  filter(measure == "rating") %>%
  select(date, nation, rating = value)

# print the df
head(elo_data)
##         date nation rating
## 1 1950-05-06 Brazil   1957
## 2 1950-05-07 Brazil   1969
## 3 1950-05-13 Brazil   1961
## 4 1950-05-14 Brazil   1965
## 5 1950-05-18 Brazil   1969
## 6 1950-06-24 Brazil   1991

To cap off this little post, I decided to use gganimate to show how the ratings of some nations have changed over time. It’s a nice little sanity check that we’ve scraped the data correctly, but also, as a football nerd, I enjoy seeing how nations have risen and fallen over the years.

library(gganimate)

p <- elo_data %>%
  # select out a few nations
  filter(nation %in% c(
    "Brazil",
    "England",
    "Canada",
    "Hungary",
    "Nigeria",
    "Japan"
  )) %>%
  # going to take the average over every 4 months
  # could use zoo::rollmean but also want to cut down plotting
  mutate(month = as.numeric(format(date, "%m")),
         year = as.numeric(format(date, "%Y"))) %>%
  mutate(third = case_when(
    month < 5 ~ 0,
    month < 9 ~ 33,
    TRUE ~ 66
  )) %>%
  mutate(year = as.numeric(paste0(year, ".", third))) %>%
  group_by(nation, year) %>%
  summarise(rating_av = mean(rating)) %>%
  ungroup() %>%
  # pipe into ggplot
  ggplot(aes(x = year, y = rating_av, group = nation)) +
  # coloured line per nations
  geom_line(size = 1.5, aes(colour = nation)) +
  scale_colour_manual(values = c("goldenrod", "red", "grey60", "green", "darkblue", "forestgreen")) +
  labs(title = "ELO Rating of Selected Nations over Time",
       subtitle = "date from eloratings.net",
       x = "year",
       y = "ELO rating") +
  theme_minimal() +
  theme(legend.position="bottom") +
  # gganimate reveal
  transition_reveal(year)

# render the animation
gif <- animate(p, nframes = 20)
# save the gif to disk (the filename is arbitrary)
anim_save("elo_ratings.gif", gif)

Which, when rendered, gives us:

![](https://i.imgur.com/4cgBX48.gif)

And the data looks good! The Mighty Magyars of 1950s Hungary can be seen peaking before the nation’s long decline, whereas the opposite is true for Japan. Overall, I’m pretty happy with the result. It could surely be cleaned up with rolling means and more careful plotting, but for a small example plotting the output of the scraping (the real point of this post), it serves its purpose.
