Scraping Dynamic Websites with PhantomJS
For a recent blog post, I required data on the ELO ratings of national football teams over time. Such a list exists online at eloratings.net, so in theory this was a simple task: point rvest at the html pages on that site and fish out the data I wanted. However, while this works for the static websites that make up the vast majority of sites containing tables of data, it struggles with websites that use JavaScript to dynamically generate pages.
Eloratings.net is one such website that rvest is unable to scrape. For example, the following
library(tidyverse)
library(rvest)

# url to data on Brazil's ELO rating over time
url <- "https://eloratings.net/Brazil"

read <- read_html(url) %>%
  # this is the CSS selector for the page title
  html_nodes("#mainheader")

read
## {xml_nodeset (1)}
## [1] <h1 id="mainheader" class="mainheader"></h1>
does not manage to capture the data displayed in the page's mainheader (it 'should' return "World Football Elo Ratings: Brazil", the title of that page).
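We can confirm that nothing was captured by extracting the node's text, which comes back as an empty string; the header is only populated by JavaScript after the page loads:

# the node exists, but its text is empty:
# the content is injected by JavaScript at load time
read %>% html_text()
## [1] ""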
Instead, what we want to do is save a copy of the generated page as a .html file and then read that into R using read_html(). Luckily, a way exists to do just that, using the (now deprecated, but still working) PhantomJS headless browser. Much of the code I used to get going with this is adapted from a tutorial here.
First, you want to install PhantomJS from the above website and run through its quick start guide. The guide is pretty thorough, but I would say there are really only three steps from installation to getting going:
- Add phantomjs to the system PATH
- Open a text editor and save one of the tutorial scripts as filename.js
- Run phantomjs C:/Users/usr/path/to/file.js in a command-line console (or check the setup from R, as sketched below)
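Before going further, it's worth checking that R can actually find PhantomJS. Here is a minimal sanity check, assuming phantomjs has been added to your PATH (if the PATH step didn't work, you can always fall back on the full path to the executable, as I do later):

# should print the PhantomJS version (e.g. 2.1.1) if it is on the PATH
system2("phantomjs", args = "--version")

# locate the executable R will actually use ("" means it wasn't found)
Sys.which("phantomjs")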
The file we’re going to use to render the js pages and then save the html is below:
// scrapes a given url (for eloratings.net)

// create a webpage object
var page = require('webpage').create(),
    system = require('system');

// the url for each country provided as an argument
var country = system.args[1];

// include the File System module for writing to files
var fs = require('fs');

// specify the path to the output file
// we'll just overwrite it iteratively with a page in the same directory
var path = 'elopage.html';

page.open(country, function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});
(which, again, is stolen and adapted from here)
This is saved as scrape_ELO.js in the static directory of my blog folder.
To keep everything in R, we can use the system() family of functions, which provides access to the OS command line. The referenced tutorial uses system(), but it only scrapes a single hard-coded page. To iteratively scrape every country, we'll instead use system2() and pass an argument (country) containing the link to that country's page on eloratings.net.
For example, for Brazil we will provide "https://www.eloratings.net/Brazil" as the country argument:
phantom_dir <- "C:/Users/path/to/scrape_ELO/"
country_url <- "https://www.eloratings.net/Brazil"

# use system2 to invoke phantomjs via its executable
system2("C:/Users/path/to/phantomjs-2.1.1-windows/bin/phantomjs.exe",
        # provide the path to the scraping script and the country url as arguments
        args = c(file.path(phantom_dir, "scrape_ELO.js"), country_url))
We can then read in this saved html page using rvest as per usual and recover the information therein.
# read in the saved html file
page <- read_html("elopage.html")

# scrape with rvest as normal
country_name <- page %>%
  html_nodes("#mainheader") %>%
  html_text() %>%
  gsub("World Football Elo Ratings: ", "", .)

country_name
## [1] "Brazil"
I'm not going to include my full script for scraping eloratings.net, as usually one reason for obscuring data like this is to prevent exactly what I'm doing. Instead, I'll give a skeleton of the function I use. If you are having problems setting up PhantomJS to scrape pages, my contact details are listed on my blog homepage.
scrape_nation <- function(country) {
  # download the page
  url <- paste0("https://eloratings.net/", country)
  system2("C:/Users/path/to/phantomjs-2.1.1-windows/bin/phantomjs.exe",
          args = c(file.path(phantom_dir, "scrape_ELO.js"), url))

  # read in downloaded page
  page <- read_html("elopage.html")

  # recover information
  country_name <- page %>%
    html_nodes("#mainheader") %>%
    html_text() %>%
    gsub("World Football Elo Ratings: ", "", .)

  opposing <- page %>%
    html_nodes(".r1 a") %>%
    html_text()

  teams <- page %>%
    html_nodes(".r1")

  # split_teams(), split_ratings(), split_rankings() and convert_date()
  # are small parsing helpers (not shown)
  fixtures <- map2_df(teams, opposing, split_teams)

  ratings <- page %>%
    html_nodes(".r4") %>%
    html_text() %>%
    map_df(., split_ratings)

  rankings <- page %>%
    html_nodes(".r6") %>%
    map_df(., split_rankings)

  dates <- page %>%
    html_nodes(".r0") %>%
    html_text() %>%
    map_df(., convert_date)

  # bind into a data frame
  df <- fixtures %>%
    cbind(., ratings) %>%
    cbind(., rankings) %>%
    cbind(., dates) %>%
    mutate(table_country = country_name)
}

elo_data <- map_df(country_links, scrape_nation)
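Those split_*() and convert_date() calls are small parsing helpers I haven't shown. Purely as an illustration, here is a hypothetical sketch of what split_ratings() might look like, assuming each .r4 cell holds the two teams' ratings as a single whitespace-separated string (the real cell format on eloratings.net may well differ):

# hypothetical sketch: parse "1957 1744"-style cell text into a
# one-row data frame of home/away ratings (assumed input format)
split_ratings <- function(rating_text) {
  ratings <- strsplit(trimws(rating_text), "\\s+")[[1]]
  tibble(
    r1 = as.numeric(ratings[1]),  # home team's rating
    r2 = as.numeric(ratings[2])   # away team's rating
  )
}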
Finally, we want to convert this to long format. We have two observations per country at any point in time: the rating and the ranking. For the blog post I needed the data for, I took just the ranking data in the end. Here, I'm going to do the opposite and take only the rating data, to make a nice little plot of national teams' ratings over time.
# %<>% is the magrittr assignment pipe
library(magrittr)

elo_data %<>%
  mutate(date = as.Date(date)) %>%
  # rename and select variables
  select(
    date, home, away,
    rating_home = r1, rating_away = r2,
    ranking_home = ranking1, ranking_away = ranking2
  ) %>%
  # gather twice to convert to long format
  gather(
    "location", "nation",
    -rating_home, -rating_away, -ranking_home, -ranking_away, -date
  ) %>%
  gather("measure", "value", -nation, -date, -location) %>%
  # take only relevant information
  filter(
    (location == "home" & measure %in% c("rating_home", "ranking_home")) |
      (location == "away" & measure %in% c("rating_away", "ranking_away"))
  ) %>%
  separate(measure, into = c("measure", "location"), "_") %>%
  # keep only the relevant data
  filter(!duplicated(.)) %>%
  filter(date > "1950-01-01") %>%
  filter(measure == "rating") %>%
  select(-measure, rating = value, -location)

# print the df
head(elo_data)
##         date nation rating
## 1 1950-05-06 Brazil   1957
## 2 1950-05-07 Brazil   1969
## 3 1950-05-13 Brazil   1961
## 4 1950-05-14 Brazil   1965
## 5 1950-05-18 Brazil   1969
## 6 1950-06-24 Brazil   1991
To cap off this little post, I decided to use gganimate to show how the ratings of some nations have changed over time. It's a nice little sanity check that we've scraped the data correctly, but also, as a football nerd, I enjoy seeing how nations have risen and fallen over the years.
library(gganimate)

p <- elo_data %>%
  # select out a few nations
  filter(nation %in% c(
    "Brazil", "England", "Canada",
    "Hungary", "Nigeria", "Japan"
  )) %>%
  # going to take the average over every 4 months
  # could use zoo::rollmean but also want to cut down plotting
  mutate(month = as.numeric(format(date, "%m")),
         year = as.numeric(format(date, "%Y"))) %>%
  mutate(third = case_when(
    month < 5 ~ 0,
    month < 9 ~ 33,
    TRUE ~ 66
  )) %>%
  mutate(year = as.numeric(paste0(year, ".", third))) %>%
  group_by(nation, year) %>%
  summarise(rating_av = mean(rating)) %>%
  ungroup() %>%
  # pipe into ggplot
  ggplot(aes(x = year, y = rating_av, group = nation)) +
  # coloured line per nation
  geom_line(size = 1.5, aes(colour = nation)) +
  scale_colour_manual(values = c("goldenrod", "red", "grey60",
                                 "green", "darkblue", "forestgreen")) +
  labs(title = "ELO Rating of Selected Nations over Time",
       subtitle = "data from eloratings.net",
       x = "year", y = "ELO rating") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  # gganimate reveal
  transition_reveal(year)

# render the animation
gif <- animate(p, nframes = 20)
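The animate() call renders the animation to an object; to write it out as a .gif you can use gganimate's anim_save() (the filename here is just an example):

# save the rendered animation to disk
anim_save("elo_ratings.gif", animation = gif)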
The resulting animation:
![](https://i.imgur.com/4cgBX48.gif)
And the data looks good! The Mighty Magyars of 1950s Hungary can be seen peaking before the nation's long decline, whereas the opposite is true for Japan. Overall, I'm pretty happy with the result. It could surely be cleaned up using rolling means and more careful plotting, but for a small example plotting the output from the scraping (the real point of this post), it serves its purpose.