Use rvest to scrape NFL weather data
If you are following my progress in the Data Science Learning Club, you might know that I am using NFL data for the tasks. For predicting sports events, I think it is important to have statistics not only about the players, teams and previous games but also about the weather. From my time as a soccer player I can tell you that it makes quite a difference whether it is snowing, whether it is 30°C or more, or whether the weather is moderate. One could argue that the weather influences both teams and therefore no one has an advantage, but I think everyone responds differently to different conditions.
The data source
After only a short search, I found a website called NFLWeather which provides weather forecasts for every match back to 2009.
Web scraping: rvest
I had looked into web scraping before, but it always seemed like a dirty and cumbersome task to me.
Since my experience is that almost everything related to data has already been implemented nicely by someone in R, I wanted to give it another try. I found the package rvest by @hadleywickham, which is always a very good sign with respect to R package quality.
The code
Checking out their archive, I figured out the structure of their links and that they go back to 2009. So I wrote this function to parse a page, find the first table (there is only one), and convert it to a data.frame:
load_weather <- function(year, week) {
  base_url <- "http://nflweather.com/week/"
  if (year == 2010) {
    # necessary because of different file naming
    start_url <- paste0(base_url, year, "/", week, "-2/")
  } else {
    start_url <- paste0(base_url, year, "/", week, "/")
  }
  if (year == 2013 && week == "pro-bowl") {
    return(NULL)
  }
  tryCatch({
    page <- html(start_url, encoding = "ISO-8859-1")
    table <- page %>% html_nodes("table") %>% .[[1]] %>% html_table()
    table <- cbind("Year" = year, "Week" = week,
                   table[, c("Away", "Home", "Forecast", "Extended Forecast", "Wind")])
    return(table)
  }, error = function(e) {
    print(paste(e, "Year", year, "Week", week))
    return(NULL)
  })
}
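To make the link pattern concrete before walking through the function, here are illustrative URLs built the same way the function builds them (the year and week are just examples, not taken from the site archive):

# illustrative only: the link pattern load_weather() builds, for a normal year and for 2010
paste0("http://nflweather.com/week/", 2012, "/", "week-3", "/")
# "http://nflweather.com/week/2012/week-3/"
paste0("http://nflweather.com/week/", 2010, "/", "week-3", "-2/")
# "http://nflweather.com/week/2010/week-3-2/"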
The function got a lot longer than anticipated, but let me explain it:
- Parameters: year and week
- start_url is built from base_url, which is always the same, plus the two parameters. The only difference is for the year 2010, where for no apparent reason "-2" is appended to each link.
- We have to skip the Pro Bowl week in 2013, because that page does not exist.
- Then there is some error handling, because other pages might not exist or might become unavailable.
- The call to html() parses the page (html() is actually deprecated and read_html() should be used instead, but I currently have an older version of R running; a read_html() sketch follows the snippet below).
- The pipeline html_nodes("table") %>% .[[1]] %>% html_table() uses the magrittr pipe operator, as in the package examples, but this can also be done without it; just see the code below.
- Finally, cbind() creates a data.frame with only the columns I need and adds the Year and Week information to each row.
html_table(html_nodes(page, "table")[[1]])
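If you are on a newer rvest where html() is gone, the parsing step would look like the sketch below, and a single call to load_weather() returns one week's table. This is only a sketch; it assumes the site is still reachable and its table layout has not changed:

library(rvest)

# newer rvest: read_html() replaces the deprecated html()
page <- read_html("http://nflweather.com/week/2014/week-3/")
html_table(html_nodes(page, "table")[[1]])

# example call: weather for one regular-season week
week3_2014 <- load_weather(2014, "week-3")
head(week3_2014)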
This is how I call the code to build one large data.frame:
weather_data <- data.frame("Year" = integer(0), "Week" = character(0),
                           "Away" = character(0), "Home" = character(0),
                           "Forecast" = character(0), "Extended Forecast" = character(0),
                           "Wind" = character(0))

for (y in years) {
  for (w in weeks) {
    weather_data <- rbind(weather_data, load_weather(y, w))
  }
}
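A small design note: growing weather_data with rbind() inside the double loop is fine at this size, but the same result can also be collected in one go. This is just an alternative sketch, assuming load_weather(), years and weeks as defined in this post:

# build all (year, week) combinations, fetch each page, then bind once
combos <- expand.grid(year = years, week = weeks, stringsAsFactors = FALSE)
pieces <- Map(load_weather, combos$year, combos$week)  # NULLs for skipped/missing pages
weather_data <- do.call(rbind, pieces)                 # rbind() ignores the NULLs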
The output
The output is a data.frame with 2832 rows, one per game, with the columns Year, Week, Away, Home, Forecast, Extended Forecast and Wind.
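To verify the result on your end, a quick inspection is enough; note that the exact row count depends on which pages were reachable when you ran the scraper:

dim(weather_data)   # expect roughly 2832 rows and 7 columns
str(weather_data)   # Year, Week, Away, Home, Forecast, Extended Forecast, Wind
head(weather_data)  # first few games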
Download complete code
The complete source is below.
library(rvest)

years <- 2009:2015
weeks <- c(paste0("pre-season-week-", 1:4), paste0("week-", 1:17),
           "wildcard-weekend", "divisional-playoffs", "conf-championships",
           "pro-bowl", "superbowl")

load_weather <- function(year, week) {
  base_url <- "http://nflweather.com/week/"
  if (year == 2010) {
    # necessary because of different file naming
    start_url <- paste0(base_url, year, "/", week, "-2/")
  } else {
    start_url <- paste0(base_url, year, "/", week, "/")
  }
  if (year == 2013 && week == "pro-bowl") {
    return(NULL)
  }
  tryCatch({
    page <- html(start_url, encoding = "ISO-8859-1")
    table <- page %>% html_nodes("table") %>% .[[1]] %>% html_table()
    table <- cbind("Year" = year, "Week" = week,
                   table[, c("Away", "Home", "Forecast", "Extended Forecast", "Wind")])
    return(table)
  }, error = function(e) {
    print(paste(e, "Year", year, "Week", week))
    return(NULL)
  })
}

weather_data <- data.frame("Year" = integer(0), "Week" = character(0),
                           "Away" = character(0), "Home" = character(0),
                           "Forecast" = character(0), "Extended Forecast" = character(0),
                           "Wind" = character(0))

for (y in years) {
  for (w in weeks) {
    weather_data <- rbind(weather_data, load_weather(y, w))
  }
}

#### code without pipe ####
html_table(html_nodes(page, "table")[[1]])
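If you want to reuse the scraped data without hitting the site again, one option is to save it to a CSV; the file name here is just a placeholder:

# save the scraped table for later use; "nfl_weather.csv" is an arbitrary file name
write.csv(weather_data, "nfl_weather.csv", row.names = FALSE)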