Plot the Scoring Streak of an NHL Player with R
I am a big Boston Bruins fan and have enjoyed the ups and downs over the last few years, regardless of the catastrophes that have occurred during the playoffs. The team struggled a few weeks ago but seems to have found its stride recently.
In my opinion, Nathan Horton was a significant factor in those recent wins. For a long chunk of the season, though, it felt like he was in a rut. That got me thinking about how we could use some data to see how “streaky” a player is, in this case Nathan Horton.
The code below uses R to collect the game logs we need from the web and plot cumulative goals over the course of each season. Each dot on the line represents a game, so it is easy to compare games played against production across each season of a career.
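To make the idea of "cumulative goals by season" concrete before the full script, here is a minimal sketch on made-up numbers. The `toy` data frame and its values are purely illustrative (they are not Horton's stats), and it uses base R's `ave()` rather than the plyr call in the script, so it runs without any packages.

```r
# Toy game log: one row per game, goals scored in that game (made-up numbers).
toy <- data.frame(
  year  = c(2010, 2010, 2010, 2011, 2011, 2011),
  goals = c(0, 2, 0, 1, 0, 1)
)

# Running total of goals, restarted at the first game of each season.
toy$cumgoals <- ave(toy$goals, toy$year, FUN = cumsum)

print(toy)
```

When this running total is plotted against game date, a flat stretch of dots is a scoring drought and a steep stretch is a hot streak.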
A big thank you to @Bernd on Stack Overflow for his help and to Hadley for making some great R packages. Also, I am starting to warm up to RStudio and think it's a great tool for those coming from software like SAS or SPSS.
Resulting plot:
## basics
# R 2.12.2
# windows xp; Yes, I know
## libraries
library(XML)
library(plyr)
library(lubridate)
library(ggplot2)
# Set the working directory
setwd("~/My Dropbox/Eclipse/Projects/R/NHL/Blog Posts/Player Streakiness")
# Set the constants
BASE <- "http://www.hockey-reference.com/players/h/hortona01/gamelog/"
SEASON <- c(2004, 2006:2011)
# Loop over the seasons and grab each game log
ds <- data.frame()
for (S in SEASON) {
  URL <- paste(BASE, S, "/", sep="")
  tables <- readHTMLTable(URL)$stats
  head(tables, n=30)
  # fix factors: convert every column to character
  for(i in 1:ncol(tables)) {
    tables[,i] <- as.character(tables[,i])
  }
  # fix names: lower-case once, after the column loop
  names(tables) <- tolower(colnames(tables))
  str(tables)
  names(tables)[6] <- "AwayHome"
  names(tables)[8] <- "WinLoss"
  names(tables)[9] <- "goals"
  # convert the numeric columns; "NAs introduced by coercion" warnings are expected
  str(tables)
  for(i in c(1:2, 9:19)) {
    tables[,i] <- as.numeric(tables[, i])
  }
  str(tables)
  tables$year <- S
  ds <- rbind.fill(ds, tables)
  # BE KIND when scraping
  Sys.sleep(10)
}
with(ds, table(year))
head(ds, n=30)
dim(ds)
ds <- ds[!is.na(ds$rk), ]
dim(ds)
head(ds, n=30)
save(ds, file="Horton.Rdata")
# Need to change the date to an actual date in R
# (ymd() parses "YYYY-MM-DD" strings; older lubridate releases offered parse_date() for this)
str(ds)
ds$date <- ymd(ds$date)
str(ds)
# Put every season on the same axis by giving all dates the same arbitrary year
# Set the months before October to that year plus 1 so the dates are in "order" when plotted
ds$date <- update(ds$date, year=2010)
ds$date[month(ds$date) < 10] <- update(ds$date[month(ds$date) < 10], year=2011)
head(ds, n=40)
# Help received from
# http://stackoverflow.com/questions/5494216/extract-date-in-r
# add cumulative goals by season and make a new dataframe
gamelog <- ddply(ds, .(year), transform, cumegoals = cumsum(goals))
# plot the data
ggplot(aes(y=cumegoals, x=date), data=gamelog) + geom_point() + geom_line() +
  facet_wrap(~year, ncol=1)
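
The date handling in the script maps every game onto one arbitrary reference season (October to December in 2010, everything else in 2011) so that the facets share a comparable x-axis. A minimal base-R sketch of the same idea is below. The helper name `align_to_season`, its `start_year` argument, and the sample dates are hypothetical and only for illustration; the script itself does this with lubridate's `update()`.

```r
# Hypothetical helper: map real game dates onto one reference season so that
# seasons from different years line up on the same x-axis.
# October-December land in `start_year`, January-September in `start_year + 1`.
align_to_season <- function(dates, start_year = 2010) {
  month_num   <- as.integer(format(dates, "%m"))
  target_year <- as.integer(ifelse(month_num >= 10, start_year, start_year + 1))
  # Rebuild each date with the new year and the original month and day.
  # A 29 February game would come back as NA when the target year is not a leap year.
  as.Date(sprintf("%d-%s", target_year, format(dates, "%m-%d")),
          format = "%Y-%m-%d")
}

# Made-up game dates from four different seasons.
games   <- as.Date(c("2006-10-06", "2007-01-15", "2008-11-20", "2009-04-02"))
aligned <- align_to_season(games)
print(aligned)
```

After alignment the dates sort in season order (October first, April last) regardless of the year the game was actually played, which is what lets each facet read left to right as a single season.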