Plot the Scoring Streak of an NHL Player with R

[This article was first published on Data Twirling » R, and kindly contributed to R-bloggers].

I am a big Boston Bruins fan and have enjoyed the ups and downs over the last few years, regardless of the catastrophes that have occurred during the playoffs. The team struggled a few weeks ago but seems to have found its stride recently.

In my opinion, Nathan Horton was a significant factor in those recent wins. For a long chunk of the season, though, it felt like he was in a rut. That got me thinking about how we could use some data to see how "streaky" a player is, in this case Nathan Horton.

The code below uses R to collect what we need from the web and plot the cumulative goals over the course of a season. Each dot on the line should represent a game, so it is easy to view games played versus production across each season of a career.
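
To make the idea concrete before the full script, here is a minimal sketch of a cumulative total that restarts each season. The `toy` data frame and its numbers are made up for illustration and are not Horton's real game log; it also uses base R's `ave()` rather than the plyr call in the script below, so treat it as one possible way to do it.

```r
# Hypothetical toy game log: two seasons with made-up goal counts
toy <- data.frame(
  year  = c(2010, 2010, 2010, 2010, 2011, 2011, 2011),
  game  = c(1, 2, 3, 4, 1, 2, 3),
  goals = c(0, 2, 0, 1, 1, 0, 0)
)

# Running total of goals, restarted at the first game of each season
toy$cumegoals <- ave(toy$goals, toy$year, FUN = cumsum)

print(toy)
```

In a plot of `cumegoals` against game number, flat stretches are games without a goal and steep steps are the hot streaks.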

A big thank you to @Bernd on Stack Overflow for his help and to Hadley for making some great R packages. Also, I am starting to warm up to RStudio and think it's a great tool for those coming from software like SAS or SPSS.

Resulting plot:

## basics
# R 2.12.2
# windows xp; Yes, I know
## libraries
library(XML)
library(plyr)
library(lubridate)
library(ggplot2)
# Set the working directory
setwd("~/My Dropbox/Eclipse/Projects/R/NHL/Blog Posts/Player Streakiness")
# Set the constants
BASE <- "http://www.hockey-reference.com/players/h/hortona01/gamelog/"
SEASON <- c(2004, 2006:2011)
# Loop and grab the data
ds <- data.frame()
for (S in SEASON) {
  URL <- paste(BASE, S, "/", sep="")
  tables <- readHTMLTable(URL)$stats
  # fix factors and names: every column to character, names to lower case
  tables[] <- lapply(tables, as.character)
  names(tables) <- tolower(colnames(tables))
  names(tables)[6] <- "AwayHome"
  names(tables)[8] <- "WinLoss"
  names(tables)[9] <- "goals"
  # fix the columns - non-numeric entries become NA, with coercion warnings
  for (i in c(1:2, 9:19)) {
    tables[, i] <- as.numeric(tables[, i])
  }
  str(tables)
  tables$year <- S
  ds <- rbind.fill(ds, tables)
  # BE KIND when scraping
  Sys.sleep(10)
}
with(ds, table(year))
head(ds, n=30)
dim(ds)
# keep only rows with a numeric rk (game number); other rows are not games
ds <- ds[!is.na(ds$rk), ]
dim(ds)
head(ds, n=30)
save(ds, file="Horton.Rdata")
# Need to change the date to an actual date in R
str(ds)
# parse_date() came from an early lubridate release; base R's as.Date() does the same job here
ds$date <- as.Date(ds$date, format="%Y-%m-%d")
str(ds)
# Put every season on one common time axis by giving all dates the same arbitrary year
# Months before October get that year plus 1 so the dates are in "order" when plotted
ds$date <- update(ds$date, year=2010)
ds$date[month(ds$date) < 10] <- update(ds$date[month(ds$date) < 10], year=2011)
head(ds, n=40)
# Help received from
# http://stackoverflow.com/questions/5494216/extract-date-in-r
# add cumulative goals by season and make a new dataframe
gamelog <- ddply(ds, .(year), transform, cumegoals = cumsum(goals))
# plot the data
ggplot(gamelog, aes(x=date, y=cumegoals)) + geom_point() + geom_line() +
  facet_wrap(~year, ncol=1)
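
The date trick in the script (same arbitrary year for every game, with the months before October pushed into the following year) can be sketched on its own in base R. The helper `align_to_season()` and the sample dates are hypothetical and only illustrate the idea; the script above does the same thing with lubridate's `update()`.

```r
# Hypothetical helper: map any game date onto one common season axis
align_to_season <- function(d, start_year = 2010) {
  d <- as.Date(d, format = "%Y-%m-%d")
  m <- as.integer(format(d, "%m"))
  # October to December stay in start_year; January to September move to the next year
  y <- ifelse(m >= 10, start_year, start_year + 1)
  # Caveat of this sketch: a February 29 game has no counterpart in a
  # non-leap target year and is expected to come out as NA
  as.Date(paste(y, format(d, "%m-%d"), sep = "-"), format = "%Y-%m-%d")
}

# Made-up dates from different seasons
games <- c("2006-10-06", "2007-01-15", "2009-12-31", "2010-04-10")
aligned <- align_to_season(games)
print(aligned)
```

Because every season now shares one October-to-spring axis, the facets line up and an early-season slump in one year can be compared directly with the same weeks of another.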