Find NHL Players with 30 Goals and 100 PIM using R

[This article was first published on Data Twirling » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Last week Jack Edwards raised the fact that Milan Lucic was the first Bruin player to join the 30 Goal / 100 Penalty Minute club in a few years.  It got me thinking about the other players who have accomplished this feat and wanted to hack up some data using R.  The code below grabs the necessary data and generates a very basic analysis of all skaters since 1960.

It appears that this type of player was more prevalent during the 80′s and has declined in recent years.  If you are skilled with R, I apologize for the sloppy code.  As always, if you grabbing data from external websites, please save your data and be kind servers when scraping large volumes of data.

 

# Source the file with the Grab Skaters function
setwd("~/My Dropbox/Eclipse/Projects/R/NHL/Blog Posts/Scripts")
source("SkaterStats.R")
## set the working directory
setwd("~/My Dropbox/Eclipse/Projects/R/NHL/Blog Posts/30 Goals 100 PIM")
## set the seasons
SEASON <- as.character(c(1960:2004, 2006:2011))
#-----------------------------------------------------------------------
# Use the function to loop over the seasons and piece together
#-----------------------------------------------------------------------
## define the seasons -- 2005 dataset doesnt exist
## if I was a good coder I would trap the error, but this works
SEASON <- as.character(c(1960:2004, 2006:2011))
## create an empy dataset that we will append to
dataset <- data.frame()
## loop over the seasons, use the function to grab the data
## and build the dataset
for (S in SEASON) {
require(plyr)
temp <- GrabSkaters(S)
dataset <- rbind.fill(dataset, temp)
print(paste("Completed Season ", S, sep=""))
## pause the script so we don't kill their servers
Sys.sleep(10)
}
head(dataset); tail(dataset);
dataset <- dataset[dataset$tm != 'TOT', ]
save(dataset, file="hr skater db.Rdata")
#-----------------------------------------------------------------------
# Lets get a sense for how many players finish 30/100
#-----------------------------------------------------------------------
# Let's look at a scatterplot of the data
png("Scatter Plot.png")
plot(x=dataset$pim, y=dataset$g, type="p", pch=20, cex=.9,
xlab="Penalty Mins", ylab="Goals Scored", bty="n",
main="Correlate Goals Scored and Penalty Minutes")
abline(h=30, lty=2, col="red"); abline(v=100, lty=2, col="red");
dev.off()
# raw output
dstemp <- transform(dataset, ind30100 = as.numeric(g >=30 & pim >=100))
str(dstemp)
with(dstemp, addmargins(table(season, ind30100)))
with(dstemp, addmargins(table(season, ind30100)))
# % of players that meet criteria per season
library(sqldf)
(pct <- sqldf("SELECT season, AVG(ind30100) as pct FROM dstemp GROUP BY season"))
# basic plot of the data
barplot(pct$pct, names.arg=pct$season)
# trying to learn ggplot
library(ggplot2)
ds <- dataset[dstemp$g >=30 & dstemp$pim >=100, c("season", "player")]
min(as.numeric(ds$season)); as.numeric(max(ds$season));
p <- ggplot(ds, aes(as.numeric(season))) + geom_bar(fill="black", colour="white", binwidth=1)
p <- p + labs(x="Season", y="# Players")
ggsave(p, file="Distribution.png")
# look at a listing of the data again
View(pct)
# Who are the players in the 2011 season that meet this criteria
dataset[dataset$season == "2011" & dataset$g >= 30 & dataset$pim >=100, c("player", "g", "pim")]
#######################################################################################
# Function to scrape season skater statistics from Hockey-reference.com
#######################################################################################
GrabSkaters <- function(S) {
# The function takes parameter S which is a string and represents the Season
# Returns: data frame
require(XML)
## create the URL
URL <- paste("http://www.hockey-reference.com/leagues/NHL_",
S, "_skaters.html", sep="")
## grab the page -- the table is parsed nicely
tables <- readHTMLTable(URL)
ds.skaters <- tables$stats
## determine if the HTML table was well formed (column names are the first record)
## can either read in directly or need to force column names
## and
## I don't like dealing with factors if I don't have to
## and I prefer lower case
for(i in 1:ncol(ds.skaters)) {
ds.skaters[,i] <- as.character(ds.skaters[,i])
names(ds.skaters) <- tolower(colnames(ds.skaters))
}
## fix a couple of the column names
colnames(ds.skaters)
## names(ds.skaters)[10] <- "plusmin"
names(ds.skaters)[11] <- "plusmin"
names(ds.skaters)[18] <- "spct"
## finally fix the columns - NAs forced by coercion warnings
for(i in c(1, 3, 6:18)) {
ds.skaters[,i] <- as.numeric(ds.skaters[, i])
}
## convert toi to seconds, and seconds/game
## ds.skaters$seconds <- (ds.skaters$toi*60)/ds.skaters$gp
## remove the header and totals row
ds.skaters <- ds.skaters[!is.na(ds.skaters$rk), ]
## ds.skaters <- ds.skaters[ds.skaters$tm != "TOT", ]
## add the year
ds.skaters$season <- S
## return the dataframe
return(ds.skaters)
}
view raw SkaterStats.R hosted with ❤ by GitHub

To leave a comment for the author, please follow the link and comment on their blog: Data Twirling » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)