
Pro Football Data

[This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers.]

I’ve made the acquaintance of a group of data analysts here in the Triangle and have agreed to arrange an outing to a Durham Bulls minor league baseball game. Because it’s for stat nerds and because I was curious, I went looking for some baseball data to analyze. I found loads of it here, but soon got distracted by the presence of NFL statistics. The season is already well underway, but I thought it might be fun to try to build a predictive model for the sport.

The first step is to get some data. The function below uses readHTMLTable from the XML package to pull the HTML tables from the site, one season at a time, and lubridate to build the game dates.

library(XML)        # readHTMLTable()
library(lubridate)  # mdy(), month(), year()

GetGamesHistory = function(FirstYear = 1985, LastYear = 2011)
{
  games.URL.stem = "http://www.pro-football-reference.com/years/"

  for (year in FirstYear:LastYear)
  {
    URL = paste(games.URL.stem, year, "/games.htm", sep="")

    games = readHTMLTable(URL)

    dfThisSeason = games[[1]]

    # Clean up the df
    dfThisSeason = subset(dfThisSeason, Week!="Week")
    dfThisSeason = subset(dfThisSeason, Week!="")
    dfThisSeason$Date = as.character(dfThisSeason$Date)
    dfThisSeason$GameDate = mdy(paste(dfThisSeason$Date, year))

    year(dfThisSeason$GameDate) = with(dfThisSeason, ifelse(month(GameDate) <=6, year(GameDate)+1, year(GameDate)))

    if (year == FirstYear)
    {
      dfAllSeasons = dfThisSeason
    } else {
      dfAllSeasons = rbind(dfAllSeasons, dfThisSeason)
    }

  }

  dfAllSeasons = dfAllSeasons[,c(14, 1, 5, 7, 8, 9)]

  colnames(dfAllSeasons) = c("GameDate", "Week", "Winner", "Loser", "WinnerPoints", "LoserPoints")

  dfAllSeasons$Winner = as.character(dfAllSeasons$Winner)
  dfAllSeasons$Loser = as.character(dfAllSeasons$Loser)
  dfAllSeasons$WinnerPoints = as.integer(as.character(dfAllSeasons$WinnerPoints))
  dfAllSeasons$LoserPoints = as.integer(as.character(dfAllSeasons$LoserPoints))
  dfAllSeasons$ScoreDifference = dfAllSeasons$WinnerPoints - dfAllSeasons$LoserPoints

  dfAllSeasons = subset(dfAllSeasons, !is.na(ScoreDifference))

  return (dfAllSeasons)

}
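
The double conversion of the points columns in the function above can look redundant, so here is a minimal sketch of why it is there. It assumes only that the points arrive as a factor; the scores are made up for illustration and are not real game results.

```r
# Made-up scores, stored as a factor the way the scraped points columns arrive
points = factor(c("27", "3", "14", "27"))

# as.integer() on a factor returns the internal level codes, not the scores
codes = as.integer(points)

# Converting to character first recovers the values that are actually printed
scores = as.integer(as.character(points))

# An equivalent form: convert each level once, then index by the factor
scores.by.level = as.integer(levels(points))[points]

print(codes)
print(scores)
```

Printing the two vectors side by side shows that the level codes bear no relation to the scores. Another option, if the columns never need to be factors, is to ask for character columns up front by passing stringsAsFactors = FALSE to readHTMLTable, which leaves a single as.integer() call.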

I wrote this code about a week ago and already I can see that I don’t like it. For one, I try to avoid loops in R unless absolutely necessary. Often I’ll start out with one just to get going, but I usually find that it can be replaced with one of the apply functions or something similarly succinct. Two, I need to better understand the behavior of the readHTMLTable function. I remember having gone a couple of rounds with the points data, which is read in as a factor. This leads to the extremely ugly bit of code where I convert it to a character and then to an integer. If anyone has a better way, I’m all ears. Three, I need to revisit the basic idea of extracting columns by name. Extraction by number is dangerous and confusing. Finally, I’d like to revise the data cleansing so that each game is listed with its home team, visitor and winner. That would make it easier to test whether or not a home field advantage exists.
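
As a rough sketch of the first and third points, the loop can be swapped for lapply() followed by a single rbind, and the columns can be picked by name. Everything here is illustrative: FetchSeason is a hypothetical stand-in that returns invented placeholder rows so the example runs offline, and the column names Season, PtsW and PtsL are assumptions, not necessarily the names the site's table uses.

```r
# Hypothetical stand-in for the per-season scrape. It returns a tiny
# placeholder data frame instead of calling readHTMLTable(); the values
# are invented for illustration and are not real game results.
FetchSeason = function(year)
{
  data.frame(Week = c("1", "2"),
             Winner = c("Team A", "Team B"),
             Loser = c("Team C", "Team D"),
             PtsW = c("21", "17"),
             PtsL = c("14", "10"),
             Season = year,
             stringsAsFactors = FALSE)
}

GetGamesHistory2 = function(FirstYear = 1985, LastYear = 2011)
{
  # lapply() returns a list with one data frame per season
  seasons = lapply(FirstYear:LastYear, FetchSeason)

  # Stack all seasons in one call instead of growing a data frame in a loop
  dfAllSeasons = do.call(rbind, seasons)

  # Select columns by name rather than by position (names are assumed)
  keep = c("Season", "Week", "Winner", "Loser", "PtsW", "PtsL")
  dfAllSeasons = dfAllSeasons[, keep]

  # Columns are character here, so a single as.integer() is enough
  dfAllSeasons$PtsW = as.integer(dfAllSeasons$PtsW)
  dfAllSeasons$PtsL = as.integer(dfAllSeasons$PtsL)
  dfAllSeasons$ScoreDifference = dfAllSeasons$PtsW - dfAllSeasons$PtsL

  dfAllSeasons
}

dfGames = GetGamesHistory2(2009, 2011)
```

Selecting by name means the function fails loudly if the site renames or drops a column, where selecting by position would silently pick up the wrong one.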

All that said, the code works and gives me piles of data. How I look at it will be the subject of the next post.

