Using R to Analyze Baseball Games in “Real Time”
This post originally appeared on my WordPress blog on October 4, 2009. I present it here in its original form.
In order to honor the last day of the 2009 MLB regular season (excepting the Twins/Tigers tiebreaker Tuesday night), I was reading a book that combines a few of my favorite things: statistics, R, and baseball. The book, by Joseph Adler, is called Baseball Hacks, and I highly recommend it if you are interested in analyzing baseball data. Joseph uses Excel for some tips, R for others, and shows you how to download historical and current baseball data for further analysis. One tip that the book offered was a way to download “real time” baseball data from MLB’s site in XML format. I decided to try to write some R functions to retrieve, summarize, and analyze what was available.
Where are the data?
Joseph shows how, at least at the time of the writing of his book and this post, you can go here to download a wealth of XML data from past and current seasons. If you drill down far enough into the directories, you can find a file called miniscoreboard.xml, which is the one I use for this analysis.
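As a rough sketch of how a date maps onto that directory layout: the base URL and the year_/month_/day_ scheme below are taken from the functions in this post, while the helper name `scoreboardUrl` is hypothetical. The server may no longer serve these paths, so treat this as an illustration of the layout rather than a working endpoint.

```r
## Hypothetical helper: build the miniscoreboard.xml URL for one date.
## The base URL and directory scheme come from the boxscore() function
## in this post; they may have changed since 2009.
scoreboardUrl <- function(date = Sys.Date()) {
  base <- "http://gd2.mlb.com/components/game/mlb/"
  ## format() fills in %Y, %m, %d and leaves the literal text untouched
  paste0(base, format(date, "year_%Y/month_%m/day_%d/"), "miniscoreboard.xml")
}

scoreboardUrl(as.Date("2009-10-04"))
```

Keeping the URL construction in one small function makes it easy to adjust if the directory scheme differs between seasons.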
The R functions
Here are the R functions I wrote. You can copy and paste them into your R session so that they are available to you. The next section will describe how to use them. Writing these was fairly straightforward, and simply a matter of XML manipulation. I admit that there may be far better ways to do this manipulation using the XML package, but this worked for now.
################################################################################
# Program Name: xml-mlb-gameday.R
# Author: Erik
# Created: 10/04/2009
#
# Last saved
# Time-stamp: <2009-10-04 17:23:02 erik>
#
# Purpose: show current scoreboard in R
#
# ** Generated by auto-insert on 10/04/2009 at 13:25:58**
################################################################################

## need XML package, may need to install w/ install.packages()
library(XML)

## create a boxscore object from an XML description of a game
createBoxScore <- function(x) {
  ## anything not in progress is labeled "Final"; otherwise report the half inning
  status <- if(x$.attrs["status"] != "In Progress")
    "Final"
  else if(x$.attrs["top_inning"] == "Y")
    "Top"
  else
    "Bot"

  bs <- list(status = status,
             inning = as.numeric(x$.attrs["inning"]),
             away.team = x$.attrs["away_name_abbrev"],
             away.runs = as.numeric(x$.attrs["away_team_runs"]),
             away.hits = as.numeric(x$.attrs["away_team_hits"]),
             away.errors = as.numeric(x$.attrs["away_team_errors"]),
             home.team = x$.attrs["home_name_abbrev"],
             home.runs = as.numeric(x$.attrs["home_team_runs"]),
             home.hits = as.numeric(x$.attrs["home_team_hits"]),
             home.errors = as.numeric(x$.attrs["home_team_errors"]))

  class(bs) <- "boxscore"
  bs
}

## print the boxscore object in traditional format
print.boxscore <- function(x, ...) {
  ## header padding lines R/H/E up with the columns printed below
  cat("     ", "R  ", "H ", "E (",
      x$status, " ",
      x$inning, ")\n",
      format(x$away.team, width = 3), " ",
      format(x$away.runs, width = 2), " ",
      format(x$away.hits, width = 2), " ",
      x$away.errors, "\n",
      format(x$home.team, width = 3), " ",
      format(x$home.runs, width = 2), " ",
      format(x$home.hits, width = 2), " ",
      x$home.errors, "\n\n", sep = "")
}

## utility function: convert a boxscore to a one-row data.frame
as.data.frame.boxscore <- function(x, row.names, optional, ...) {
  class(x) <- "list"
  as.data.frame(x)
}

## This is the "user accessible" public function you should be calling!
## downloads the XML data, and prints out boxscores for games on "date"
boxscore <- function(date = Sys.Date()) {
  if(date > Sys.Date())
    stop("Cannot retrieve scores from the future.")

  ## build the dated directory path, e.g. year_2009/month_10/day_04/
  year <- paste("year_", format(date, "%Y"), "/", sep = "")
  month <- paste("month_", format(date, "%m"), "/", sep = "")
  day <- paste("day_", format(date, "%d"), "/", sep = "")

  xmlFile <- paste("http://gd2.mlb.com/components/game/mlb/",
                   year, month, day, "miniscoreboard.xml", sep = "")
  xmlTree <- xmlTreeParse(xmlFile, useInternalNodes = TRUE)

  ## one XML node per game, each converted to a list and then a boxscore
  xp <- xpathApply(xmlTree, "//game")
  xmlList <- lapply(xp, xmlToList)

  bs.list <- lapply(xmlList, createBoxScore)
  names(bs.list) <- paste(sapply(bs.list, "[[", "away.team"),
                          "@",
                          sapply(bs.list, "[[", "home.team"))
  bs.list
}
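One possible alternative to the xmlToList() route, sketched on an inline sample so that it runs without a download: read each game node's attributes with xmlAttrs() and stack them into a data.frame. The attribute names match those read by createBoxScore(), and the scores are taken from two games in the output shown later in this post. The enclosing `<games>` element, the exact status strings, and the assumption that every game carries the same attributes in the same order are guesses about the real file.

```r
library(XML)

## Illustrative sample only: the <games> wrapper and status values are assumptions.
xmlText <- '<games>
  <game status="Final" inning="9" top_inning="N"
        away_name_abbrev="CWS" away_team_runs="3" away_team_hits="7" away_team_errors="0"
        home_name_abbrev="DET" home_team_runs="5" home_team_hits="12" home_team_errors="0"/>
  <game status="In Progress" inning="8" top_inning="Y"
        away_name_abbrev="COL" away_team_runs="1" away_team_hits="4" away_team_errors="1"
        home_name_abbrev="LAD" home_team_runs="5" home_team_hits="12" home_team_errors="0"/>
</games>'

doc <- xmlParse(xmlText, asText = TRUE)

## one named character vector of attributes per <game> node
attrs <- lapply(getNodeSet(doc, "//game"), xmlAttrs)

## stack into a data.frame: one row per game, one column per attribute
## (rbind matches by position, so this assumes a consistent attribute order)
games <- as.data.frame(do.call(rbind, attrs), stringsAsFactors = FALSE)

## attributes arrive as text; convert the numeric ones
numCols <- c("inning",
             "away_team_runs", "away_team_hits", "away_team_errors",
             "home_team_runs", "home_team_hits", "home_team_errors")
games[numCols] <- lapply(games[numCols], as.numeric)

games[, c("status", "away_name_abbrev", "away_team_runs",
          "home_name_abbrev", "home_team_runs")]
```

This goes straight to a data.frame and skips the boxscore class, which may be convenient when the goal is analysis rather than printing.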
Examples of summarizing real-time baseball data
Here is how to run some simple analyses on baseball games happening right now. This is the real value of downloading the data through R. You could, of course, just go to your favorite sports site to see how your team is doing, but pulling the data into R lets you analyze it further and even combine it with other data sources (e.g., weather).
> ## print boxscores for games happening NOW!
> boxscore()
$`CWS @ DET`
     R  H E (Final 9)
CWS  3  7 0
DET  5 12 0

$`HOU @ NYM`
     R  H E (Final 9)
HOU  0  4 1
NYM  4  9 0

$`PIT @ CIN`
     R  H E (Final 9)
PIT  0 10 0
CIN  6 11 0

$`WSH @ ATL`
     R  H E (Final 15)
WSH  2 13 0
ATL  1 13 0

$`CLE @ BOS`
     R  H E (Final 9)
CLE  7  8 0
BOS 12 11 0

$`FLA @ PHI`
     R  H E (Final 10)
FLA  6 11 1
PHI  7 12 0

$`TOR @ BAL`
     R  H E (Final 11)
TOR  4  9 2
BAL  5  8 0

$`NYY @ TB`
     R  H E (Final 9)
NYY 10 12 0
TB   2  7 2

$`KC @ MIN`
     R  H E (Final 9)
KC   4 12 0
MIN 13 11 0

$`MIL @ STL`
     R  H E (Final 10)
MIL  9 15 2
STL  7  7 0

$`ARI @ CHC`
     R  H E (Final 9)
ARI  5  8 0
CHC  2  6 0

$`LAA @ OAK`
     R  H E (Final 9)
LAA  5  9 1
OAK  3 12 1

$`SF @ SD`
     R  H E (Bot 9)
SF   3 11 1
SD   3  4 0

$`COL @ LAD`
     R  H E (Top 8)
COL  1  4 1
LAD  5 12 0

$`TEX @ SEA`
     R  H E (Final 9)
TEX  3  4 0
SEA  4  8 1

> ## print boxscores for a different day's games
> boxscore(date = as.Date("2009-10-01"))
$`STL @ CIN`
     R  H E (Final 9)
STL 13 15 1
CIN  0  5 0

$`MIN @ DET`
     R  H E (Final 9)
MIN  8 13 4
DET  3  7 1

$`MIL @ COL`
     R  H E (Final 9)
MIL  2  6 0
COL  9 14 1

$`ARI @ SF`
     R  H E (Final 9)
ARI  3  6 1
SF   7 11 0

$`TEX @ LAA`
     R  H E (Final 9)
TEX 11 15 1
LAA  3  7 2

$`WSH @ ATL`
     R  H E (Final 9)
WSH  2  7 0
ATL  1  6 0

$`HOU @ PHI`
     R  H E (Final 9)
HOU  5 10 0
PHI  3 13 1

$`BAL @ TB`
     R  H E (Final 9)
BAL  3  7 0
TB   2  5 1

$`CLE @ BOS`
     R  H E (Final 9)
CLE  0  3 0
BOS  3 12 0

$`PIT @ CHC`
     R  H E (Final NA)
PIT NA NA NA
CHC NA NA NA

$`OAK @ SEA`
     R  H E (Final 9)
OAK  2  7 1
SEA  4  8 0

> ## save the boxscores for further analysis
> bs <- boxscore()
> ## convert to a more useful form, a data.frame
> ## with one game per row
> bs.df <- do.call(rbind, lapply(bs, as.data.frame))
> ## status of today's games
> table(bs.df$status)

Final   Bot   Top
   13     1     1
> ## how many innings have been played today?
> sum(bs.df$inning, na.rm = TRUE)
[1] 144
> ## how many runs have been scored by the home teams today?
> sum(bs.df$home.runs, na.rm = TRUE)
[1] 79
> ## how many runs have been scored by the away teams today?
> sum(bs.df$away.runs, na.rm = TRUE)
[1] 62
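Once the games are in a data.frame, per-game quantities follow from ordinary column arithmetic. The sketch below rebuilds a small data.frame by hand, using the column names that as.data.frame.boxscore() produces and the scores of the first four games printed above, so that it runs without a download. The `margin` and `winner` columns are additions for illustration, not part of the original functions.

```r
## Four rows copied from the first four games in the output above
bs.df <- data.frame(
  status    = c("Final", "Final", "Final", "Final"),
  away.team = c("CWS", "HOU", "PIT", "WSH"),
  away.runs = c(3, 0, 0, 2),
  home.team = c("DET", "NYM", "CIN", "ATL"),
  home.runs = c(5, 4, 6, 1),
  stringsAsFactors = FALSE
)

## run margin from the home team's point of view
bs.df$margin <- bs.df$home.runs - bs.df$away.runs

## winner per game; NA when the score is missing or tied
bs.df$winner <- ifelse(bs.df$margin > 0, bs.df$home.team,
                ifelse(bs.df$margin < 0, bs.df$away.team, NA))

## share of completed games won by the home team
final <- bs.df$status == "Final" & !is.na(bs.df$margin)
homeWinShare <- mean(bs.df$margin[final] > 0)

bs.df[, c("away.team", "home.team", "margin", "winner")]
homeWinShare
```

Games with missing scores, such as the PIT @ CHC entry above, would get an NA margin and drop out of the home-win share.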
Conclusion
These functions are far from robust, and I think they only work for the current year (i.e., 2009; dates from 2008 were not working right). The format looks like it has changed over time, which is not surprising. I only use a very small subset of the available data; even the miniscoreboard.xml file contains far more information than I summarize here. This is really the first time I have dealt with XML data, so I am sure there is a lot more that can be done, but for a one-day project, I think the results are pretty interesting. I will definitely provide the updates I make to these functions, and may even start a baseball R package if they grow extensive enough. I suppose this is a project I can work on in the off season!
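One small step toward robustness, sketched under assumptions: wrap the fetch in tryCatch() so that a missing or differently formatted file yields NULL and a warning instead of stopping the session. `safeBoxscore` and `alwaysFails` are hypothetical names, and `fetch` stands for any function of a date, such as boxscore() from this post.

```r
## Hypothetical wrapper: call a scoreboard fetcher, return NULL on any error.
## `fetch` is any function taking a date, e.g. boxscore() from this post.
safeBoxscore <- function(date, fetch) {
  tryCatch(
    fetch(date),
    error = function(e) {
      warning("No scoreboard for ", format(date), ": ", conditionMessage(e),
              call. = FALSE)
      NULL
    }
  )
}

## Stand-in fetcher that always fails, to show the fallback without a download
alwaysFails <- function(date) stop("file not found")

result <- suppressWarnings(safeBoxscore(as.Date("2008-10-01"), alwaysFails))
is.null(result)
```

With a wrapper like this, a loop over many dates can skip the days that fail and keep the rest.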