MLB Baseball Pitching Matchups ~ manipulating pitch f/x data using the RMySQL package in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
After downloading some pitch f/x data using my R script, we can finally have some fun.
But because the pitch f/x data is very elaborate, R can easily get overwhelmed by copying the dataset back and forth in memory, as you manipulate the data. So the natural progression is to use relational database systems.
Here, I show a simple example of using the RMySQL package to plot the pitch types of Tim Lincecum’s 2010 season, extracted from the source pitch f/x data from MLB Gameday. So first, I downloaded the 2010 data using the following script (with a weird error, I was only able to download till 5/4/10, but I can easily concatenate the dataset by running the script again for later dates):
DownloadPitchFX(fileloc = "./pitch2010Apr04May04.txt", start.date = "2010-04-04", end.date = "2010-05-04")
Then, I start MySQL from the BASH shell to create a database called “pitchfx” within the MySQL shell (this on Mac OS 10.6, not really familiar w/ Windows).
bash-3.2$ mysql -u root mysql> create database pitchfx;
Then we connect to the MySQL database (“pitchfx”) from R from root. From there, I can write the text file to the MySQL database by specifying a table (“2010Apr04May04″) and the file location (“./pitch2010Apr04May04.txt”). It’s important to know dbWriteTable() can include certain input arguments from read.table() so you can specify how the data will be translated into a MySQL database.
library(RMySQL) con <- dbConnect(dbDriver("MySQL"), dbname = "pitchfx", user = "root") dbWriteTable(con, "2010Apr04May04", "./pitch2010Apr04May04.txt", overwrite = F, sep = "t", head = T)
We can get the pitcher code for Tim Lincecum by looking online to then call a SQL query to get all the pitch info from Lincecum (hopefully I will have script to get pitcher/batter codes shortly).
pitcher.Lince <- dbGetQuery(con, "SELECT * FROM 2010Apr04May04 WHERE pitcher = "453311"") > head(pitcher.Lince) Year Month Day Inning Home Away b s o batter 1 2010 4 5 1 Houston Astros San Francisco Giants 0 0 1 456422 2 2010 4 5 1 Houston Astros San Francisco Giants 2 3 2 430565 3 2010 4 5 1 Houston Astros San Francisco Giants 2 3 2 430565 4 2010 4 5 1 Houston Astros San Francisco Giants 2 3 2 430565 5 2010 4 5 1 Houston Astros San Francisco Giants 2 3 2 430565 6 2010 4 5 1 Houston Astros San Francisco Giants 2 3 2 430565 pitcher stand p_throws event des type x y 1 453311 L R Groundout In play, out(s) X 115.88 138.15 2 453311 L R Strikeout Ball B 142.49 120.88 3 453311 L R Strikeout Called Strike S 121.89 156.29 4 453311 L R Strikeout Ball B 144.21 121.75 5 453311 L R Strikeout Swinging Strike S 136.48 148.51 6 453311 L R Strikeout Swinging Strike (Blocked) S 84.98 188.23 start_speed end_speed sz_top sz_bot pfx_x pfx_z px pz x0 y0 1 93.1 84.2 3.2 1.5 -7.77 6.53 -0.601 2.845 -0.913 50.0 2 92.2 83.1 3.06 1.38 -5.39 11.07 -1.235 3.635 -0.738 50.0 3 93.3 84.8 2.94 1.38 -1.83 10.63 -0.674 1.873 -1.006 50.0 4 75.6 68.4 2.95 1.27 5.77 -10.3 -1.347 3.501 -0.605 50.0 5 93.6 85.1 3.26 1.63 -5.67 9.8 -1.206 2.215 -0.823 50.0 6 86.2 79.8 3.26 1.63 -5.05 0.56 0.444 0.739 -0.861 50.0 z0 vx0 vy0 vz0 ax ay az break_y break_angle 1 6.056 3.516 -136.487 -4.813 -14.322 36.387 -20.055 23.7 32.2 2 6.084 0.531 -135.161 -4.183 -9.735 35.878 -12.1 23.7 36.0 3 6.014 1.522 -136.494 -8.778 -3.397 34.418 -12.366 23.7 11.7 4 6.394 -3.209 -110.727 4.07 6.939 25.572 -44.48 23.7 -8.2 5 5.974 0.946 -137.032 -7.542 -10.617 34.766 -13.754 23.7 33.6 6 5.784 4.875 -126.238 -6.276 -8.113 26.87 -31.195 23.8 12.5 break_length pitch_type type_confidence 1 5.6 FT .920 2 3.7 FF .625 3 3.3 FF .909 4 15.0 CU .905 5 4.1 FT .860 6 8.4 CH .902
It’s somewhat of a pain to go through extracting info from the data frame, but here it is:
# now separate out the dates starts.Lince <- paste(pitcher.Lince$Month, pitcher.Lince$Day) starts.Lince <- factor(starts.Lince, levels = unique(starts.Lince)) # grab number of pitches wrt type wrt start pitch_type.Lince <- by(pitcher.Lince, starts.Lince, function(x) by(x, x$pitch_type, function(x) length(x$pitch_type))) # construct zero matrix with proper dims, pitch type rows, dates columns p1 <- matrix(0, nrow = length(unique(unlist(lapply(pitch_type.Lince, names)))), ncol = length(pitch_type.Lince), dimnames = list(unique(unlist(lapply(pitch_type.Lince, names))), names(pitch_type.Lince))) # count number of pitches thrown wrt pitch type for each start p1.num <- lapply(pitch_type.Lince, function(x) as.matrix(x, colnames = names(x))) # input uneven named list into matrix for (i in 1:length(p1.num)) p1[row.names(p1.num[[i]]), i] <- p1.num[[i]] # graph it! barplot(apply(p1, 2, function(x) x / sum(x)), legend = rownames(p1), main = "Tim Lincecum 2010 Pitch Type Distribution")
Where “FF” is the four seam fastball, “FT” the two seam fastball and “FA” when Gameday can’t figure out which.
Filed under: Baseball, R, RMySQL
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.