sab-R-metrics: Intermediate Scatter Plots
[This article was first published on The Prince of Slides, and kindly contributed to R-bloggers].
First off, I’ll say it’s been a whirlwind of a past few days. Thanks to David Smith at the Revolutions Blog for his kind words about the sab-R-metrics series and for the link back this way. Between that and Ed Kupfer’s posts at the APBRmetrics board, Harry Pavlidis at THT, Dave Allen at Fangraphs, and about 30 Twitterers, I’ve seen a serious increase in site traffic. I’ve gotten a lot of great feedback on the blog and through email, and I appreciate everyone who reads this.
Last time, I left you with some code for creating box plots and histograms using the Shaun Marcum Pitch F/X data. For this post, I’ll be using the same data, subset to include only those pitches with speed/location information available. For those who missed the last 5 posts or need to go get the data, the link below will take you to all of them:
sab-R-metrics Series
I’ll try to use only functions and commands that were used in previous posts, so if you’re not sure about something here, check the previous posts out. If you can’t find it, feel free to comment or shoot me an email.
Go ahead and open up your data file and subset the data like the following (of course, calling from your OWN directory):
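#set working directory, load data, and subset data
setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")
#stringsAsFactors=TRUE keeps pitch_type a factor (no longer the default as of R 4.0.0)
marcum <- read.csv(file="marcum10.csv", header=TRUE, stringsAsFactors=TRUE)
head(marcum)
#keep only pitches with speed information and save the result as marcfx
marcfx <- subset(marcum, start_speed > 0)
head(marcfx)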
The above should be exactly what we worked with last time. However, for today’s purposes, I am going to clean the data up just a little bit more to only include Marcum’s 5 main classified pitches: Four-Seam, Cutter, Change-Up, Curveball and Two-Seam. This will make it a little easier to work with color in the plots later on. I’m going to just name this “shaun” to ensure there isn’t any re-assigning of objects in R. To do this, you can use the following rough code (an extension from the ‘sub-setting’ tutorial):
#cleaning up the data
fourf <- subset(marcfx, marcfx$pitch_type=="FF") #use 'fourf' because ff is a function in R
fc <- subset(marcfx, marcfx$pitch_type=="FC")
ft <- subset(marcfx, marcfx$pitch_type=="FT")
cu <- subset(marcfx, marcfx$pitch_type=="CU")
ch <- subset(marcfx, marcfx$pitch_type=="CH")
shaun_b <- as.data.frame(rbind(fourf, fc, ft, cu, ch))
head(shaun_b)
However, there is a slightly quicker way to do this (which I just discovered, believe it or not!). I’ve always had trouble with “or” statements in R. For “and” statements, you can just use “&” in many cases. However, the solution for “or” isn’t as straightforward. I finally found a solution for myself here (one of the reasons I wanted to do these tutorials in the first place: so I can focus on learning some basic functions I may have missed in my intro stats with R class). I use “%in%” to tell R that I want to grab only those rows for which the pitch type is one of those listed above:
#one-line way to subset for these conditions
shaun <- marcfx[marcfx$pitch_type %in% c("FF", "FC", "FT", "CH", "CU"),]
ADDENDUM:
#another way to do 'or'
shaun_c <- subset(marcfx, marcfx$pitch_type=="CH" | marcfx$pitch_type=="CU" | marcfx$pitch_type=="FF" | marcfx$pitch_type=="FC" | marcfx$pitch_type=="FT")
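To convince yourself that the two “or” approaches agree, here is a minimal sketch on a toy data frame. The data frame “toy” and its values are made up for illustration and only stand in for marcfx:

```r
# Toy data frame standing in for marcfx (values are made up)
toy <- data.frame(
  pitch_type  = c("FF", "SL", "CH", "FC", "PO", "CU", "FT"),
  start_speed = c(87.1, 80.2, 79.5, 85.0, 84.3, 73.8, 86.4)
)

keep <- c("FF", "FC", "FT", "CH", "CU")

# Approach 1: %in% tests membership in a vector of allowed values
by_in <- toy[toy$pitch_type %in% keep, ]

# Approach 2: chain equality tests with | (element-wise "or")
by_or <- subset(toy, pitch_type == "CH" | pitch_type == "CU" |
                     pitch_type == "FF" | pitch_type == "FC" |
                     pitch_type == "FT")

# Compare the rows kept by each approach
all.equal(by_in, by_or)
nrow(by_in)
```

The “%in%” version scales better: adding a sixth pitch type means adding one string to “keep” rather than another comparison.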
Okay, now we’re all set. Let’s begin by reproducing the scatter plot we made for the Albert Pujols data, showing the ending speed on the y-axis and the starting speed on the x-axis. Don’t forget to include axis labels and a title.
#plotting end speed as a function of starting speed
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)")
As usual, the basic function is a little boring. Of course, the plot isn’t particularly useful either. It simply tells us that the faster the ball comes out of the pitcher’s hand, the faster it will be traveling when it crosses the plate. For now, that is okay with me. We see a nice linear relationship in the data. However, we can add a little more information to this plot by using color.
Sometimes you can use color to make things a little more exciting, but we need to be careful in this situation (thank you to David Smith for linking back using this horrid example of what NOT to do). Last time, I used the Brewers’ colors for boxplots and histograms to brighten things up, but it was also useful in another way that I failed to mention. How? Well, let’s say we have two box plots side-by-side. Plot titles should always be there, but if we’re comparing Josh Beckett (Red Sox) and Shaun Marcum (Brewers) we can make the discrepancy more apparent and signal who is who with color. By filling the boxes with red and blue, respectively, it’s easier for the reader to know which one is Beckett and which is Marcum, assuming they have some knowledge about team colors.
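As a rough sketch of that side-by-side idea, the code below fills two box plots with team colors. The speeds are simulated with rnorm() purely for illustration (they are NOT real Pitch F/X data for either pitcher), and the “speeds” data frame is a made-up stand-in:

```r
# Simulated speeds for illustration only; these are NOT real Pitch F/X data
set.seed(1)
speeds <- data.frame(
  pitcher     = factor(rep(c("Beckett", "Marcum"), each = 50)),
  start_speed = c(rnorm(50, mean = 92, sd = 2), rnorm(50, mean = 86, sd = 2))
)

# One box per pitcher, filled with a team color (red = Red Sox, blue = Brewers).
# Colors are matched to the factor levels in order: Beckett, then Marcum.
boxplot(start_speed ~ pitcher, data = speeds, col = c("red", "blue"),
        main = "Simulated Pitch Speed by Pitcher",
        xlab = "Pitcher", ylab = "Speed Out of Hand (mph)")
```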
For now, let’s continue with just making the points in our plot blue, and slowly get more advanced with the colors:
#blue scatter plot
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col="blue")
Unfortunately, a single blue plot on its own does not convey as much information as it would sitting next to a red Josh Beckett plot. If you don’t like the open circle points in the graph, you can change those with the “pch=” option. Fiddle around with different numbers if you’d like and see what you come up with. In addition, the points are a little small for my liking in the dimensions I have set up for my double-window png file, so I made them bigger using the “cex=” option. The default is 1, so setting this to a larger number will grow the size of your points. Below, I show larger versions of both filled circles and filled triangles:
#make a two-window scatter plot with different point types and larger sizes
par(mfrow=c(1,2))
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col="blue", pch=19, cex=2)
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col="blue", pch=17, cex=2)
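If you would rather not guess at “pch=” numbers, one option is to draw every standard symbol once and keep the picture around as a reference. This is just a sketch; the grid layout and the names “pch_values”, “x” and “y” are my own choices, not anything from the Marcum data:

```r
# Draw the 26 standard plotting symbols (pch 0 through 25) on a grid
pch_values <- 0:25
x <- pch_values %% 9          # column position
y <- 2 - pch_values %/% 9     # row position, top row first

# bg only affects the fillable symbols (pch 21 through 25)
plot(x, y, pch = pch_values, cex = 2, col = "blue", bg = "lightblue",
     xlim = c(-0.5, 8.5), ylim = c(-0.5, 2.7),
     axes = FALSE, xlab = "", ylab = "", main = "pch values 0 to 25")

# Label each symbol with its pch number
text(x, y + 0.35, labels = pch_values, cex = 0.8)
```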
Now let’s try to get a little more information out of our plots. One way to do this is to use the options in R to indicate which type of pitch lies where on our plot. We can do this in 3 simple ways: using color, using shapes, or using text. The first and second options go well together, while text on a plot is usually better suited to graphs that have fewer points (so that it stays readable). Here, I’ll just show what you can do with the “text()” function in visualizations. It is often useful for labeling points of interest as well. I will utilize an option in the “plot()” function that suppresses plotting the points on the graph: type="n". We can also include BOTH points and text, but with the current data that gets way too messy. All you would have to do is remove the type="n" option from the code below:
#draw text instead of points
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", type="n", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)")
text(shaun$start_speed, shaun$end_speed, shaun$pitch_type, col="blue")
As you can see, the text is a bit much here, but it might come in handy later on. We can at least see that fastballs (FF) are faster and change-ups and curves (CH/CU) are slower. It’s good to see that what we’ve done thus far at least makes sense! Now, let’s try indicating pitch types using color and/or point types (pch). Hopefully, this will help to maximize the information we can get out of this data visualization. Let’s begin with color.
For this we will again use the “col=” option we learned last time; however, it will get a little more complicated. Here, we’ll need to tell R to color code by pitch type. Luckily, if we tell R that “col=shaun$pitch_type” it will know what to do. Let’s try it below. Doing things this way results in a problem that I want you to try and think about before heading to the next paragraph…
#adding color by pitch type
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type)
We can see that the plot shows a different color for each of the 5 pitch types in our data set. Unfortunately, we don’t know which is which! For this, we have to understand what R is doing when we use this option for the colors.
When we assigned colors last time, we could have also used numbers rather than the names of the colors. For example, if we want everything to be red, we can use the command “col=2”. When a factor is used to assign colors, R uses its underlying integer codes, and those codes follow the alphabetical order of the factor levels. Our data set originally included 10 pitch types: “NA”, “CH”, “CU”, “FA”, “FC”, “FF”, “FT”, “IN”, “PO”, and “SL” (to see this, simply use the code: summary(shaun$pitch_type)). Therefore, R assigned these the numbers 1 through 10. Subsetting does not drop unused levels, so although we removed all but “CH”, “CU”, “FC”, “FF”, and “FT”, these keep their original numbers: 2, 3, 5, 6 and 7.
“CH” is the first of Marcum’s pitch types in alphabetical order, and is assigned to “2”, which is red. The 3 goes to “CU” (second in the alphabetical order of pitch types for Marcum), so these are green. This continues through the rest of the pitches that Marcum throws: FC=5 (cyan), FF=6 (magenta) and FT=7 (yellow) in R’s default palette. If we want to check this, we can also create a new numeric version of our pitch type variable with the following code:
#create numeric version of pitch type
shaun$p_type_b <- as.numeric(shaun$pitch_type)
head(shaun)
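The reason the codes skip numbers is that a factor keeps all of its original levels after sub-setting. Here is a small sketch with a toy factor (the vector “pitch” is made up, not the Marcum data) showing that behavior, along with “droplevels()” in case you would rather renumber from 1:

```r
# Toy factor; levels are sorted alphabetically: CH, CU, FA, FF
pitch <- factor(c("CH", "FA", "FF", "CU", "FA", "FF"))
levels(pitch)

# Drop the "FA" pitches; the level itself is still remembered
kept <- pitch[pitch %in% c("CH", "CU", "FF")]
levels(kept)
as.numeric(kept)        # codes still refer to the original four levels

# droplevels() removes unused levels, so the codes are renumbered
relabeled <- droplevels(kept)
levels(relabeled)
as.numeric(relabeled)
```

I stick with the original codes in this post, but dropping levels first would give colors 1 through 5 instead of 2, 3, 5, 6 and 7.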
Now that we know this, we need to figure out which number corresponds to which color in the R environment. I told you above because I had already snooped around. While I’ve found color keys for R online (showing the number and color name), they don’t seem to match up with what I’ve said above. I imagine there are some R-Bloggers out there with more knowledge about the best way to decipher your colors than I. The best way to figure out which is which from what we have here is to use color AND text in your plot. When you have lots and lots of pitch types, this is tougher. But here with 5, we can do it pretty easily below.
#use text and color so we know what is what
plot(shaun$end_speed ~ shaun$start_speed, type="n", main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type)
text(shaun$start_speed, shaun$end_speed, shaun$pitch_type, col=as.numeric(shaun$pitch_type))
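Another way to check the number-to-color mapping, without reading it off a plot, is base R’s “palette()” function, which returns the colors currently used for col=1, col=2, and so on. The sketch below looks up the five codes discussed above; the names “pal”, “codes” and “color_key” are my own, and the exact strings returned (color names or hex codes) depend on your R version:

```r
# The current palette: position i is the color drawn for col = i
pal <- palette()
pal

# Look up the colors for the codes used by Marcum's five pitch types
codes <- c(CH = 2, CU = 3, FC = 5, FF = 6, FT = 7)
color_key <- data.frame(pitch_type = names(codes),
                        code       = codes,
                        color      = pal[codes])
color_key
```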
Okay, now that we know what is what, and now that we know we don’t want to have all text on the plot, how about we use both color AND point shapes to indicate pitch types? Just like the colors, we can use the “shaun$pitch_type” vector to make different shapes on our plot. The code below will do this, with an important part of the graph missing:
#points and colors for pitch types
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))
The graph above is missing one thing: A Legend! Too often I see plots that don’t tell me what each point or color actually means. Since we’ve done our own snooping around into the color and shape assignments, let’s let the readers know what the hell we’re talking about. There are a few ways to include a legend, but I prefer using the “legend()” function. I know Dave Allen generally uses colored and stacked text which is also cool. You can try doing that on your own with the “text()” function.
With the “legend()” function, we first want to indicate where we’ll put the legend on our graph. The first two arguments of the function give its placement as x and y coordinates on the axes. This can be done like this:
“legend(x,y)”
where x and y are points on your axes. Of course, we’ll need to specify some other options for something to show up. After our x-y coordinates for the legend, we want to specify what the colors and point types are telling us about the pitch types.
Below, I have the code for creating a legend based on our data and the plot using pch and col as in the other plotting functions that we have talked about. First, you need to plot your data, then use “legend()” to add it on just like we did with the “text()” function above.
#make a plot and add a legend for color and point types
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))
legend(68, 84, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7))
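Typing the legend numbers by hand works, but it is easy to mismatch them. As a sketch of an alternative, the code below derives the legend entries from the same factor codes used in the plot, and places the legend with a keyword ("topleft") instead of x-y coordinates. The “toy” data frame and its speeds are made up for illustration, and the labels here are the raw pitch codes rather than the full names:

```r
# Toy stand-in for the pitch data (speeds are made up for illustration)
toy <- data.frame(
  start_speed = c(87, 86, 85, 80, 74, 88, 79, 75),
  end_speed   = c(80, 79, 78, 74, 68, 81, 73, 69),
  pitch_type  = factor(c("FF", "FT", "FC", "CH", "CU", "FF", "CH", "CU"))
)

codes <- as.numeric(toy$pitch_type)   # integer code for each pitch
lev   <- levels(toy$pitch_type)       # level names, in code order
used  <- sort(unique(codes))          # only the codes that actually appear

plot(end_speed ~ start_speed, data = toy, col = codes, pch = codes + 3,
     main = "Toy Pitch Speed",
     xlab = "Speed Out of Hand (mph)", ylab = "Speed Crossing Plate (mph)")

# Legend labels, colors and symbols all come from the same codes as the plot
legend("topleft", legend = lev[used], col = used, pch = used + 3)
```

Because “used” only contains codes that appear in the data, this also copes with factors that carry unused levels, as ours does.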
Unfortunately, this post has gotten rather long, and I’d prefer not to put too much into a single post. I’d recommend playing with the different colors and point types in R to find what you like best. Since this is a long post, I’ll save points, lines (and line/time plots), shapes, and custom axes for next time. At that point, maybe we can start getting into some basic statistics and smoothing for our visualizations. The code from today is posted below:
################################
########Marcum Scatterplots and Shapes and Lines and Stuff
################################
#setting directory and opening Shaun Marcum 2010 Pitch F/X data
setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")
marcum