[This article was first published on The Prince of Slides, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
While I had mentioned in my last post that I will cover logistic regression in my next post, I decided that a quick interlude in working with bubble plots would be fun. Bubble plots have become pretty popular recently, especially with all of the Visualization Challenges I’ve seen around the internet (by the way, I think people in the sabermetric world have a great chance to win some of these, despite the fact that they’re generally not baseball data).Ultimately , bubble plots are a good way to present a third dimension on a graph.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Today, I’ll talk about doing some basic bubble plots using some Red Sox and Yankees data on attendance and wins over time (click here for the “soxyanks.csv” data link). If you remember my quick tutorial on plotting time series data, I showed how to track wins and attendance over time. However, we often want to include the most information possible on our plots, and that often means presenting a third (or fourth) variable. This makes the 2-dimensional world of plotting more challenging, and that is where bubbles come in (Side Note: It is also why heat maps are so extensively used for Pitch F/X data!).
Okay, so what do the bubbles tell us? Generally, the size of the bubble is meant to represent that third dimension. For wins and attendance over time, it’s not straight forward to track these on the same plot. You could normalize them so that they’re on the same scale and then plot them together, but this is a difficult comparison over time when something like attendance is growing. Of course, this is a common time series issue that I’m not going to get into on this site in which you could take a first difference approach or some other more complicated model, do some smoothing, go into the frequency domain, and so on. But you don’t want to hear about unit roots, random walks, and the like. You’re here for baseball and fun…right?. If you normalize the two variables–just using standard z-scores–you’ll end up with something like this:
##set working directory and load data
setwd(“c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics”)
##load data
ball < - read.csv(file="soxyanks.csv", h=T)
head(ball)
##attendance time plot
plot(ball$yank.att ~ ball$year, xlab=”Year”, ylab=”Yankees vs. Red Sox Attendance”,main=”Average Attendance Per Game”, col=”darkblue”, type=”l”, lwd=3)
lines(ball$bos.att ~ ball$year, col=”darkred”, lwd=3)
legend(1900, 54000, legend=c(“Yankee”, “Red Sox”), fill=c(“darkblue”, “darkred”))
Pretty easy to see the general trend in attendance over time, with the usual spikes. However, this doesn’t give us much information about the wins of each team over time. We could make a separate plot to compare wins over time for each team. Or, we can represent this new dimension using bubbles at each time point, where the size of the bubble represents the winning percentage of each team in each year.
There are a number of ways to do this in R, and I’ll begin with a simple one: simply using the command “cex=” to indicate point size based on some variable. There are some shortcomings with this method, but I’ll talk about that later. Beginning with just the Yankees, let’s plot some points in addition to our lines (keep in mind this is a starter point–this plot will be ugly):
##plot yankees attendance and wins using “cex=”
plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col=”darkgrey”, main=”Yankees Wins & Attendance Over Time”, xlab=”Year”, ylab=”Average Game Attendance”)
lines(yank.att ~ year, data=ball, lwd=2, col=”darkblue”)
legend(1900, 54000, legend=c(“.250 W%”, “.350 W%”, “.450 W%”, “.550 W%”, “.650 W%”, “.750 W%”), col=”darkgrey”, pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=c(.6,.7,.8,1,1.25,1.5), bty=”n”)
The legend in the above plot is a bit complicated and is unfortunately the best I can do with this code. Later in this post, I’ll show another way to do these based on some code in this tutorial. Honestly, I think my legend is a bit ugly and I’m pretty sure that the “ggplot2” package has a better way. Also notice that I use a polynomial to scale the bubbles. Normally, I wouldn’t recommend doing this; however, because of the small range of win percents, this tends to give more useful size ranges for plotting. If you want to do a simple linear transformation, you can multiply the win percents by a constant instead…or use wins (which is problematic since teams have not played the same number of games for the entire time period). The reason this can become a problem is that we want the bubbles to have proportional area based on the win percent. I’ll talk about this in a few paragraphs below, but will first talk about some color issues.
Unfortunately, the bubbles all mesh together in the plot. And I’ll use this portion of the tutorial as a lesson in the RGB color scale in R, along with how to work with transparent colors. The RGB scale stands for Red-Green-Blue. It’s just like that guy with the insanely deep voice talking about the new Sharp televisions (except they add yellow). So, while we can use the names of colors (and just general numbers for colors), we can also use the RGB scale to make our own colors.
I’ll just start with a simple way to work with the color scale. When using this scale, you will need to input an 8-digit number in the form of:
col=”#00000000″
The first two digits will tell how much Red to put into the color (on a 00 to 99 scale). The second two digits do the same for Green, and the third pair of digits do this for Blue. Finally, the last pair of numbers will tell R how transparent you want your color to be. For lots of transparency, you set this number low. For less transparency, you set it high. We can use this to our advantage in the bubble plots so that we can see the outline of each bubble if they overlap. So let’s rework the Yankee plot above, but make some transparent colors:
##now do Yankee plot with transparent colors
plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col=”#99999950″, main=”Yankees Wins & Attendance Over Time”, xlab=”Year”, ylab=”Average Game Attendance”)
lines(yank.att ~ year, data=ball, lwd=2, col=”darkblue”)
legend(1900, 54000, legend=c(“.250 W%”, “.350 W%”, “.450 W%”, “.550 W%”, “.650 W%”, “.750 W%”), col=”#99999950″, pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=c(.6,.7,.8,1,1.25,1.5), bty=”n”)
This looks a little better, as you can see the outline of each bubble in the overlapping portions with other bubbles. You can see that the Yankees had a rough decade in the 1970’s in both attendance and winning. Their attendance seemed to drop below what a normal trend would suggest in these years, and there seems to be a good chance that this was due to their sub-par on-field performance (remember, we’re just speculating here). By this, you can see some advantage to including bubbles for this type of data.
Now, let’s go ahead and add the Red Sox data to this plot. I altered the key just a little bit, but still not to my liking:
##have both Yankees and Red Sox on same plot
plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col=”#99999950″, main=”Yankees vs. Red Sox Wins and Attendance Over Time”, xlab=”Year”, ylab=”Average Game Attendance”)
points(bos.att ~ year, data=ball, pch=16, cex=20*bos.win^3, col=”#99000050″)
lines(yank.att ~ year, lwd=2, col=”darkblue”, data=ball)
lines(bos.att ~ year, lwd=2, col=”darkred”, data=ball)
legend(1900, 54000, legend=c(“.250 W%”, “.350 W%”, “.450 W%”, “.550 W%”, “.650 W%”, “.750 W%”), col=”#99999950″, pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=2, bty=”n”)
Here we can see the demise of the Red Sox in the 1920’s, as their performance was so bad we can barely see their win bubbles. Red Sox attendance was low at those points, and we see this happen again not long after the WWII attendance bump. Then, when the Yankees start sucking in the 70’s, we see the Red Sox attendance rebound a bit as the team improves a little. See how the bubbles help us to tell a story over time.
It’s always important to think about the shortcomings of these plots. Obviously, the bubbles are not growing in a linear fashion, and this can be misleading in some cases. In addition, things are a bit crowded. That’s not even mentioning that some bubbles tend to be too small, while others are too large. These aren’t the prettiest plots in the world, but they’re a decent start. I encourage you to try out different data and ways of working with the bubbles on your own.
So, let’s switch gears now to some other types of data along with another method of creating bubble plots.
Perhaps we’re interested in team home runs, stolen bases, and walks on the same plot. In other words, let’s see which teams are more like Adam Dunn and which are more like Juan Pierre First, go ahead and load in the “teamsdata.csv” file from a previous tutorial.
##read in new data
teams < - read.csv(file="teamsdad.csv", h=T) head(teams)
For this portion of the tutorial, I’ll be using the “symbols()” function, which plots shapes with borders in a plot. Asthetically, these are prettier. But we’ll have to think about a few things before we begin to plot. I am going to take these explanations directly from this fantastic tutorial.
The first thing we’ll have to think about is “how are the sizes of the shapes determined”. By using “symbols()” to draw circles, we are creating shapes using the radius. But we may want the area of the circle to represent the third variable, rather than radius. Additionally, we’ll want to think about which variables we want on the axes, and which on should be used for the size of the bubbles. Sometimes this requires some playing around in R until you get your favorite visual. I go ahead and use stolen bases as the sizer for the bubbles and convert my sizes to area because it has a large range and bubble sizes vary nicely across teams.
##make use of area instead of radius for sizing
teams$radius < - sqrt(teams$SB/pi) head(teams) ##try doing this a different way using team hitting data symbols(teams$HR[teams$Year==2010], teams$BB[teams$Year==2010], circles=teams$radius[teams$Year==2010], inches=0.5, fg="darkblue", bg="#99000070", main="Team Home Runs, Walks & Stolen Bases", xlab="Home Runs", ylab="Walks") text(teams$HR[teams$Year==2010], teams$BB[teams$Year==2010], teams$Tm[teams$Year==2010], cex=0.6, col="black")
Now, the above code needs some explanation. Obviously, I only use teams from 2010. Within the symbols function, we first type what we want on the x-axis, followed by the y-axis just as in standard, non-equation plot code format. Then we specify which symbol we want to use by the command “circles=“, followed by telling R to size the circles by the radius we created from our SB. Using this, the area of the circles will be equal to the number of stolen bases by the team.
The “inches=” argument simply tells R what the baseline size should be for the circles. Increasing it will make them larger, decreasing it will make them smaller. “fg=” and “bg=” tell R which colors we want the circles to be filled with and bordered with, respectively. I use the RGB scheme to make a transparent red color for filling the circles.
Unfortunately, the “pt.cex=” command in the “legend()” function does not size the points in the same way that symbols does. Similarly, I’m having some trouble creating a legend just plotting single circles in the top right of the plot (the scale is all off). If anyone has any suggestions, let me know. I’d love to hear it.
Qualitatively, though, we can see the outliers in the data. Tampa Bay doesn’t have a lot of power (though, a bit above average), but runs a lot and walks a lot. Toronto on the other hand doesn’t really steal, doesn’t really walk, but mashed a bunch of HR in 2010. The Mariners and Astros were relatively useless in the power and walk department.
As usual, I have the R-code for this post below:
############################# ################Sidetrack: Bubble Plots and Transparent Colors (RGB Scale) ############################# ##set working directory and load data setwd("c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics") ball <- read.csv(file="soxyanks.csv", h=T) head(ball) ##"boring plot" ball$yank.att.z <- (ball$yank.att - mean(ball$yank.att))/sd(ball$yank.att) ball$yank.win.z <- (ball$yank.win - mean(ball$yank.win))/sd(ball$yank.win) head(ball) png(file="boringplot.png", height=500, width=650) plot(ball$yank.att.z ~ ball$year, xlab="Year", ylab="Normalized Win Percent & Attendance", main="Yankee Wins and Attendance Across Time", col="darkblue", type="l", lwd=3, ylim=c(-3,3)) lines(ball$yank.win.z ~ ball$year, col="darkgray", lwd=3) legend(1900, 3, legend=c("Yankee Attendance", "Yankee Wins"), fill=c("darkblue", "darkgrey")) dev.off() ##attendance time plot png(file="simpletime.png", height=500, width=650) plot(ball$yank.att ~ ball$year, xlab="Year", ylab="Yankees vs. Red Sox Attendance", main="Average Attendance Per Game", col="darkblue", type="l", lwd=3) lines(ball$bos.att ~ ball$year, col="darkred", lwd=3) legend(1900, 54000, legend=c("Yankee", "Red Sox"), fill=c("darkblue", "darkred")) dev.off() ##yankees only bubble plot png(file="yanksonly.png", height=650, width=1000) plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col="darkgrey", main="Yankees Wins & Attendance Over Time", xlab="Year", ylab="Average Game Attendance") lines(yank.att ~ year, data=ball, lwd=2, col="darkblue") legend(1900, 54000, legend=c(".250 W%", ".350 W%", ".450 W%", ".550 W%", ".650 W%", ".750 W%"), col="darkgrey", pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=c(.6,.7,.8,1,1.25,1.5), bty="n") dev.off() ##now do Yankee plot with transparent colorspng(file="yankees.png", height=500, width=650) png(file="yankstransparent.png", height=650, width=1000) plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col="#99999950", main="Yankees Wins & Attendance Over Time", xlab="Year", ylab="Average Game Attendance") lines(yank.att ~ year, data=ball, lwd=2, col="darkblue") legend(1900, 54000, legend=c(".250 W%", ".350 W%", ".450 W%", ".550 W%", ".650 W%", ".750 W%"), col="#99999950", pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=c(.6,.7,.8,1,1.25,1.5), bty="n") dev.off() ##create bubble plot comparing yankees and red sox png(file="YanksSoxBubble.png", height=650, width=1000) plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col="#99999950", main="Yankees vs. Red Sox Wins and Attendance Over Time", xlab="Year", ylab="Average Game Attendance") points(bos.att ~ year, data=ball, pch=16, cex=20*bos.win^3, col="#99000050") lines(yank.att ~ year, lwd=2, col="darkblue", data=ball) lines(bos.att ~ year, lwd=2, col="darkred", data=ball) legend(1900, 54000, legend=c(".250 W%", ".350 W%", ".450 W%", ".550 W%", ".650 W%", ".750 W%"), col="#99999950", pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=1.6, bty="n") dev.off() #####Now use team hitting data ##make use of area instead of radius for sizing teams$radius <- sqrt(teams$SB/pi) head(teams) ##try doing this a different way using team hitting data png(file="teamhittingbubble.png", height=650, width=1000) symbols(teams$HR[teams$Year==2010], teams$BB[teams$Year==2010], circles=teams$radius[teams$Year==2010], inches=0.5, fg="darkblue", bg="#99000070", main="Team Home Runs, Walks & Stolen Bases", xlab="Home Runs", ylab="Walks") text(teams$HR[teams$Year==2010], teams$BB[teams$Year==2010], teams$Tm[teams$Year==2010], cex=0.6, col="black") dev.off()
To leave a comment for the author, please follow the link and comment on their blog: The Prince of Slides.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.