Baseball, T-tests and statistical surprises
Are MLB players better hitters now than they were 20 years ago? Revolution Analytics' Joseph Rickert uses R to take a look at the data, and offers an instructive lesson in checking your assumptions for statistical tests in the process — Ed.
Data are everywhere, but even for simple things I still seem to spend too much time surfing the web to find an appropriate data set. Many times a data set will come my way for some reason and then end up being interesting for some entirely different reason. If I only had a good way to cross-reference all of these data sets: you know, a list of every data set I ever came across, annotated with what it is good for and with a link to where I put it. Of course, if I had this list I would have to keep it in my list of all my other lists; and then, to make sure I could remember where it was, I would have to keep it in a special place and ….
Anyway, baseball season is here and few things in the world do data better than baseball. From opening day (3/31/11) to the end of October, the ballparks of this country will generate more data than I can keep up with. Fortunately, there are others who can. Take a look, for example, at Baseball Prospectus for comprehensive baseball statistics organized by year in CSV files that are easy to download, or at Sean Lahman’s website for a comprehensive database going back to 1871.
Given that today is the first day of the baseball season, I have the perfect excuse and data set to illustrate how important it is to check the assumptions before doing a t-test. Let's look at the batting averages (AVG) for both major leagues for the years 1990 and 2010. Except for the apparent increase in variability for 2010, the box plots for the two distributions look pretty similar.
So it might seem reasonable to do a simple t-test to see if there is any significant difference. In R this is one line of code that produces the result:
> t.test(AVG ~ YEAR, data=bdat, var.equal=TRUE)

        Two Sample t-test

data:  AVG by YEAR
t = 1.4098, df = 1682, p-value = 0.1588
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.003395079  0.020751550
sample estimates:
mean in group 1990 mean in group 2010
         0.2034081          0.1947299

Because the confidence interval contains 0, there is no reason to reject the null hypothesis that the means of the two distributions are the same; so it appears that there is no change in batting average between 1990 and last season. However, is the t-test the right test to use? Forget about the fact that I blew off the apparent increase in variability as irrelevant; are the two distributions even approximately Normal?
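Both worries raised here can be checked in a few lines. What follows is only a minimal sketch on simulated data: the two-component mixture, the `sim_avg` helper and the `sim` data frame are made up for illustration and are not the Baseball Prospectus numbers. `shapiro.test()` tests each group for Normality, and calling `t.test()` without `var.equal=TRUE` runs Welch's version of the test, which does not assume equal variances.

```r
# Simulated stand-in for two seasons of batting averages (NOT the real data).
# Each group mixes "weak" and "regular" hitters, so it is bimodal by construction.
set.seed(2011)
sim_avg <- function(n, spread) {
  n_weak <- round(0.3 * n)
  c(rnorm(n_weak, mean = 0.10, sd = spread),
    rnorm(n - n_weak, mean = 0.26, sd = spread))
}
sim <- data.frame(
  AVG  = c(sim_avg(800, 0.03), sim_avg(800, 0.04)),
  YEAR = factor(rep(c("1990", "2010"), each = 800))
)

# Shapiro-Wilk test of Normality per year; small p-values argue against Normality
norm_p <- tapply(sim$AVG, sim$YEAR, function(x) shapiro.test(x)$p.value)
print(norm_p)

# Welch's t-test (the default when var.equal is not set) drops the
# equal-variance assumption, but it still leans on approximate Normality
welch <- t.test(AVG ~ YEAR, data = sim)
print(welch)
```

With the real data, the same two calls would run on `bdat` instead of `sim`.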
Good thing I checked! The kernel density plots indicate that these distributions have two, or maybe three, modes: not even close to Normal! These plots are pretty interesting on their own: when I watch hitters this season I’ll be wondering under which bump they belong. But getting back to a formal test to look for a difference in the batting averages for the 1990 and 2010 seasons, it appears that the Wilcoxon Rank Sum Test (also known as the Mann-Whitney test, and which doesn't assume the distributions are Normal) is the way to go.
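Counting the bumps by eye can be backed up with a rough count of the local maxima in the kernel density estimate. This is a sketch under stated assumptions: the sample `x` is a simulated two-component mixture (not the real batting averages), and the peak-finding rule is a simple heuristic rather than a formal test for multimodality.

```r
# Simulated bimodal sample (made-up numbers, NOT the real batting averages)
set.seed(42)
x <- c(rnorm(300, mean = 0.10, sd = 0.03),
       rnorm(700, mean = 0.26, sd = 0.03))

d <- density(x)  # Gaussian kernel with the default bandwidth

# A local maximum is a grid point where the slope of the estimate
# switches from positive to negative
peak_idx <- which(diff(sign(diff(d$y))) == -2) + 1
modes <- d$x[peak_idx]
print(modes)
```

The number of peaks found this way depends on the amount of smoothing, so it is worth repeating the count with a different `adjust` value in `density()` before reading much into a small bump.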
> wilcox.test(AVG ~ YEAR, data=bdat, conf.int=TRUE)

        Wilcoxon rank sum test with continuity correction

data:  AVG by YEAR
W = 370070.5, p-value = 0.03541
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 6.849543e-06 1.302933e-02
sample estimates:
difference in location
           0.004991099
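The same quantities can be pulled out of the result object instead of being read off the printout. A small sketch, again on simulated data (the `sim` data frame and its shift of 0.02 are assumptions chosen for illustration, not the real batting averages):

```r
# Simulated data with a small shift between groups (NOT the real batting averages)
set.seed(1)
sim <- data.frame(
  AVG  = c(rnorm(500, mean = 0.25, sd = 0.05),
           rnorm(500, mean = 0.23, sd = 0.05)),
  YEAR = factor(rep(c("1990", "2010"), each = 500))
)

res <- wilcox.test(AVG ~ YEAR, data = sim, conf.int = TRUE)

res$p.value   # p-value of the rank sum test
res$conf.int  # confidence interval for the location shift
res$estimate  # estimated difference in location (first group minus second)
```

The `conf.int` and `estimate` components are only present because `conf.int=TRUE` was passed.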
The Wilcoxon test indicates that there is a significant difference, at the 5% level anyway. (The R code for the charts and analysis appears below.) This was a surprise to me; maybe I’m easily surprised, but isn’t life more fun that way? I am looking forward to quite a few surprises from 2011 Baseball. Enjoy the season!
##################################################################
# Get data and build some data frames
dataDir <- "C:/Users/Joseph/Documents/Revolution/Baseball"
fn1990 <- file.path(dataDir,"bpstats_1990.csv")
df1990 <- read.csv(fn1990)
AVG1990 <- df1990$AVG
fn2010 <- file.path(dataDir,"bpstats_2010.csv")
df2010 <- read.csv(fn2010)
AVG2010 <- df2010$AVG
##################################################################
# make a "long form" data frame to generate the boxplot
year1990 <- rep("1990",length(AVG1990))
year2010 <- rep("2010",length(AVG2010))
df1 <- data.frame(AVG1990,year1990)
names(df1) <- c("AVG","YEAR")
df2 <- data.frame(AVG2010,year2010)
names(df2) <- c("AVG","YEAR")
bdat <- rbind(df1,df2)
boxplot(AVG ~ YEAR,data=bdat,
        col= c("red","blue"),
        main="Batting Average Distributions")
####################################################################
# Draw the kernel density plots, one panel per season
par(mfrow=c(2,1))
plot(density(AVG1990),col="red",main="1990 Batting Average Density")
rug(AVG1990,col="green")
plot(density(AVG2010),col="blue",main="2010 Batting Average Density")
rug(AVG2010,col="green")   # rug for the 2010 panel uses the 2010 data
#####################################################################
# perform the tests
t.test(AVG ~ YEAR,data=bdat,var.equal=TRUE)
wilcox.test(AVG ~ YEAR, data=bdat,conf.int=TRUE)
#####################################################################