Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In a recent post I walked through baseball’s Pythagorean theorem in Excel. I am finding baseball statistics a great post subject for a variety of reasons: it’s freely available, it can be analyzed in both R and Excel, and it’s something a little out the ordinary that allows us to think creatively about data.
We will again make use of the fantastic Lahman dataset. Unzip the folder and leave a copy of teams.csv somewhere convenient on your machine.
In this post we will calculate how many games behind in the standings each team was at the end of the season for its division since 2010. This will require finding a “max if” value (What is the highest number of games won for any team in each division and year?). We will do this in both R and Excel.
In Excel:
In Excel we will use Get & Transform to read in the csv file, then use an array formula to calculate the “max if.”
1. Read in teams.csv
From the Home Ribbon, go to Data – New Query – From File – From CSV. Then locate and select your teams.csv file.
2. Filter data before 2010
Clicking on the filter button on the yearID column, select rows with a year greater than 2009 (or greater than or equal to 2010).
3. Remove other columns
There are a LOT of extra columns in this dataset. To remove some, click on the yearID column header. Then hold down your control key and click on each header title until you get to “L.” Right-click and select “Remove Other Columns.”
4. Load the table
We are ready to analyze this data. At the top of your Ribbon select “Close and Load.”
5. Create a concatenated field
We want to find how many games behind each team was in its division for this year. To do this, we need to create a unique identifier for each year & division. To do this, we will concatenate yearID, teamID and divID using the CONCATENATE function. You may as well sort it A-Z too:
This new field should be somewhat intuitive. The first group is the 2010 AL Central, followed by the 2010 AL East and so forth.
6. Find the maximum number of wins in each grouping
Here we will use a MAX IF array formula to find the maximum number of wins in each year/division grouping:
=MAX(IF([Concat]=[@Concat],[W]))
Rather than selecting Enter at the end of this formula, select Ctrl + Shift + Enter.
(If this is entirely new territory for you, I suggest Mike Girvin’s book Ctrl+Shift+Enter: Mastering Excel Array Formulas to learn more.)
From there, you can subtract the number of wins from each team by this number to find how many games they were behind in their division at the end of the season. Add a column, GB, that calculates the difference between MaxWins and W.
In R:
To do this in R we will be making use of the dplyr package.
1. Read in teams.csv
#call in dplyr library library(dplyr) #read in teams.csv file teams <- read.csv("C:/RFiles/teams.csv")
2. Filter data before 2010
teams <- filter(teams, yearID >= 2010)
This is done with dplyr’s filter() function.
3. Remove unnecessary columns
teams <- select(teams, yearID:L)
We now use dplyr’s select() function to again remove all the columns after “L.” With this function we use the “:” figure to tell R that we want all columns in between yearID and L.
3. Group data by Year/League/Division
teams <- group_by(teams, yearID, lgID, divID)
In Excel we used the CONCATENATE function to build a unique identifier for each group.
In dplyr, we use the group_by() function to literally group our data based on the variables yearID, lgID and divID. This way we can calculate summary statistics for each group.
4. Find the maximum # of games won for each Year/League/Division group
#create a table with the max # of wins for each League/Div/Year grouping teamsmax <- summarise(teams, divfirstplace = max(W)) #bring this max # into a column of our original dataframe teams <- merge(teams, teamsmax) #calculate the # of games behind teams$GB <- teams$divfirstplace - teams$W
Now we use the summarise() function to find the maximum number of wins for each group, using the base R max() function. The teamsmax data frame contains the maximum number of wins for each group. We then merge this back into our dataset so that the new variable, divfirstplace gives us the maximum number of games won for each group.
From here we can calculate the number of games behind, GB.
Guess what? In this exercise, we arrived at the same answers in R and Excel. Just to check, let’s export our R file and inspect the sum of the GB column in this R file versus our Excel file.
write.csv(teams,"C:/RFiles/teamsR.csv")
Home run! Both come to 3,363 (using the 2010-2017 seasons).
Below is the complete code.
#call in dplyr library library(dplyr) #read in teams.csv file teams <- read.csv("C:/RFiles/teams.csv") #filter greater than 2010 teams <- filter(teams, yearID >= 2010) #select columns from yearID to L teams <- select(teams, yearID:L) #group by year/league/division teams <- group_by(teams, yearID, lgID, divID) #create a table with the max # of wins for each League/Div/Year grouping teamsmax <- summarise(teams, divfirstplace = max(W)) #bring this max # into a column of our original dataframe teams <- merge(teams, teamsmax) #calculate the # of games behind teams$GB <- teams$divfirstplace - teams$W #write to csv write.csv(teams,"C:/RFiles/teamsR.csv")
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.