[This article was first published on Bayes Ball, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< !-- tabsets --> < !-- code folding --> The new Major League Baseball season will soon begin, which means it’s time to look back and update my run scoring trends data visualization application, built using RStudio’s shiny package.
You can find the app here: https://monkmanmh.shinyapps.io/MLBrunscoring_shiny/
The github repo for this app is https://github.com/MonkmanMH/MLBrunscoring_shiny
This update gave me the opportunity to make some cosmetic tweaks to the front end, and more consequential changes to the code under the hood.
1. retired
At one point in the app’s code, I had used the now-retired
In previous versions of the app, the filtering of rows (records or observations) and selecting of columns (variables), the calculation of each team’s average runs scored and runs allowed per game, the league average runs per game, and the comparison of the team to the league, was done first using base R’s
Then I switched to using
For this iteration, preparing the data for the app is now done using
It’s always valuable to have your end-state in mind when working through a multi-step data wrangle like this. My goal is the values shown on the “team plot” tab of the app – an index value (i.e. a percentage) of a team’s average runs scored (and runs allowed) compared to the league run scoring rate, for a single season.
Thus rather than having the app do the work of
At this point, the code
Note:
For the first approach, the sum calculation is part of the function.
Notes:
Note that we start with
Notes:
-30-
< !-- dynamically load mathjax for compatibility with self-contained -->
You can find the app here: https://monkmanmh.shinyapps.io/MLBrunscoring_shiny/
The github repo for this app is https://github.com/MonkmanMH/MLBrunscoring_shiny
This update gave me the opportunity to make some cosmetic tweaks to the front end, and more consequential changes to the code under the hood.
1. retired reshape
, using tidyr
At one point in the app’s code, I had used the now-retired reshape
package to melt a data table. Although this still works, I opted to update the code to use the gather()
function in the package tidyr
, part of the tidyverse.2. feather instead of csv
The app relied on some pre-wrangled csv files; these have been replaced by files stored using the .feather format, which makes for a signficant performance improvement.3. wrangling: the calculation of team and league run scoring averages
The goal is to create data tables that minimize the amount of processing the app has to do.In previous versions of the app, the filtering of rows (records or observations) and selecting of columns (variables), the calculation of each team’s average runs scored and runs allowed per game, the league average runs per game, and the comparison of the team to the league, was done first using base R’s
apply
family of functions.Then I switched to using
dplyr
, and although the steps were now in a pipe, this approach still required creating a separate data table with the league average, and then joining that table back into the main team table so that the team result could be compared to the league average.For this iteration, preparing the data for the app is now done using
tidyr::nest()
and purrr::map()
. What follows is a detailed explanation of how I approached this.It’s always valuable to have your end-state in mind when working through a multi-step data wrangle like this. My goal is the values shown on the “team plot” tab of the app – an index value (i.e. a percentage) of a team’s average runs scored (and runs allowed) compared to the league run scoring rate, for a single season.
a. Load packages and read the data
First, load the necessary packages, the first four of which are part of the tidyverse.# tidyverse packages library(dplyr) library(purrr) library(readr) library(tidyr) library(feather)Then, read in the data.
- It comes from the “Lahman” baseball database, so-named after the originator and host, Sean Lahman. The website www.seanlahman.com has the current version of the database, as well as an archive of previous versions.
- The data table “Teams” is sourced from the baseballdatabank on github https://github.com/chadwickbureau/baseballdatabank)); I’ve also saved in the github repo for this app.
- There is an R package with the Lahman database, but it’s currently two seasons behind the source files.
Teams <- read_csv("Teams.csv", col_types = cols( divID = col_character(), DivWin = col_character(), SF = col_character(), WCWin = col_character() )) head(Teams) ## # A tibble: 6 x 48 ## yearID lgID teamID franchID divID Rank G Ghome W L DivWin ## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 1871 <NA> BS1 BNA <NA> 3 31 NA 20 10 <NA> ## 2 1871 <NA> CH1 CNA <NA> 2 28 NA 19 9 <NA> ## 3 1871 <NA> CL1 CFC <NA> 8 29 NA 10 19 <NA> ## 4 1871 <NA> FW1 KEK <NA> 7 19 NA 7 12 <NA> ## 5 1871 <NA> NY2 NNA <NA> 5 33 NA 16 17 <NA> ## 6 1871 <NA> PH1 PNA <NA> 1 28 NA 21 7 <NA> ## # ... with 37 more variables: WCWin <chr>, LgWin <chr>, WSWin <chr>, ## # R <dbl>, AB <dbl>, H <dbl>, `2B` <dbl>, `3B` <dbl>, HR <dbl>, ## # BB <dbl>, SO <dbl>, SB <dbl>, CS <dbl>, HBP <dbl>, SF <chr>, RA <dbl>, ## # ER <dbl>, ERA <dbl>, CG <dbl>, SHO <dbl>, SV <dbl>, IPouts <dbl>, ## # HA <dbl>, HRA <dbl>, BBA <dbl>, SOA <dbl>, E <dbl>, DP <dbl>, ## # FP <dbl>, name <chr>, park <chr>, attendance <dbl>, BPF <dbl>, ## # PPF <dbl>, teamIDBR <chr>, teamIDlahman45 <chr>, teamIDretro <chr>The table above has far more variables than what we need, and some that we’ll have to calculate.
b. Create league summary tables
A short set of instructions that starts with the “Teams” table in the Lahman database and summarizes it for MLB run scoring trends Shiny appThus rather than having the app do the work of
- remove unnecessary records (rows) and fields (columns) and
- run the calculations for the runs-per-game, runs-allowed-per-game, and indexed versions of those,
i. create nested table
I started with the “Many Models”” chapter of Wickham and Grolemund, R for Data Science. (And thanks to Dr. Charlotte Wickham, whose training course was invaluable in helping me wrap my head around this.)At this point, the code
- filters out the years prior to 1901 and the misbegotten Federal League.
- and then creates a nested data table, starting with the
group_by()
year and league (lgID
)
# select a sub-set of teams from 1901 [the establishment of the American League] forward to most recent year Teams_lgyr <- Teams %>% filter(yearID > 1900, lgID != "FL") %>% group_by(yearID, lgID) %>% nest() Teams_lgyr ## # A tibble: 236 x 3 ## yearID lgID data ## <dbl> <chr> <list> ## 1 1901 AL <tibble [8 x 46]> ## 2 1901 NL <tibble [8 x 46]> ## 3 1902 AL <tibble [8 x 46]> ## 4 1902 NL <tibble [8 x 46]> ## 5 1903 AL <tibble [8 x 46]> ## 6 1903 NL <tibble [8 x 46]> ## 7 1904 AL <tibble [8 x 46]> ## 8 1904 NL <tibble [8 x 46]> ## 9 1905 AL <tibble [8 x 46]> ## 10 1905 NL <tibble [8 x 46]> ## # ... with 226 more rowsHere’s a quick peek inside the first entry of the “data” column…the American League, 1901.
Teams_lgyr$data[[1]] ## # A tibble: 8 x 46 ## teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> ## 1 BLA NYY <NA> 5 134 66 68 65 <NA> <NA> N ## 2 BOS BOS <NA> 2 138 69 79 57 <NA> <NA> N ## 3 CHA CHW <NA> 1 137 71 83 53 <NA> <NA> Y ## 4 CLE CLE <NA> 7 138 69 54 82 <NA> <NA> N ## 5 DET DET <NA> 3 135 70 74 61 <NA> <NA> N ## 6 MLA BAL <NA> 8 139 70 48 89 <NA> <NA> N ## 7 PHA OAK <NA> 4 137 66 74 62 <NA> <NA> N ## 8 WS1 MIN <NA> 6 138 68 61 72 <NA> <NA> N ## # ... with 35 more variables: WSWin <chr>, R <dbl>, AB <dbl>, H <dbl>, ## # `2B` <dbl>, `3B` <dbl>, HR <dbl>, BB <dbl>, SO <dbl>, SB <dbl>, ## # CS <dbl>, HBP <dbl>, SF <chr>, RA <dbl>, ER <dbl>, ERA <dbl>, ## # CG <dbl>, SHO <dbl>, SV <dbl>, IPouts <dbl>, HA <dbl>, HRA <dbl>, ## # BBA <dbl>, SOA <dbl>, E <dbl>, DP <dbl>, FP <dbl>, name <chr>, ## # park <chr>, attendance <dbl>, BPF <dbl>, PPF <dbl>, teamIDBR <chr>, ## # teamIDlahman45 <chr>, teamIDretro <chr>
ii – functional programming
This step creates a league run scoring function, and then applies that using thepurrr::map()
function.Note:
- In the gapminder example in R for Data Science, the variables were called using their names. In this case, for a reason I have not yet determined, we have to specify the data object they are coming from; e.g. for the runs variable
R
, we have to usedf$R
(not justR
).
dplyr
:# base R format leagueRuns_fun <- function(df) { sum(data = df$R) } league_year_runs <- map(Teams_lgyr$data, leagueRuns_fun) league_year_runs[[1]] ## [1] 5873 #check the answer by old school `dplyr` method Teams %>% filter(yearID == 1901, lgID == "AL") %>% summarise(leagueruns = sum(R)) ## # A tibble: 1 x 1 ## leagueruns ## <dbl> ## 1 5873Now we move on to the calculation of league averages.
For the first approach, the sum calculation is part of the function.
- There are two functions, one for Runs and the other for Runs Allowed. This is because I have not yet figured out how to specify two different variables (i.e. the name of the data object and the variable to be used in the function) in the map_() function and successfully have them carried into my calculation functions
- Also note that in order to be consistent with other sources, the number of games played is calculated using the sum of wins (
W
) and losses (L
), rather than the number of games reported in theG
variable.
# functions leagueRPG_fun <- function(df) { sum(data = df$R) / (sum(data = df$W) + sum(data = df$L)) } leagueRAPG_fun <- function(df) { sum(data = df$RA) / (sum(data = df$W) + sum(data = df$L)) } # simple `map` version league_year_RPG <- map(Teams_lgyr$data, leagueRPG_fun) # embed as new columns in nested data object Teams_lgyr <- Teams_lgyr %>% mutate(lgRPG = map_dbl(Teams_lgyr$data, leagueRPG_fun), lgRAPG = map_dbl(Teams_lgyr$data, leagueRAPG_fun)) Teams_lgyr ## # A tibble: 236 x 5 ## yearID lgID data lgRPG lgRAPG ## <dbl> <chr> <list> <dbl> <dbl> ## 1 1901 AL <tibble [8 x 46]> 5.43 5.43 ## 2 1901 NL <tibble [8 x 46]> 4.69 4.69 ## 3 1902 AL <tibble [8 x 46]> 4.97 4.97 ## 4 1902 NL <tibble [8 x 46]> 4.09 4.09 ## 5 1903 AL <tibble [8 x 46]> 4.15 4.15 ## 6 1903 NL <tibble [8 x 46]> 4.85 4.85 ## 7 1904 AL <tibble [8 x 46]> 3.65 3.65 ## 8 1904 NL <tibble [8 x 46]> 3.98 3.98 ## 9 1905 AL <tibble [8 x 46]> 3.75 3.75 ## 10 1905 NL <tibble [8 x 46]> 4.16 4.16 ## # ... with 226 more rowsIn the second approach:
- the league and year total runs, runs allowed, and games are first calculated using separate functions
- RPG and RAPG for each league and year combination are then calculated outside the nested tibbles
# more functions - individual league by year totals leagueR_fun <- function(df) { sum(data = df$R) } leagueRA_fun <- function(df) { sum(data = df$RA) } leagueG_fun <- function(df) { (sum(data = df$W) + sum(data = df$L)) } Teams_lgyr <- Teams_lgyr %>% mutate(lgR = map_dbl(Teams_lgyr$data, leagueR_fun), lgRA = map_dbl(Teams_lgyr$data, leagueRA_fun), lgG = map_dbl(Teams_lgyr$data, leagueG_fun)) Teams_lgyr <- Teams_lgyr %>% mutate(lgRPG = (lgR / lgG), lgRAPG = (lgRA / lgG)) Teams_lgyr ## # A tibble: 236 x 8 ## yearID lgID data lgRPG lgRAPG lgR lgRA lgG ## <dbl> <chr> <list> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1901 AL <tibble [8 x 46]> 5.43 5.43 5873 5873 1082 ## 2 1901 NL <tibble [8 x 46]> 4.69 4.69 5194 5194 1108 ## 3 1902 AL <tibble [8 x 46]> 4.97 4.97 5407 5407 1088 ## 4 1902 NL <tibble [8 x 46]> 4.09 4.09 4494 4494 1098 ## 5 1903 AL <tibble [8 x 46]> 4.15 4.15 4543 4543 1096 ## 6 1903 NL <tibble [8 x 46]> 4.85 4.85 5349 5349 1102 ## 7 1904 AL <tibble [8 x 46]> 3.65 3.65 4433 4433 1216 ## 8 1904 NL <tibble [8 x 46]> 3.98 3.98 4872 4872 1224 ## 9 1905 AL <tibble [8 x 46]> 3.75 3.75 4547 4547 1212 ## 10 1905 NL <tibble [8 x 46]> 4.16 4.16 5092 5092 1224 ## # ... with 226 more rows
iii. save LG_RPG files
And then write csv and feather versions. As noted above, the shiny app now uses thefeather
format.Notes:
- rounding of variables after all the calculations, simply to make the tables as viewed more legible.
- renaming of variables to correspond with shiny app names.
LG_RPG <- Teams_lgyr %>% mutate(lgRPG = round(lgRPG, 2), lgRAPG = round(lgRAPG, 2)) %>% select(yearID, lgID, R = lgR, RA = lgRA, G = lgG, leagueRPG = lgRPG, leagueRAPG = lgRAPG) LG_RPG ## # A tibble: 236 x 7 ## yearID lgID R RA G leagueRPG leagueRAPG ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1901 AL 5873 5873 1082 5.43 5.43 ## 2 1901 NL 5194 5194 1108 4.69 4.69 ## 3 1902 AL 5407 5407 1088 4.97 4.97 ## 4 1902 NL 4494 4494 1098 4.09 4.09 ## 5 1903 AL 4543 4543 1096 4.15 4.15 ## 6 1903 NL 5349 5349 1102 4.85 4.85 ## 7 1904 AL 4433 4433 1216 3.65 3.65 ## 8 1904 NL 4872 4872 1224 3.98 3.98 ## 9 1905 AL 4547 4547 1212 3.75 3.75 ## 10 1905 NL 5092 5092 1224 4.16 4.16 ## # ... with 226 more rows write_csv(LG_RPG, "LG_RPG.csv") write_feather(LG_RPG, "LG_RPG.feather")
c. Repeat for MLB total
This only differs from the league summaries in the level of nesting; instead of grouping by year and league, it’s only year (yearID
).Teams_lgyr <- Teams_lgyr %>% unnest() %>% group_by(yearID) %>% nest() Teams_lgyr ## # A tibble: 118 x 2 ## yearID data ## <dbl> <list> ## 1 1901 <tibble [16 x 52]> ## 2 1902 <tibble [16 x 52]> ## 3 1903 <tibble [16 x 52]> ## 4 1904 <tibble [16 x 52]> ## 5 1905 <tibble [16 x 52]> ## 6 1906 <tibble [16 x 52]> ## 7 1907 <tibble [16 x 52]> ## 8 1908 <tibble [16 x 52]> ## 9 1909 <tibble [16 x 52]> ## 10 1910 <tibble [16 x 52]> ## # ... with 108 more rows Teams_lgyr <- Teams_lgyr %>% mutate(mlbR = map_dbl(Teams_lgyr$data, leagueR_fun), mlbRA = map_dbl(Teams_lgyr$data, leagueRA_fun), mlbG = map_dbl(Teams_lgyr$data, leagueG_fun), mlbRPG = (mlbR / mlbG), mlbRAPG = (mlbRA / mlbG)) Teams_lgyr ## # A tibble: 118 x 7 ## yearID data mlbR mlbRA mlbG mlbRPG mlbRAPG ## <dbl> <list> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1901 <tibble [16 x 52]> 11067 11067 2190 5.05 5.05 ## 2 1902 <tibble [16 x 52]> 9901 9901 2186 4.53 4.53 ## 3 1903 <tibble [16 x 52]> 9892 9892 2198 4.50 4.50 ## 4 1904 <tibble [16 x 52]> 9305 9305 2440 3.81 3.81 ## 5 1905 <tibble [16 x 52]> 9639 9639 2436 3.96 3.96 ## 6 1906 <tibble [16 x 52]> 8881 8878 2416 3.68 3.67 ## 7 1907 <tibble [16 x 52]> 8703 8703 2406 3.62 3.62 ## 8 1908 <tibble [16 x 52]> 8422 8422 2456 3.43 3.43 ## 9 1909 <tibble [16 x 52]> 8810 8810 2436 3.62 3.62 ## 10 1910 <tibble [16 x 52]> 9584 9584 2446 3.92 3.92 ## # ... with 108 more rowsAnd again, we save the files for use in the shiny app.
MLB_RPG <- Teams_lgyr %>% mutate(mlbRPG = round(mlbRPG, 2), mlbRAPG = round(mlbRAPG, 2)) %>% select(yearID, R = mlbR, RA = mlbRA, G = mlbG, leagueRPG = mlbRPG, leagueRAPG = mlbRAPG) write_csv(MLB_RPG, "MLB_RPG.csv") write_feather(MLB_RPG, "MLB_RPG.feather")
d. Individual team values
Calculate index of team run scoring against league averageNote that we start with
unnest()
and create a new object, Teams_append
… a tibble with all of the variables exposed.Teams_append <- Teams_lgyr %>% unnest() %>% mutate(teamRPG=(R / (W + L)), teamRAPG=(RA / (W + L)), WLpct=(W / (W + L))) %>% # runs scored index where 100=the league average for that season mutate(R_index = (teamRPG / lgRPG) * 100) %>% mutate(R_index.sd = sd(R_index)) %>% mutate(R_z = (R_index - 100) / R_index.sd) %>% # runs allowed mutate(RA_index = (teamRAPG / lgRAPG) * 100) %>% mutate(RA_index.sd = sd(RA_index)) %>% mutate(RA_z = (RA_index - 100) / RA_index.sd) Teams_append ## # A tibble: 2,496 x 67 ## yearID mlbR mlbRA mlbG mlbRPG mlbRAPG lgID lgRPG lgRAPG lgR lgRA ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873 ## 2 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873 ## 3 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873 ## 4 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873 ## 5 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873 ## 6 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873 ## 7 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873 ## 8 1901 11067 11067 2190 5.05 5.05 AL 5.43 5.43 5873 5873 ## 9 1901 11067 11067 2190 5.05 5.05 NL 4.69 4.69 5194 5194 ## 10 1901 11067 11067 2190 5.05 5.05 NL 4.69 4.69 5194 5194 ## # ... with 2,486 more rows, and 56 more variables: lgG <dbl>, ## # teamID <chr>, franchID <chr>, divID <chr>, Rank <dbl>, G <dbl>, ## # Ghome <dbl>, W <dbl>, L <dbl>, DivWin <chr>, WCWin <chr>, LgWin <chr>, ## # WSWin <chr>, R <dbl>, AB <dbl>, H <dbl>, `2B` <dbl>, `3B` <dbl>, ## # HR <dbl>, BB <dbl>, SO <dbl>, SB <dbl>, CS <dbl>, HBP <dbl>, SF <chr>, ## # RA <dbl>, ER <dbl>, ERA <dbl>, CG <dbl>, SHO <dbl>, SV <dbl>, ## # IPouts <dbl>, HA <dbl>, HRA <dbl>, BBA <dbl>, SOA <dbl>, E <dbl>, ## # DP <dbl>, FP <dbl>, name <chr>, park <chr>, attendance <dbl>, ## # BPF <dbl>, PPF <dbl>, teamIDBR <chr>, teamIDlahman45 <chr>, ## # teamIDretro <chr>, teamRPG <dbl>, teamRAPG <dbl>, WLpct <dbl>, ## # R_index <dbl>, R_index.sd <dbl>, R_z <dbl>, RA_index <dbl>, ## # RA_index.sd <dbl>, RA_z <dbl>In this the final step, we first create a new data object
Teams_merge
.Notes:
- rounding of a variety of the calculated variables, to address readability concerns.
- selection and renaming of variables to correspond with shiny app names.
- then write csv and feather versions.
Teams_merge <- Teams_append %>% mutate(lgRPG = round(lgRPG, 2), lgRAPG = round(lgRAPG, 2), WLpct = round(WLpct, 3), teamRPG = round(teamRPG, 2), teamRAPG = round(teamRAPG, 2), R_index = round(R_index, 1), RA_index = round(RA_index, 1) ) %>% select(yearID, lgID, franchID, teamID, name, W, L, WLpct, R.x = R, RA.x = RA, teamRPG, leagueRPG = lgRPG, R_index, teamRAPG, leagueRAPG = lgRAPG, RA_index) Teams_merge ## # A tibble: 2,496 x 16 ## yearID lgID franchID teamID name W L WLpct R.x RA.x teamRPG ## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1901 AL NYY BLA Balt~ 68 65 0.511 760 750 5.71 ## 2 1901 AL BOS BOS Bost~ 79 57 0.581 759 608 5.58 ## 3 1901 AL CHW CHA Chic~ 83 53 0.61 819 631 6.02 ## 4 1901 AL CLE CLE Clev~ 54 82 0.397 666 831 4.9 ## 5 1901 AL DET DET Detr~ 74 61 0.548 741 694 5.49 ## 6 1901 AL BAL MLA Milw~ 48 89 0.35 641 828 4.68 ## 7 1901 AL OAK PHA Phil~ 74 62 0.544 805 760 5.92 ## 8 1901 AL MIN WS1 Wash~ 61 72 0.459 682 771 5.13 ## 9 1901 NL LAD BRO Broo~ 79 57 0.581 744 600 5.47 ## 10 1901 NL ATL BSN Bost~ 69 69 0.5 531 556 3.85 ## # ... with 2,486 more rows, and 5 more variables: leagueRPG <dbl>, ## # R_index <dbl>, teamRAPG <dbl>, leagueRAPG <dbl>, RA_index <dbl> write_csv(Teams_merge, "Teams_merge.csv") write_feather(Teams_merge, "Teams_merge.feather")And the feather files can now be incorproated into the shiny app.
-30-
To leave a comment for the author, please follow the link and comment on their blog: Bayes Ball.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.