Left-handed catchers
[This article was first published on Bayes Ball, and kindly contributed to R-bloggers.]
Benny Distefano – 1985 Donruss #166 (source: baseball-almanac.com)
Jack Moore, writing on the site Sports on Earth in 2013 (“Why no left-handed catchers?”), points out that the lack of left-handed catchers goes back a long way. One interesting piece of evidence is a 1948 Ripley’s “Believe It Or Not” item featuring a left-handed catcher, Dick Bernard (you can read more about Bernard’s signing in the July 1, 1948 edition of the Tuscaloosa News). Bernard didn’t make the majors, and doesn’t appear in any of the minor league records that are available on-line either.
Dick Bernard in Ripley’s “Believe It or Not”, 1948-12-30 (source: sportsonearth.com)
There are a variety of hypotheses as to why there are no left-handed catchers, all of which are summarized in John Walsh’s “Top 10 Left-Handed Catchers for 2006” (a tongue-in-cheek title if ever there were one) at The Hardball Times. A compelling explanation, and one supported by both Bill James and J.C. Bradbury (in his book The Baseball Economist), is natural selection: a left-handed Little League player who can throw well will be groomed as a pitcher.
Throwing hand by fielding position as an example of a categorical variable
I was looking for some examples of categorical variables to display visually, and the lack of left-handed throwing catchers, compared to other positions, came to mind. The following uses R, and the Lahman database package.
The analysis requires merging the Master and Fielding tables in the Lahman database – the Master table gives the player’s name and his throwing hand, and Fielding tells us how many games at each position they played. For the purpose of this analysis, we’ll look at the seasons 1954 (the first year in the Lahman database that has the outfield positions split into left, centre, and right) through 2012.
You may note that for the merging of the two tables, I used the new dplyr package. I tested the system.time of the basic version of “merge” to combine the two tables, and the “inner_join” in dplyr. The latter is substantially faster: my aging computer ran “merge” in about 5.5 seconds, compared to 0.17 seconds with dplyr.
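The timing comparison can be sketched roughly as follows. This is not the original benchmark: the two toy tables (`master_toy`, `fielding_toy`) are hypothetical stand-ins for Master and Fielding, the timings will differ from machine to machine, and the dplyr half only runs if that package is installed.

```r
# Hypothetical toy tables standing in for Master and Fielding
set.seed(1)
n_players <- 2000
master_toy <- data.frame(
  playerID = sprintf("p%05d", seq_len(n_players)),
  throws   = sample(c("L", "R"), n_players, replace = TRUE, prob = c(0.2, 0.8)),
  stringsAsFactors = FALSE
)
fielding_toy <- data.frame(
  playerID = sample(master_toy$playerID, 20000, replace = TRUE),
  G        = sample(1:162, 20000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Time the base R merge
time_merge <- system.time(
  merged_base <- merge(fielding_toy, master_toy, by = "playerID")
)
print(time_merge["elapsed"])

# Time the dplyr join, if dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  time_join <- system.time(
    merged_dplyr <- dplyr::inner_join(fielding_toy, master_toy, by = "playerID")
  )
  print(time_join["elapsed"])
}
```

Every row of the toy fielding table has a match in the toy master table, so both joins should return the same number of rows as `fielding_toy`.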
    # load the required packages
    require(Lahman)
    require(dplyr)
The first step is to create a new data table that merges the Fielding and Master tables, based on the common variable “playerID”. This new table has one row for each player, by position and season; we use the dim function to show the dimensions of the table.
Then, select only those seasons since 1954 and omit the records that are Designated Hitter (DH) and the summary of outfield positions (OF) (i.e. leave the RF, CF, and LF).
    # merge the Fielding and Master tables on playerID
    MasterFielding <- inner_join(Fielding, Master, by = "playerID")
    dim(MasterFielding)
    ## [1] 164903     52
    # keep 1954 onward; drop the DH and combined OF records
    MasterFielding <- filter(MasterFielding, POS != "OF" & POS != "DH" & yearID > 1953)
    dim(MasterFielding)
    ## [1] 91214    52
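When filtering on a numeric column such as yearID, it matters whether the cutoff is written as a number or as a string. A minimal sketch of the difference, using a hypothetical toy vector of seasons (`toy_years`) rather than the Lahman data:

```r
# Comparing a number with a string makes R coerce the number to character,
# so the comparison is done as text rather than numerically.
1954 > "1953"   # TRUE, because these four-digit strings happen to sort like numbers
10 > "9"        # FALSE: as text, "10" sorts before "9"

# Hypothetical toy vector of seasons, filtered with a numeric cutoff
toy_years <- c(1953, 1954, 1987, 2012)
recent_years <- toy_years[toy_years > 1953]
print(recent_years)
```

For four-digit years the two forms give the same answer, but the numeric cutoff does not rely on that coincidence.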
This table needs to be summarized one step further – a single row for each player, counting how many games played at each position.
    # one row per player and position, with total games played
    Player_games <- MasterFielding %>%
      group_by(playerID, nameFirst, nameLast, POS, throws) %>%
      summarise(gamecount = sum(G)) %>%
      arrange(desc(gamecount))
    dim(Player_games)
    ## [1] 19501     6
    head(Player_games)
    ## Source: local data frame [6 x 6]
    ## Groups: playerID, nameFirst, nameLast, POS
    ##
    ##    playerID nameFirst nameLast POS throws gamecount
    ## 1 robinbr01    Brooks Robinson  3B      R      2870
    ## 2 bondsba01     Barry    Bonds  LF      L      2715
    ## 3 vizquom01      Omar  Vizquel  SS      R      2709
    ## 4  mayswi01    Willie     Mays  CF      R      2677
    ## 5 aparilu01      Luis Aparicio  SS      R      2583
    ## 6 jeterde01     Derek    Jeter  SS      R      2531
This table shows the career leaders in games played at a single position (for 1954-2012). We see that Brooks Robinson leads the way with 2,870 games played at third base, and that Derek Jeter, at the end of the 2012 season, was closing in on Omar Vizquel’s career record for games played as a shortstop.
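The same group-and-sum step can be sketched in base R. This is an illustration on a hypothetical toy table (`toy_fielding`, with made-up players and game counts), not a replacement for the dplyr code above:

```r
# Hypothetical season-level records: one row per player, position and season
toy_fielding <- data.frame(
  playerID = c("a01", "a01", "a01", "b02", "b02"),
  POS      = c("SS",  "SS",  "2B",  "C",   "C"),
  throws   = c("R",   "R",   "R",   "L",   "L"),
  G        = c(100,   50,    10,    2,     1),
  stringsAsFactors = FALSE
)

# Sum games within each player/position/throwing-hand combination
toy_games <- aggregate(G ~ playerID + POS + throws, data = toy_fielding, FUN = sum)
names(toy_games)[names(toy_games) == "G"] <- "gamecount"

# Sort so the largest career totals come first
toy_games <- toy_games[order(-toy_games$gamecount), ]
print(toy_games)
```

The five season rows collapse to three career rows, one for each player and position pair.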
Cross-tab Tables
The next step is to prepare a simple cross-tab table (also known as a contingency or pivot table) showing the number of players cross-tabulated by position (POS) and throwing hand (throws).
Here, I’ll demonstrate two ways to do this: first with dplyr’s “group_by” and “summarise” (with a bit of help from reshape2), and then with the “table” function in base R.
    # first method - dplyr
    Player_POS <- Player_games %>%
      group_by(POS, throws) %>%
      summarise(playercount = length(gamecount))
    Player_POS
    ## Source: local data frame [17 x 3]
    ## Groups: POS
    ##
    ##    POS throws playercount
    ## 1   1B      L         411
    ## 2   1B      R        1515
    ## 3   2B      L           4
    ## 4   2B      R        1560
    ## 5   3B      L           4
    ## 6   3B      R        1889
    ## 7    C      L           4
    ## 8    C      R         980
    ## 9   CF      L         393
    ## 10  CF      R        1252
    ## 11  LF      L         544
    ## 12  LF      R        2161
    ## 13   P      L        1452
    ## 14   P      R        3623
    ## 15  RF      L         520
    ## 16  RF      R        1893
    ## 17  SS      R        1296
To transform this long-form table into a traditional cross-tab shape we can use the “dcast” function in reshape2.
    require(reshape2)
    ## Loading required package: reshape2
    dcast(Player_POS, POS ~ throws, value.var = "playercount")
    ##   POS    L    R
    ## 1  1B  411 1515
    ## 2  2B    4 1560
    ## 3  3B    4 1889
    ## 4   C    4  980
    ## 5  CF  393 1252
    ## 6  LF  544 2161
    ## 7   P 1452 3623
    ## 8  RF  520 1893
    ## 9  SS   NA 1296
A second method to get the same result is to use the “table” function, which is part of base R.
    # gmodels is loaded here for the CrossTable function used later; table() is base R
    require(gmodels)
    ## Loading required package: gmodels
    throwPOS <- with(Player_games, table(POS, throws))
    throwPOS
    ##     throws
    ## POS     L    R
    ##   1B  411 1515
    ##   2B    4 1560
    ##   3B    4 1889
    ##   C     4  980
    ##   CF  393 1252
    ##   LF  544 2161
    ##   P  1452 3623
    ##   RF  520 1893
    ##   SS    0 1296
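The two outputs differ in one cell: the long-to-wide reshape reports NA for left-handed shortstops, while the tabulation reports 0. As a sketch of a third route, base R's `xtabs` can turn the long-form counts into a cross-tab that fills missing combinations with zero. The data frame `long_counts` below is typed in by hand from the counts shown above, so it is an assumption standing in for the real Player_POS object.

```r
# Long-form counts copied from the summary table above (17 rows; no SS/L row)
long_counts <- data.frame(
  POS = c("1B", "1B", "2B", "2B", "3B", "3B", "C", "C", "CF", "CF",
          "LF", "LF", "P", "P", "RF", "RF", "SS"),
  throws = c("L", "R", "L", "R", "L", "R", "L", "R", "L", "R",
             "L", "R", "L", "R", "L", "R", "R"),
  playercount = c(411, 1515, 4, 1560, 4, 1889, 4, 980, 393, 1252,
                  544, 2161, 1452, 3623, 520, 1893, 1296)
)

# Sum playercount within each POS x throws cell; absent cells become 0
wide_counts <- xtabs(playercount ~ POS + throws, data = long_counts)
print(wide_counts)
```

Whether a zero or an NA is the better representation depends on the next step; for a chi-squared test or a mosaic plot, a true zero count is what is wanted.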
A more elaborate table can be created using the gmodels package. In this case, we’ll use the CrossTable function to generate a table with row percentages. You’ll note that the format is set to SPSS, so the table output resembles that software’s display style.
    CrossTable(Player_games$POS, Player_games$throws,
               digits=2, format="SPSS",
               prop.r=TRUE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE,  # keeping the row proportions
               chisq=TRUE)                                                 # adding the ChiSquare statistic
    ##
    ##    Cell Contents
    ## |-------------------------|
    ## |                   Count |
    ## |             Row Percent |
    ## |-------------------------|
    ##
    ## Total Observations in Table:  19501
    ##
    ##                  | Player_games$throws
    ## Player_games$POS |        L  |        R  | Row Total |
    ## -----------------|-----------|-----------|-----------|
    ##               1B |      411  |     1515  |     1926  |
    ##                  |    21.34% |    78.66% |     9.88% |
    ## -----------------|-----------|-----------|-----------|
    ##               2B |        4  |     1560  |     1564  |
    ##                  |     0.26% |    99.74% |     8.02% |
    ## -----------------|-----------|-----------|-----------|
    ##               3B |        4  |     1889  |     1893  |
    ##                  |     0.21% |    99.79% |     9.71% |
    ## -----------------|-----------|-----------|-----------|
    ##                C |        4  |      980  |      984  |
    ##                  |     0.41% |    99.59% |     5.05% |
    ## -----------------|-----------|-----------|-----------|
    ##               CF |      393  |     1252  |     1645  |
    ##                  |    23.89% |    76.11% |     8.44% |
    ## -----------------|-----------|-----------|-----------|
    ##               LF |      544  |     2161  |     2705  |
    ##                  |    20.11% |    79.89% |    13.87% |
    ## -----------------|-----------|-----------|-----------|
    ##                P |     1452  |     3623  |     5075  |
    ##                  |    28.61% |    71.39% |    26.02% |
    ## -----------------|-----------|-----------|-----------|
    ##               RF |      520  |     1893  |     2413  |
    ##                  |    21.55% |    78.45% |    12.37% |
    ## -----------------|-----------|-----------|-----------|
    ##               SS |        0  |     1296  |     1296  |
    ##                  |     0.00% |   100.00% |     6.65% |
    ## -----------------|-----------|-----------|-----------|
    ##     Column Total |     3332  |    16169  |    19501  |
    ## -----------------|-----------|-----------|-----------|
    ##
    ##
    ## Statistics for All Table Factors
    ##
    ##
    ## Pearson's Chi-squared test
    ## ------------------------------------------------------------
    ## Chi^2 =  1759     d.f. =  8     p =  0
    ##
    ##
    ##
    ##        Minimum expected frequency: 168.1
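The row percentages and the chi-squared statistic can be cross-checked with base R alone. The sketch below re-enters the counts from the table above by hand into a matrix (`counts`, a name chosen here for illustration) and uses `prop.table` and `chisq.test`; it is a check on the arithmetic, not the method used to produce the output above.

```r
# Counts by position (rows) and throwing hand (columns), copied from the table above
positions <- c("1B", "2B", "3B", "C", "CF", "LF", "P", "RF", "SS")
counts <- matrix(
  c(411,  4,    4,    4,   393,  544,  1452, 520,  0,      # left-handed throwers
    1515, 1560, 1889, 980, 1252, 2161, 3623, 1893, 1296),  # right-handed throwers
  ncol = 2,
  dimnames = list(POS = positions, throws = c("L", "R"))
)

# Row percentages: each position's split between L and R
row_pct <- round(100 * prop.table(counts, margin = 1), 2)
print(row_pct)

# Pearson chi-squared test of independence between position and throwing hand
chi <- chisq.test(counts)
print(chi$statistic)      # test statistic
print(chi$parameter)      # degrees of freedom: (9 - 1) * (2 - 1)
print(min(chi$expected))  # smallest expected cell count
```

The smallest expected count belongs to left-handed catchers: 984 catchers times the overall left-handed share of 3332/19501 gives about 168, against an observed count of 4.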
Mosaic Plot
A mosaic plot is an effective way to graphically represent the contents of the summary tables. Note that the length (left to right) dimension of each bar is constant, comparing proportions, while the height of the bar (top to bottom) varies depending on the absolute number of cases. The mosaic plot function is in the vcd package.
    require(vcd)
    ## Loading required package: vcd
    ## Loading required package: grid
    mosaic(throwPOS, highlighting = "throws", highlighting_fill=c("darkgrey", "white"))
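If the vcd package is not available, a similar picture can be sketched with `mosaicplot` from base R graphics. This is an approximation rather than a reproduction of the vcd plot: the counts are re-entered by hand from the cross-tab above, and the `dir` setting is my attempt to match the layout described (one bar per position, stacked top to bottom, each split left to right by throwing hand).

```r
# Counts by position and throwing hand, copied from the cross-tab above
positions <- c("1B", "2B", "3B", "C", "CF", "LF", "P", "RF", "SS")
throw_counts <- matrix(
  c(411,  4,    4,    4,   393,  544,  1452, 520,  0,
    1515, 1560, 1889, 980, 1252, 2161, 3623, 1893, 1296),
  ncol = 2,
  dimnames = list(POS = positions, throws = c("L", "R"))
)

# Split by position horizontally (stacked bars), then by throwing hand vertically
mosaicplot(throw_counts,
           dir = c("h", "v"),
           color = c("darkgrey", "white"),
           main = "Throwing hand by fielding position, 1954-2012")
```

The grey share of each bar is the left-handed proportion at that position, so the catcher, second base, third base and shortstop bars should appear almost entirely white.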
Conclusion
The clear result is that it’s not just catchers that are overwhelmingly right-handed throwers, it’s also infielders (except first base). There have been very few southpaws playing second and third base, and there have been absolutely no left-handed throwing shortstops in this period. As J.G. Preston puts it in the blog post “Left-handed throwing second basemen, shortstops and third basemen”,
While right-handed throwers can be found at any of the nine positions on a baseball field, left-handers are, in practice, restricted to five of them.
So who are these left-handed oddities? Using the filter function, it’s easy to find out:
    # catchers
    filter(Player_games, POS == "C", throws == "L")
    ## Source: local data frame [4 x 6]
    ## Groups: playerID, nameFirst, nameLast, POS
    ##
    ##    playerID nameFirst  nameLast POS throws gamecount
    ## 1 distebe01     Benny Distefano   C      L         3
    ## 2  longda02      Dale      Long   C      L         2
    ## 3 squirmi01      Mike   Squires   C      L         2
    ## 4 shortch02     Chris     Short   C      L         1
    # second base
    filter(Player_games, POS == "2B", throws == "L")
    ## Source: local data frame [4 x 6]
    ## Groups: playerID, nameFirst, nameLast, POS
    ##
    ##    playerID nameFirst  nameLast POS throws gamecount
    ## 1 marqugo01   Gonzalo   Marquez  2B      L         2
    ## 2 crowege01    George     Crowe  2B      L         1
    ## 3 mattido01       Don Mattingly  2B      L         1
    ## 4 mcdowsa01       Sam  McDowell  2B      L         1
    # third base
    filter(Player_games, POS == "3B", throws == "L")
    ## Source: local data frame [4 x 6]
    ## Groups: playerID, nameFirst, nameLast, POS
    ##
    ##    playerID nameFirst  nameLast POS throws gamecount
    ## 1 squirmi01      Mike   Squires  3B      L        14
    ## 2 mattido01       Don Mattingly  3B      L         3
    ## 3 francte01     Terry  Francona  3B      L         1
    ## 4 valdema02     Mario    Valdez  3B      L         1
My github file for this entry in Markdown is here: [https://github.com/MonkmanMH/Bayesball/blob/master/LeftHandedCatchers.md]
-30-