Analyzing the US Masters 2020 1-Hour ePostal Results

Welcome to Swimming + Data Science on Swimming + Data Science

2 years ago

[This article was first published on Welcome to Swimming + Data Science on Swimming + Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Every year US Masters Swimming runs an event called the 1 Hour ePostal. Rules are simple – athletes swim as many lengths of a 25 yard (or longer) swimming pool as possible in one hour, during the month of February, and without the aid of equipment beyond a legal suit and goggles. Total distances can then be submitted electronically (hence the “e”). The athlete who covers the longest distance in each age/gender group is National Champion. Some people are specifically competing for that title, and perhaps shooting for associated national records. Others are just out to push themselves, to have a good time, or maybe just to get out the house.

In any case, the ePostal results are an interesting read. They’re posted here as .pdf files. The first step is to wrestle the .pdf into a dataframe.

I’m going to use a modified version of the Read_Results and Swim_Parse functions from the latest development version of my SwimmeR package. While it’s perhaps the most technically interesting (or at least involved) portion of this work, the function is also several hundred lines long. I won’t reproduce it in this blog post, but it, and all the other data discussed below, is available on github. I’ll start instead with data I’ve already read in.

library(tidyverse)
library(viridis)


urlfile <- "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/Postal_2020.csv"

df_2020 <- read_csv(url(urlfile))

head(df_2020)
## # A tibble: 6 x 9
##   Place Name            Age Club  Distance USMS_ID  Gender  Year National_Record
##   <dbl> <chr>         <dbl> <chr>    <dbl> <chr>    <chr>  <dbl> <chr>          
## 1     1 Jason Weis       24 NEM       4550 0202-0B… M       2020 N              
## 2     2 Ryan William…    19 1776      4095 080H-J7… M       2020 N              
## 3     1 Ryan Wadding…    29 DAM       5240 3805-HF… M       2020 N              
## 4     2 Austin Teuni…    25 AQST      5125 520H-KK… M       2020 N              
## 5     3 Andrew Barma…    27 CVMM      5070 330S-ND… M       2020 N              
## 6     4 Mitchell Vic…    26 CRUZ      4870 3800-0A… M       2020 N

USMS divides atheltes into age groups five years wide with the exception of the first. Athletes ages 18-24 are one age group, athletes 25-29 are another and so forth. Let’s add age groups to the dataframe.

df_2020 <- df_2020 %>% 
      mutate(Age_Group = case_when(
      Age <= 24 ~ "18-24",
      Age > 24 & Age < 30 ~ "25-29",
      Age > 29 & Age < 35 ~ "30-34",
      Age > 34 & Age < 40 ~ "35-39",
      Age > 39 & Age < 45 ~ "40-44",
      Age > 44 & Age < 50 ~ "45-49",
      Age > 49 & Age < 55 ~ "50-54",
      Age > 54 & Age < 60 ~ "55-59",
      Age > 59 & Age < 65 ~ "60-64",
      Age > 64 & Age < 70 ~ "65-69",
      Age > 69 & Age < 75 ~ "70-74",
      Age > 74 & Age < 80 ~ "75-79",
      Age > 79 & Age < 85 ~ "80-84",
      Age > 84 & Age < 90 ~ "85-89",
      Age > 89 & Age < 95 ~ "90-94",
      Age > 94 & Age < 100 ~ "95-99",
      Age > 99 & Age < 105 ~ "100-104"
    ))

Now that we’ve got my data the way we want it let’s do some quick stats and plots just to get a feel for it.

Every USMS athlete has a unique identifying number, stored under the USMS_ID variable. Let’s see how many athletes participated.

length(unique(df_2020$USMS_ID))
## [1] 970

A total of 970 people swam more or less continuously for one hour in the month of Febuary 2020 and then submitted their times. Not too shabby!

Let’s look at how those athletes are distributed by age group and gender.

df_2020 %>%
  group_by(Age_Group, Gender) %>%
  summarise(`No. of athletes` = n()) %>%
  ggplot(aes(x = Age_Group, y = `No. of athletes`, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_viridis(discrete = TRUE) +
  labs(x = "Age Group") +
  theme_bw() +
  theme(axis.text.x  = element_text(angle = 45, vjust = 0.5))

So significantly more women participated up until the 70-74 age group, then more participants were men, including the two oldest age groups represented, where all the participants were men. Not what I would have expected based on general life expectancy data – an interesting find!

Now let’s take a look at the total results. A distribution of distances swam by each age group and split by gender would be interesting.

facet_labs <- c("Women", "Men")
names(facet_labs) <- c("W", "M")

df_2020 %>%
  ggplot(aes(y = Distance, x = Age_Group)) +
  geom_boxplot(
    aes(fill = as.factor(Gender)),
    position = position_dodge(1),
    outlier.shape = NA, alpha = 0.6) +
  geom_dotplot(
    aes(color =  as.factor(Gender), fill = as.factor(Gender)),
    binaxis = "y", stackdir = 'center',
    position = position_dodge(1), binwidth = 25,
    dotsize = 3) +
  theme_bw() +
  labs(x = "Age Group", y = "Distance (y)",
       title = "2020 US Masters ePostal Results") +
  theme(plot.title = element_text( hjust = 0.5, vjust = 0.5, face = "bold")) +
  theme(axis.text.x  = element_text(angle = 45, vjust = 0.5)) +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE) +
  guides(color = FALSE, alpha = FALSE, fill = FALSE) +
  facet_wrap(. ~ Gender, labeller =  labeller(Gender = facet_labs))

Not surprisingly younger athletes swam further on average than older athletes. It’s tempting to look at this data and conclude that as people get older their athletic performance declines. While that may be true we can’t infer any causal relationships here, because this is just observational data. Age related declines are one possible explanation, but there are others. For example, what if only the best athletes join USMS and participate in the ePostal when they’re in their teens and twenties? This hypothesis also matches with the data, because there are fewer athletes in the younger age groups, more in the older (at least to a point), and the top end performers in each age group perform about the same until the 60-64 age group. Perhaps as people age they return to swimming, or even start swimming without having swam before, thereby increasing participation numbers, but lowering the average distance swam by older age groups.

facet_labs <- c("Women", "Men")
names(facet_labs) <- c("W", "M")

df_2020 %>%
  group_by(Age_Group, Gender) %>%
  summarise(
    `Avg. Distance` = mean(Distance, na.rm = TRUE),
    `Top Distance` = max(Distance, na.rm = TRUE)) %>%
  pivot_longer(
    cols = c("Avg. Distance", "Top Distance"),
    names_to = "Distance Type",
    values_to = "Distance") %>%
  ggplot(aes(x = Age_Group, y = Distance, color = Gender)) +
  geom_point(aes(shape = `Distance Type`)) +
  geom_line(aes(group = `Distance Type`, linetype = `Distance Type`)) +
  theme_bw() +
  theme(axis.text.x  = element_text(angle = 45, vjust = 0.5)) +
  facet_wrap(. ~ Gender, labeller =  labeller(Gender = facet_labs)) +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE) +
  guides(color = FALSE) +
  geom_hline(
    data = data.frame(yint = 5011, Gender = "W"),
    aes(yintercept = yint), alpha = 0.5, linetype = 2) +
  geom_hline(
    data = data.frame(yint = 5184, Gender = "M"),
    aes(yintercept = yint), alpha = 0.5, linetype = 2)
## `summarise()` regrouping output by 'Age_Group' (override with `.groups` argument)

Join me in the next post where I’ll examine 1 Hour ePostal results over the last two decades.

To leave a comment for the author, please follow the link and comment on their blog: Welcome to Swimming + Data Science on Swimming + Data Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.