Fast, Deep, Fast and Deep
Swimming + Data Science has been quiet for a while as I worked on other projects. Today though we’re back at it in an attempt to quantify the speed and depth of swim meets. It’s more than that though, because this is vintage me, back on my hobby horses. We will be using SwimmeR. We will be focusing, perhaps myopically, on swim meets from my local area. We will also be using a discussion I had with some friends about old swimming results as a jumping off point. It’s a perfect Swimming + Data Science trifecta!
So here’s the story. I was officiating at the Section IV Class C sectional finals meet, held recently. During an awards break I got into a conversation with a couple of people about swim meets past, as I am wont to do. These people, and I’m paraphrasing, remarked that
- The meet was generally both slower and less deep than previous years, particularly 2006
- That a specific team from 2006 that these people were associated with would have run rampant all over the current iteration of the meet
Now I enjoy a good “things were better in my day” bit as much as anyone – perhaps even more – but I also enjoy testing assertions. So that’s what we’re going to do. These tests will proceed in two parts, one for each assertion.
To start let’s load some packages. There’s a new version of SwimmeR, v0.13.0, released November 5th 2021. It’s great and has lots of new features, none of which are particularly relevant to the contents of this article. Still though, might as well update.
Setup
install.packages("SwimmeR")
We’ll use a few tidyverse packages plus flextable.
library(SwimmeR)
library(dplyr)
library(stringr)
library(tidyr)
library(flextable)

flextable_style <- function(x) {
  x %>%
    flextable() %>%
    bold(part = "header") %>% # bolds header
    bg(bg = "#D3D3D3", part = "header") # puts gray background behind the header row
}
Collecting the meet results
Collecting meet results with SwimmeR is easy. Simply find a link to the results and pipe them into read_results followed by swim_parse. Here we’ll do that with both the 2006 and 2021 Section IV Class C meet results.
C_2006 <- "http://www.section4swim.com/Results/GirlsHS/2006/Sec4/C/Single.htm" %>% read_results() %>% swim_parse() %>% mutate(Year = 2006) %>% mutate(Event = str_remove(Event, "Girls ")) C_2006 <- C_2006 %>% mutate(Event = factor(Event, levels = unique(C_2006$Event))) C_2021 <- "http://www.section4swim.com/Results/GirlsHS/2021/Sec4/C/Single.htm" %>% read_results() %>% swim_parse() %>% mutate(Year = 2021) %>% mutate(Event = str_remove(Event, "Girls ")) C_2021 <- C_2021 %>% mutate(Event = factor(Event, levels = unique(C_2021$Event)))
Assertion #1 – Faster and Deeper
Fast?
First off let me apologize to our diving friends. “Fast” is a term commonly used by swimming people to describe an entire meet, including diving. Diving obviously isn’t timed though, so the better term might be “strength” – a strong meet, a weak meet etc. Still, I’m going to mostly stick with references to speed just because that’s the more common term.
“Fast” is almost always a relative measure. Is a 4:56.00 in the 500 free fast? Depends on the context. That time would have won the 500 free in a recent men’s dual meet at my alma mater – it would be a fast time in the context of that meet. On the other hand in this women’s dual meet at Stanford a 4:56.00 would have been good for 6th – i.e. last place. Katie Ledecky won that 500 going 4:33.94 – which is fast full stop. What we’d really like is a quantitative measure of the speed for each event, which taken together describe the strength of a meet.
I propose that what’s really meant by a statement like “the meet is fast this year” is that, in some kind of weighted fashion, the times are faster than in years past for the same finish places. Sounds obvious I know. What’s the weighting though? We need it in order to be quantitative. Luckily it’s already been determined: it’s point value by place. First place in these meets is worth 20 points, 16th place is worth 1 point, and it seems reasonable to me that the first place time is 20x more important in describing the speed of a meet than the 16th place time. The 16th place time does still matter a bit though.
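To make the weighting concrete, here’s a minimal sketch with made-up times (not real results): weighted.mean multiplies each mark by its place points, sums them, and divides by the total points, so the winner pulls the average twenty times harder than 16th place.

# Toy illustration with made-up times
toy_times  <- c(59.2, 61.0, 65.5) # seconds; hypothetical 1st, 2nd and 16th place swims
toy_points <- c(20, 17, 1)        # place points used as weights
weighted.mean(toy_times, toy_points) # ~60.2, dominated by the faster swims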
Let’s try to compare the “fastness” of the 2006 and 2021 meets. Our goal is to use the weighted.mean function on times and place-points to get a time (or score for diving) describing the overall strength of each event. We’ll stick C_2006 and C_2021 together, summarise using weighted.mean, and then pivot_wider to format a nice table.
C_Combined <- C_2021 %>%
  bind_rows(C_2006) %>%
  filter(Exhibition < 1, DQ < 1) %>%
  group_by(Event) %>%
  select(Place, Name, Team, Event, Finals_Time, Points, Year) %>%
  filter(Place <= 16) %>%
  filter(is.na(Points) == FALSE)

C_Combined %>%
  group_by(Event, Year) %>%
  summarise(W_Avg = mmss_format(weighted.mean(sec_format(Finals_Time), Points, na.rm = TRUE))) %>%
  mutate(W_Avg = case_when(str_detect(Event, "Diving") ~ as.character(sec_format(W_Avg)),
                           TRUE ~ W_Avg)) %>%
  pivot_wider(names_from = "Year", values_from = "W_Avg") %>%
  flextable_style()
Event | 2006 | 2021 |
200 Yard Medley Relay | 2:08.18 | 2:14.39 |
200 Yard Freestyle | 2:14.29 | 2:18.60 |
200 Yard IM | 2:32.42 | 2:36.67 |
50 Yard Freestyle | 26.63 | 27.64 |
1 mtr Diving | 238.13 | 260.45 |
100 Yard Butterfly | 1:09.05 | 1:13.90 |
100 Yard Freestyle | 59.21 | 1:02.58 |
500 Yard Freestyle | 6:07.36 | 6:23.79 |
200 Yard Freestyle Relay | 1:54.92 | 1:57.09 |
100 Yard Backstroke | 1:08.89 | 1:12.08 |
100 Yard Breaststroke | 1:18.93 | 1:22.85 |
400 Yard Freestyle Relay | 4:04.83 | 4:21.62 |
The 2006 meet was faster than the 2021 meet in every event except diving. The margin was about 4 seconds per 100 – which is a lot.
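As a quick sanity check on that per-100 figure we can difference the 100 free weighted averages from the table above, using SwimmeR’s sec_format to convert the times to seconds.

# Quick check of the per-100 gap using the 100 free weighted averages
sec_format("1:02.58") - sec_format("59.21") # 2021 minus 2006, about 3.4 seconds

Diving is interesting too – let’s look closer.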
C_Combined %>%
  ungroup() %>%
  filter(str_detect(Event, "Diving")) %>%
  select(Place, "Score" = Finals_Time, Year) %>%
  pivot_wider(names_from = "Year", values_from = "Score") %>%
  select(Place, `2006`, `2021`) %>%
  flextable_style()
Place | 2006 | 2021 |
1 | 284.80 | 267.10 |
2 | 267.30 | 258.05 |
3 | 258.50 | 254.70 |
4 | 239.50 | |
5 | 228.65 | |
6 | 225.50 | |
7 | 222.45 | |
8 | 176.85 | |
9 | 169.60 | |
In 2006 there were 9 divers and a winning score of 284.80. In 2021 there were 3 divers and a winning score of 267.10. Yes the weighted average score was higher for 2021, but I ask you – do you think 2021 was a better year for the sport of diving in Section IV Class C than 2006? I sure don’t. As I discuss frequently in this space, models and quantitative analysis are great, but it does help to actually know things too.
Deep?
A quantitative description of meet depth is simpler – it’s the difference in time (or diving score) between 1st place and some other place of interest, usually the last scoring place. If the difference is small the meet is deep. Here we’ll compute the difference, in seconds, between the min and max scoring marks.
C_Combined %>%
  group_by(Event, Year) %>%
  mutate(Finals_Time = sec_format(Finals_Time)) %>%
  summarise(Range = max(Finals_Time, na.rm = TRUE) - min(Finals_Time, na.rm = TRUE)) %>%
  mutate(Range = case_when(str_detect(Event, "Diving", negate = TRUE) ~ mmss_format(Range),
                           TRUE ~ as.character(Range))) %>%
  pivot_wider(names_from = "Year", values_from = "Range") %>%
  flextable_style()
Event | 2006 | 2021 |
200 Yard Medley Relay | 21.56 | 51.07 |
200 Yard Freestyle | 27.07 | 34.50 |
200 Yard IM | 36.27 | 50.27 |
50 Yard Freestyle | 04.69 | 04.79 |
1 mtr Diving | 115.2 | 12.4 |
100 Yard Butterfly | 18.06 | 43.41 |
100 Yard Freestyle | 10.52 | 16.95 |
500 Yard Freestyle | 56.91 | 2:27.39 |
200 Yard Freestyle Relay | 22.53 | 27.86 |
100 Yard Backstroke | 15.95 | 26.31 |
100 Yard Breaststroke | 16.16 | 31.73 |
400 Yard Freestyle Relay | 39.29 | 46.32 |
The range is much smaller in 2006 for all swimming events, meaning the meet was much deeper.
Faster and Deeper
Those people were right on both counts. The 2006 meet was indeed faster and deeper, by a wide margin, than the 2021 meet.
Assertion #2 – An old team returns
The team in question here is the 2006 Dryden Lady Lions, headed by eventual DIII All-American Sheila Rhoades.
To start we’ll need to remove all of the Dryden results from C_2021 – it wouldn’t be fair for Dryden to have two teams. Then we’ll collect only the Dryden results from 2006 and join the two sets of results together. The 2006 Dryden team will then be “swimming” in the 2021 meet.
C_2021_No_Dryden <- C_2021 %>%
  filter(Team != "Dryden")

C_2006_Only_Dryden <- C_2006 %>%
  filter(Team == "Dryden")

C_Comp <- C_2021_No_Dryden %>%
  bind_rows(C_2006_Only_Dryden) %>%
  mutate(Team = str_replace(Team, "Susquehanna Vall$", "Susquehanna Valley"),
         Team = str_replace(Team, "Notre Dame-E$", "Notre Dame-Elmira"),
         Team = str_replace(Team, "Watkins Glen-OM", "Watkins Glen-Odessa Montour"))
It’s more complicated than that though. We need to reconstruct prelims to sort athletes into A Finals and B Finals before the finals themselves can be scored. Fortunately we have access to everyone’s Prelims_Time, although because of how SwimmeR works some of them are in the Finals_Time column. In SwimmeR the Finals_Time represents the final swim for each swimmer, which may or may not actually be in the finals. It’s confusing I know and I apologize, but it’s important to the underlying functioning of swim_parse so that’s the way it will stay. Anyway, swimmers who have a Finals_Time but no Prelims_Time in a prelims-finals meet actually have their prelims performance listed in Finals_Time. Let’s fix our results.
We’ll also need to handle diving separately, because in diving the highest value (score) wins, whereas in swimming the lowest value (time) wins.
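As a miniature illustration with made-up marks: wrapping the diving scores in dplyr’s desc() flips the sort, so rank() still hands place 1 to the winner in both cases.

# Made-up marks, just to show the two ranking directions
rank(c(58.10, 59.40, 57.30), ties.method = "min")          # swimming: lowest time gets place 1
rank(desc(c(267.10, 258.05, 254.70)), ties.method = "min") # diving: highest score gets place 1

With that in mind, here’s the fix and the reconstructed prelims places.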
C_Comp <- C_Comp %>%
  mutate(Prelims_NA = case_when(is.na(Prelims_Time) ~ 1,
                                TRUE ~ 0)) %>%
  mutate(Prelims_Time = case_when(is.na(Prelims_Time) ~ Finals_Time,
                                  TRUE ~ Prelims_Time)) %>%
  mutate(Finals_Time = case_when(Prelims_NA > 0 ~ "NA",
                                 TRUE ~ Finals_Time)) %>%
  na_if("NA") %>%
  select(-Prelims_NA)

C_Comp_Diving <- C_Comp %>%
  filter(str_detect(Event, "Diving") == TRUE) %>%
  mutate(Prelims_Time = as.numeric(Prelims_Time)) %>%
  mutate(
    Place = rank(desc(Prelims_Time), ties.method = "min"), # highest score gets rank 1
    Prelims_Time = as.character(Prelims_Time)
  )

C_Comp_Swimming <- C_Comp %>%
  filter(str_detect(Event, "Diving") == FALSE) %>%
  filter(DQ < 1) %>%
  mutate(Prelims_Time_sec = SwimmeR::sec_format(Prelims_Time)) %>% # time as seconds
  group_by(Event) %>%
  mutate(Place = rank(Prelims_Time_sec, ties.method = "min")) %>% # places, low number wins
  arrange(Place)

C_Comp_Prelims <- C_Comp_Diving %>%
  bind_rows(C_Comp_Swimming)
Great, now we have our meet, with athletes sorted into A and B finals by place, with 8th place as the cutoff for the A finals. We can use results_score to rescore the meet on the basis of Finals_Time.
C_Finals_Score <- C_Comp_Prelims %>%
  results_score(
    events = unique(C_2021$Event),
    meet_type = "prelims_finals",
    lanes = 8,
    point_values = c(20, 17, 16, 15, 14, 13, 12, 11, 9, 7, 6, 5, 4, 3, 2, 1),
    scoring_heats = 2
  )

C_Finals_Score %>%
  group_by(Team) %>%
  summarise(Score = sum(Points, na.rm = TRUE)) %>%
  mutate(Place = round(rank(desc(Score), ties.method = "min"), 1)) %>%
  arrange(desc(Score)) %>%
  select(Place, everything()) %>%
  flextable_style()
Place | Team | Score |
1 | Dryden | 447 |
2 | Watkins Glen-Odessa Montour | 435 |
3 | Lansing | 325 |
4 | Chenango Forks | 224 |
4 | Southern Cayuga | 224 |
6 | Notre Dame-Elmira | 174 |
7 | Greene | 101 |
8 | Susquehanna Valley | 88 |
9 | Whitney Point | 15 |
My friends are vindicated again. Dryden 2006 would have indeed won the 2021 meet – although only by 12 points. The real question is how did they know?
In Conclusion
We’ve come up with a quantitative framework for determining the speed and depth of a swim meet, which I’m sure will come in handy somewhere. We’ve also made use of some SwimmeR functions and lots of tidyverse functions and hopefully learned a little bit on the way. Join us next time back here at Swimming + Data Science where we’ll be doing something else fun.