Predicting the Unpredictable- Analysing Rowing in Cambridge pt. 1
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In my free time away from PhD and data science work, I (used to) enjoy rowing. Aside from obvious benefits like socialising, providing a (very intense) workout, seeing the outdoors at least a few times a week… there are really two things that I love(d) about rowing:
- It’s the sport that is closest to a simple engineering problem. Going fast basically boils down to how in time and how hard you can get 1-8 guys to move an oar through the water. Realistically, you could probably model how good a boat of guys will row just by tracking them on a rowing machine (and I have suspicions that this is what British Rowing etc. do for national teams).
- I learnt to row as an undergraduate at Oxford, and really got serious about it as a postgraduate student at Cambridge. This might seem like a irrelevant detail but it’s not.
How rowing usually works
Generally when racing boats, some n number of rowing crews line up alongside each other, and row straight down a lake (usually ~2km). The first boat to cross the finish line is generally considered the winner. For an example of such a race, see this Olympic final from 2012:
You might notice that there are four men in each boat here, each of whom are rowing. This works well on a reservoir where this race was held, but not so well on (e.g.) the River Cam that flows through Cambridge, which is both a pretty thin river, and has lots of tight corners.
How rowing in Cambridge/Oxford works
Instead of these rivers, boat typically contain 8 rowers, and one cox, who is responsible for steering the boat. In lieu of the space needed to row side by side, various races across the year are run as time trials down a portion of the river. The real highlight of the year however, are two four day competitions in which crews line up one-behind-the-other and attempt to chase down and ‘bump’ the crew ahead (before being chased down themselves).
Upon hitting the boat that starts ahead, the two crews switch places the next day and then the race is run again, until hopefully, the positions roughly reflect the speeds of the boats.
Predicting bumps races
Generally therefore, if two boats line up for a bumps race, the faster one should catch the slower boat (or if the positions are reversed, the faster boat should fail to be caught by the slower boat behind). It should be fairly easy to predict bumps races, but it isn’t. The nature of the relative inexperience of lots of crews, the panic of the races, and the pretty tight course means mistakes are made early and often.
However, I wanted to see how possible it was. The only real data to train predictions on are the time-trial races that happen before bumps, so I’m going to see how well it’s possible to model a bumps race using the implied speeds of crews from these previous time trials.
Libraries and Data
For this post, I’m only going to do some simple munging and logistic regression, so I only need the (new) version of the Tidyverse (as I’m also going to play with pivot_longer and pivot_shorter for the first time).
library(tidyverse)
The data comes from my own scraping of race results on the river cam over the last ten years. I’ll eventually package this up properly. For now it can be found at my Github. Today I’ll just read in the raw .csv files.
#download the raw data #wil lbe packages eventually race_results <- read.csv("https://raw.githubusercontent.com/RobWHickman/CamStroker/master/data-raw/cambridge_race_results.csv", stringsAsFactors = FALSE) bumps_results <- read.csv("https://raw.githubusercontent.com/RobWHickman/CamStroker/master/data-raw/cambridge_bumps_results.csv", stringsAsFactors = FALSE)
Data Munging
We then want to lengthen out the bumps data by days to squeeze as much data as possible out of possible combinations of boats we have data for. I need to line boats up by the start position each day, so I also init a column for this at the end
bumps_long <- bumps_results %>% #pivot bumps results to longer so we can model each day of racing pivot_longer(., starts_with("Day"), names_to = "Day", values_to = "Bump") %>% mutate(Bump = case_when( is.na(Bump) ~ 0, TRUE ~ as.numeric(Bump) )) %>% group_by(Competition, College, Year, Crew, Gender) %>% #calculate day start and end positions mutate(day_end = StartPos - cumsum(Bump)) %>% mutate(day_start = day_end + Bump) head(bumps_long) ## # A tibble: 6 x 10 ## # Groups: Competition, College, Year, Crew, Gender [2] ## Competition College Year Crew Gender StartPos Day Bump day_end ## <chr> <chr> <int> <int> <chr> <int> <chr> <dbl> <dbl> ## 1 Lent Caius 2016 NA M 1 Day1 0 1 ## 2 Lent Caius 2016 NA M 1 Day2 0 1 ## 3 Lent Caius 2016 NA M 1 Day3 0 1 ## 4 Lent Caius 2016 NA M 1 Day4 0 1 ## 5 Lent Downing 2016 NA M 2 Day1 0 2 ## 6 Lent Downing 2016 NA M 2 Day2 0 2 ## # ... with 1 more variable: day_start <dbl>
And we want to do the opposite for the race data so we can efficiently join this onto the bumps data. As the speed of the crew is all we care about I calculate this as the course distance / seconds taken to get an idea of roughly how fast each crew is.
race_wide <- race_results %>% #calculate implied racing speed mutate(race_id = paste(race, leg), av_speed = distance / seconds) %>% select(Year = year, College = college, Crew = crew, Gender = gender, race_id, av_speed) %>% pivot_wider(., id_cols = c("Year", "College", "Crew", "Gender"), names_from = race_id, values_from = av_speed) %>% #rename to tidy up rename(NSC = `Newnham Short Course NA`, Frbrn = `Fairbairns NA`, WH2H1 = `Winter Head 2 Head leg1`, WH2H2 = `Winter Head 2 Head leg2`, Rbnsn = `Robinson Head NA`) %>% #not much data for Robinson regatta so leave out select(-Rbnsn) head(race_wide) ## # A tibble: 6 x 8 ## Year College Crew Gender NSC Frbrn WH2H1 WH2H2 ## <int> <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 2010 Catz 1 M 5.42 NA NA NA ## 2 2010 Robinson 1 M 5.41 NA NA NA ## 3 2010 Sidney 1 M 5.39 NA NA NA ## 4 2010 Caius 1 M 5.38 NA NA NA ## 5 2010 Girton 1 M 5.22 NA NA NA ## 6 2010 Churchill 1 M 5.21 NA NA NA
Then we simply join the data and calculate the implied speed differential between two crew who start a bumps race behind each other.
regression_data <- bumps_long %>% ungroup() %>% #join datasets left_join(race_wide, by = c("Year", "College", "Crew", "Gender")) %>% group_by(Competition, Year, Gender, Day) %>% arrange(Competition, Year, Gender, Day, day_start) %>% #calculate speed difference between boats starting bumps data behind each other mutate(frbrn_diff = Frbrn - lag(Frbrn), NSC_diff = NSC - lag(NSC), WH2H1_diff = WH2H1 - lag(WH2H1), WH2H2_diff = WH2H2 - lag(WH2H2)) %>% select(-NSC, -Frbrn, -WH2H1, -WH2H2) %>% #pivot longer for plotting pivot_longer(., frbrn_diff:WH2H2_diff, names_to = "race", values_to = "speed_difference") %>% filter(!is.na(speed_difference)) %>% #tidy up plotting data filter(Bump %in% c(1, 0)) %>% filter(Competition == "Lent") head(regression_data) ## # A tibble: 6 x 12 ## # Groups: Competition, Year, Gender, Day [2] ## Competition College Year Crew Gender StartPos Day Bump day_end ## <chr> <chr> <int> <int> <chr> <int> <chr> <dbl> <dbl> ## 1 Lent Sidney 2010 2 F 40 Day1 1 39 ## 2 Lent Christs 2010 2 F 41 Day1 0 41 ## 3 Lent Sidney 2010 2 F 40 Day2 0 39 ## 4 Lent Christs 2010 2 F 41 Day2 1 40 ## 5 Lent Magdal~ 2010 2 F 46 Day2 0 45 ## 6 Lent Newnham 2010 3 F 52 Day2 0 51 ## # ... with 3 more variables: day_start <dbl>, race <chr>, ## # speed_difference <dbl>
Plotting Data
We can then plot the data and see if the speed differential of races earlier in theyear is a useful predictor of bumping a boat ahead of you. We can model this as a logistic problem where bumping is either a 1 (to catch the boat ahead) or a 0 (did not catch). This does cut out some data in weird ways that I’ll get onto in later posts, but will do for now.
For the logistic regression I use geom_smooth and a binomial generalised linear model. Again, there’s more we can explore here, but this is just a quick intro post so we won’t worry about standrad error etc. I also split out by Male/Female crews as I imagine gender might play a role.
Given that this week is the first race of the 2019/2020 calendar (Fairbairns) I first limited myself to only data from that race.
p1 <- regression_data %>% #filter only Fairbairns results filter(race == "frbrn_diff") %>% ggplot(., aes(x = speed_difference, y = Bump)) + geom_point() + #model as a logistic event geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) + facet_wrap(~Gender, scales = "free_x") + theme_minimal() #plot p1
And it seems being faster than a boat ahead of you does increase your chance of bumping, but not a huge amount.
There’s good reason to beleive Fairbairns regatta might not be the best predictor of performance later in the year. It’s the first race, where many collges are still testing out their crew. It also takes place during a weekday, so many students cannot take part, and is a 4.5km race, instead of the usual 2km of later races and bumps itself.
If we look at how all races predict later bumps success we can see much nicer logistic curves:
#do the same but for all race data p2 <- regression_data %>% ggplot(., aes(x = speed_difference, y = Bump)) + geom_point() + geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) + geom_vline(xintercept = 0, colour = "red", linetype = "dashed") + facet_grid(race~Gender, scales = "free_x") + theme_minimal() p2
Especially Newnham Short Course and the two legs of Winter Head 2 Head show nice curves where boats that are faster on these races have a greater chance of bumping later in the year.
There’s a lot more to do to properly model a bumps regatta, but the first step of validating our ideas and data seems to show promising results!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.