Grouped Sequences in dplyr
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
For a piece of work I had to calculate the number of matches that a team plays away from home in a row, which we will call days_on_the_road
. I was not sure how to do this with dplyr
but it’s basically a ‘grouped sequence’. For this post, I’ve created some dummy data to illustrate this idea. The num_matches_away
variable is what we want to mimic using some data manipulation.
library(tidyverse) sports_df <- tibble::tribble( ~team, ~date, ~home_or_away, ~num_matches_away, "Team A", "07/10/2022", "H", 0L, "Team A", "14/10/2022", "A", 1L, "Team A", "21/10/2022", "A", 2L, "Team A", "28/10/2022", "H", 0L, "Team A", "04/11/2022", "A", 1L, "Team A", "11/11/2022", "A", 2L, "Team A", "18/11/2022", "A", 3L, "Team A", "25/11/2022", "H", 0L, "Team A", "02/12/2022", "H", 0L, "Team A", "09/12/2022", "H", 0L, "Team B", "07/10/2022", "H", 0L, "Team B", "14/10/2022", "A", 1L, "Team B", "21/10/2022", "A", 2L, "Team B", "28/10/2022", "A", 3L, "Team B", "04/11/2022", "H", 0L, "Team B", "11/11/2022", "A", 1L, "Team B", "18/11/2022", "A", 2L, "Team B", "25/11/2022", "H", 0L, "Team B", "02/12/2022", "H", 0L, "Team B", "09/12/2022", "A", 1L ) %>% mutate(date = as.Date(date, "%d/%m/%Y")) %>% arrange(team, date) sports_df ## # A tibble: 20 × 4 ## team date home_or_away num_matches_away ## <chr> <date> <chr> <int> ## 1 Team A 2022-10-07 H 0 ## 2 Team A 2022-10-14 A 1 ## 3 Team A 2022-10-21 A 2 ## 4 Team A 2022-10-28 H 0 ## 5 Team A 2022-11-04 A 1 ## 6 Team A 2022-11-11 A 2 ## 7 Team A 2022-11-18 A 3 ## 8 Team A 2022-11-25 H 0 ## 9 Team A 2022-12-02 H 0 ## 10 Team A 2022-12-09 H 0 ## 11 Team B 2022-10-07 H 0 ## 12 Team B 2022-10-14 A 1 ## 13 Team B 2022-10-21 A 2 ## 14 Team B 2022-10-28 A 3 ## 15 Team B 2022-11-04 H 0 ## 16 Team B 2022-11-11 A 1 ## 17 Team B 2022-11-18 A 2 ## 18 Team B 2022-11-25 H 0 ## 19 Team B 2022-12-02 H 0 ## 20 Team B 2022-12-09 A 1
I firstly came up with a complicated solution for this using a combination of slider::slide()
and rle()
(run length encoding).
sports_df2 <- sports_df %>% group_by(team) %>% mutate(days_on_road = unlist(slider::slide(rle(home_or_away)$length, ~1:.x)), days_on_road = if_else(home_or_away == "H", 0L, days_on_road)) %>% ungroup() identical(sports_df2$days_on_road, sports_df$num_matches_away) ## [1] TRUE
These are some pretty cool functions and it was nice to know rle()
existed but I was never really happy with this solution as it seemed overly complex and it’s difficult to understand what the code is doing by simply reading it. So I asked a colleague to try to solve this problem and they came up with a better solution which I’m grateful for! It involves using a combination of group_by()
and seq_len()
which is a whole lot simpler to understand in my opinion.
sports_df3 <- sports_df2 %>% mutate(away = cumsum(home_or_away == "H")) %>% group_by(team, home_or_away, away) %>% mutate(days_on_road = seq_len(n())) %>% ungroup() %>% mutate(days_on_road = if_else(home_or_away == "H", 0L, days_on_road)) identical(sports_df3$days_on_road, sports_df$num_matches_away) ## [1] TRUE
I hope this is useful for some of you out there coding with R!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.