Site icon R-bloggers

Grouped Sequences in dplyr

[This article was first published on R on Alan Yeung, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For a piece of work I had to calculate the number of matches that a team plays away from home in a row, which we will call days_on_the_road. I was not sure how to do this with dplyr but it’s basically a ‘grouped sequence’. For this post, I’ve created some dummy data to illustrate this idea. The num_matches_away variable is what we want to mimic using some data manipulation.

library(tidyverse)

sports_df <- tibble::tribble(
  ~team,    ~date,        ~home_or_away, ~num_matches_away,
  "Team A", "07/10/2022", "H",           0L,
  "Team A", "14/10/2022", "A",           1L,
  "Team A", "21/10/2022", "A",           2L,
  "Team A", "28/10/2022", "H",           0L,
  "Team A", "04/11/2022", "A",           1L,
  "Team A", "11/11/2022", "A",           2L,
  "Team A", "18/11/2022", "A",           3L,
  "Team A", "25/11/2022", "H",           0L,
  "Team A", "02/12/2022", "H",           0L,
  "Team A", "09/12/2022", "H",           0L,
  "Team B", "07/10/2022", "H",           0L,
  "Team B", "14/10/2022", "A",           1L,
  "Team B", "21/10/2022", "A",           2L,
  "Team B", "28/10/2022", "A",           3L,
  "Team B", "04/11/2022", "H",           0L,
  "Team B", "11/11/2022", "A",           1L,
  "Team B", "18/11/2022", "A",           2L,
  "Team B", "25/11/2022", "H",           0L,
  "Team B", "02/12/2022", "H",           0L,
  "Team B", "09/12/2022", "A",           1L
) %>%      
  mutate(date = as.Date(date, "%d/%m/%Y")) %>% 
  arrange(team, date)

sports_df
## # A tibble: 20 × 4
##    team   date       home_or_away num_matches_away
##    <chr>  <date>     <chr>                   <int>
##  1 Team A 2022-10-07 H                           0
##  2 Team A 2022-10-14 A                           1
##  3 Team A 2022-10-21 A                           2
##  4 Team A 2022-10-28 H                           0
##  5 Team A 2022-11-04 A                           1
##  6 Team A 2022-11-11 A                           2
##  7 Team A 2022-11-18 A                           3
##  8 Team A 2022-11-25 H                           0
##  9 Team A 2022-12-02 H                           0
## 10 Team A 2022-12-09 H                           0
## 11 Team B 2022-10-07 H                           0
## 12 Team B 2022-10-14 A                           1
## 13 Team B 2022-10-21 A                           2
## 14 Team B 2022-10-28 A                           3
## 15 Team B 2022-11-04 H                           0
## 16 Team B 2022-11-11 A                           1
## 17 Team B 2022-11-18 A                           2
## 18 Team B 2022-11-25 H                           0
## 19 Team B 2022-12-02 H                           0
## 20 Team B 2022-12-09 A                           1

I firstly came up with a complicated solution for this using a combination of slider::slide() and rle() (run length encoding).

sports_df2 <- sports_df %>% 
  group_by(team) %>% 
  mutate(days_on_road = unlist(slider::slide(rle(home_or_away)$length, ~1:.x)),
         days_on_road = if_else(home_or_away == "H", 0L, days_on_road)) %>% 
  ungroup()

identical(sports_df2$days_on_road, sports_df$num_matches_away)
## [1] TRUE

These are some pretty cool functions and it was nice to know rle() existed but I was never really happy with this solution as it seemed overly complex and it’s difficult to understand what the code is doing by simply reading it. So I asked a colleague to try to solve this problem and they came up with a better solution which I’m grateful for! It involves using a combination of group_by() and seq_len() which is a whole lot simpler to understand in my opinion.

sports_df3 <- sports_df2 %>%
  mutate(away = cumsum(home_or_away == "H")) %>%
  group_by(team, home_or_away, away) %>%
  mutate(days_on_road = seq_len(n())) %>%
  ungroup() %>%
  mutate(days_on_road = if_else(home_or_away == "H", 0L, days_on_road))

identical(sports_df3$days_on_road, sports_df$num_matches_away)
## [1] TRUE

I hope this is useful for some of you out there coding with R!

To leave a comment for the author, please follow the link and comment on their blog: R on Alan Yeung.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version