Expected goals, gamestate, and predictiveness

[This article was first published on Tony's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

“The best predictor for future performance is Expected Goals”, authored by Sander Ijtsma of 11tegen11 in 2015, is one of the most notable articles in public soccer analytics. The write-up provided compelling evidence for the superiority of expected goals (xG) in terms of forecasting team-season outcomes on a game-by-game basis.12 Ijtsma drew out curves for running R-squared values of a handful of season-to-date (“past”) metrics with respect to rest-of-season (“future”) performance indicators–goals ratio and points per game–and found that xG ratio3 tended to be the most predictive.

In 2022, Eliot McKinley wrote a cool piece for American Soccer Analytics (ASA) where he replicated and extended Ijtsma’s analysis. He used a larger data set of more recent seasons–2018 through 2022– and added a look at Major League Soccer (MLS) in addition to the the Big 5 European Leagues, and adding. Further, Eliot presented a novel aspect where he bootstrapped the procedure for generating running R-squared values to generate smoother, more interpretable curves.

In this post I extend this prior art in two ways:

  1. I expand the data set range to 2018 through 2024, sourcing from a reputed public data source–FBref.4 The intention is to make the analysis even more robust and reproducible.
  2. I include additional curves for “gamestate-aware” goals and xG, in an effort to see if one can achieve metrics that are perhaps even more predictive than xG itself.

On the second point, Ijtsma noted that subsetting the past performance tallies to gamestates when the score is either tied or within 1 goal might abate the potential biases introduced by within-match complacency (manifesting in conservative tactics). In fact, he said he had evaluated these things.

I’ve looked at this, but I have decided not to include that subanalysis in this post, for the sake of accessibility… I think this method will show that the phenomenon [where teams exert more or less effort in matches against particular opposition] either does not truly exist, or that its effect is so small that correcting for this will allow more noise and thereby weaken the model.

Unfortunately, I couldn’t find any subsequent blog post where Ijtsma writes about the role of gamestate in the context of his R-squared analysis, so, alas, I’m here to do exactly that.

Now, my point isn’t necessarily to find the absolute best metric possible to use to forecast future team performance. Tiotal Football (with assistance from Eliot) showed that comprehensive event-based metrics like goals added (light blue), completed passes into the area in front of the box (yellow), etc. are even more predictive of future points per game than xG, at least in the MLS.

I do look at a few metrics that go beyond Ijtsma’s set in the Appendix, but that is not the primary focus of this post.

Data

I use {worldfootballR} to retrieve match-level figures and shot logs from FBref. As noted in the intro, we’re going to be working with Big 5 and MLS data for the 2018 – 2024 (season-ending) seasons.

Code Setup
## data ingestion
library(worldfootballR)
library(dplyr)
library(stringr)
library(tidyr)

## plotting
library(ggplot2)
library(sysfonts)
library(showtext)
library(ggtext)
library(htmltools)

## local dev
library(withr)
library(qs)

## parallelization for bootstrap calcs
library(parallel)
library(future)
library(furrr)

PROJ_DIR <- 'posts/xg-predictor-future-results'

## Extract the from "47880eb7" from "https://fbref.com/en/matches/47880eb7/Liverpool-Manchester-City-November-10-2019-Premier-League"
extract_fbref_match_id <- function(match_url) {
  basename(dirname(match_url))
}
Data Setup
## Discrepancies in worldfootballR's saved data
fix_team <- function(x) {
  dplyr::case_match(
    x,
    'Montreal Impact' ~ 'CF Montréal',
    'Sporting Kansas City' ~ 'Sporting KC',
    .default = x
  )
}

COUNTRY <- c('ENG', 'ESP', 'FRA', 'GER', 'ITA', 'USA')
GENDER <- 'M'
TIER <- '1st'
SEASON_END_YEAR <- c(2018:2024)

## for match-level shot data
raw_team_summary <- worldfootballR::load_fb_advanced_match_stats(
  country = COUNTRY,
  gender = GENDER,
  tier = TIER,
  season_end_year = SEASON_END_YEAR,
  stat_type = 'summary',
  team_or_player = 'team'
)
#> → Data last updated 2024-05-21 18:39:59 UTC

raw_team_passing <- worldfootballR::load_fb_advanced_match_stats(
  country = COUNTRY,
  gender = GENDER,
  tier = TIER,
  season_end_year = SEASON_END_YEAR,
  stat_type = 'passing',
  team_or_player = 'team'
)
#> → Data last updated 2024-05-21 18:39:59 UTC

## for shot-level data for gamestate calcs
raw_shots <- worldfootballR::load_fb_match_shooting(
  country = COUNTRY,
  gender = GENDER,
  tier = TIER,
  season_end_year = SEASON_END_YEAR
)
#> → Data last updated 2024-05-21 17:52:12 UTC

## for team info for gamestate calcs
raw_match_summaries <- worldfootballR::load_fb_match_summary(
  country = COUNTRY,
  gender = GENDER,
  tier = TIER,
  season_end_year = SEASON_END_YEAR
)
#> → Data last updated 2024-05-21 18:14:02 UTC
Match-level data ingestion
## Known/valid matches with missing xG
## - https://fbref.com/en/matches/e0a20cfe/Hellas-Verona-Roma-September-19-2020-Serie-A: awarded to Hellas Verona
## https://fbref.com/en/matches/c34bbc21/Bochum-Monchengladbach-March-18-2022-Bundesliga: awarded to Monchengladbach
drop_bad_matches <- function(df) {
  df |> 
    dplyr::filter(
    ## choose non-playoff/relegation weeks
    (Country != 'USA' & stringr::str_detect(Matchweek, 'Matchweek')) | (Country == 'USA' & Matchweek == 'Major League Soccer (Regular Season)')
  ) |> 
  ## Fix effectively dup records due to change in MatchURLs
  dplyr::filter(!(Country == 'GER' & Season_End_Year == 2024 & stringr::str_detect(MatchURL, 'Leverkusen') & !stringr::str_detect(MatchURL, 'Bayer-Leverkusen'))) |> 
  ##  Drop MLS 2024 since it's incomplete at time of writing
  dplyr::filter(
    !(Country == 'USA' & Season_End_Year == 2024)
  ) |> 
  ## Drop COVID-19-affected seasonf for all leagues
  ##  (Ligue 1 is espeically odd, as each team has about 7 games missing.)
  dplyr::filter(
    Season_End_Year != 2020
  )
}

team_summary <- raw_team_summary |>
  drop_bad_matches() |> 
  dplyr::transmute(
    season = Season_End_Year,
    country = Country,
    gender = Gender,
    tier = Tier,
    # match_week = stringr::str_extract(Matchweek, '[0-9]+') |> as.integer(), ## won't works for MLS
    ## We'll use this to define our own notion of match week, where we order games
    ##   by date rather than use the FBref matchweek provided at face value, since a
    ##   a match could have been rescheduled, leading to situations where the labeled
    ##   matchweek 2 comes before the matchweek 1, for example.
    date = Match_Date, 
    match_id = extract_fbref_match_id(MatchURL),
    team = Team,
    is_home = Home_Away == 'Home',
    g = Gls,
    xg = xG_Expected,
    shots = Sh,
    sot = SoT
  ) |> 
  ## passing metrics for further exploration
  dplyr::left_join(
    raw_team_passing |> 
      drop_bad_matches() |> 
      dplyr::transmute(
        match_id = extract_fbref_match_id(MatchURL),
        team = Team,
        ppa = PPA,
        xag = xAG
      ),
    by = dplyr::join_by(match_id, team)
  ) |> 
  dplyr::mutate(
    xg_xag = dplyr::coalesce(xg, 0) + dplyr::coalesce(xag, 0)
  ) |> 
  dplyr::group_by(season, team) |> 
  dplyr::mutate(
    season_game_count = dplyr::n(),
    game_idx = dplyr::row_number(date)
  ) |> 
  dplyr::ungroup()

combined_team_summary <- dplyr::left_join(
  team_summary,
  team_summary |>
    dplyr::rename_with(
      \(.x) paste0(.x, '_conceded'),
      c(
        g, xg, shots, sot,
        ppa, xag, xg_xag
      )
    ) |> 
    dplyr::select(
      match_id,
      opponent = team,
      dplyr::ends_with('conceded')
      
    ),
  by = dplyr::join_by(match_id),
  relationship = 'many-to-many'
) |> 
  dplyr::filter(team != opponent) |> 
  dplyr::mutate(
    pts = dplyr::case_when(
      g > g_conceded ~ 3L,
      g < g_conceded ~ 0L,
      g == g_conceded ~ 1L
    ),
    pts_conceded = dplyr::case_when(
      g > g_conceded ~ 0L,
      g < g_conceded ~ 3L,
      g == g_conceded ~ 1L
    )
  ) |> 
  dplyr::arrange(team, season, date)
dplyr::glimpse(combined_team_summary)
#> Rows: 25,416
#> Columns: 27
#> $ season            <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ country           <chr> "FRA", "FRA", "FRA", "FRA", "FRA", "FRA", "FRA", …
#> $ gender            <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",…
#> $ tier              <chr> "1st", "1st", "1st", "1st", "1st", "1st", "1st", …
#> $ date              <chr> "2022-08-05", "2022-08-14", "2022-08-21", "2022-0…
#> $ match_id          <chr> "47ab569b", "8b559e8f", "9befca9a", "d8aa49d4", "…
#> $ team              <chr> "Ajaccio", "Ajaccio", "Ajaccio", "Ajaccio", "Ajac…
#> $ is_home           <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALS…
#> $ g                 <dbl> 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 4, 2, 1, 0…
#> $ xg                <dbl> 1.5, 0.5, 1.4, 1.7, 1.2, 0.5, 1.2, 0.5, 1.5, 0.2,…
#> $ shots             <dbl> 7, 9, 9, 14, 15, 16, 9, 6, 12, 6, 14, 8, 5, 9, 12…
#> $ sot               <dbl> 3, 2, 2, 2, 0, 3, 3, 4, 3, 1, 6, 1, 1, 5, 3, 2, 0…
#> $ ppa               <dbl> 6, 4, 3, 5, 7, 11, 9, 4, 5, 2, 8, 3, 5, 8, 6, 7, …
#> $ xag               <dbl> 0.4, 0.3, 0.6, 0.7, 0.7, 0.4, 1.2, 0.4, 0.7, 0.1,…
#> $ xg_xag            <dbl> 1.9, 0.8, 2.0, 2.4, 1.9, 0.9, 2.4, 0.9, 2.2, 0.3,…
#> $ season_game_count <int> 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 3…
#> $ game_idx          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
#> $ opponent          <chr> "Lyon", "Lens", "Rennes", "Lille", "Montpellier",…
#> $ g_conceded        <dbl> 2, 0, 2, 3, 2, 1, 1, 0, 3, 1, 1, 3, 1, 2, 2, 0, 2…
#> $ xg_conceded       <dbl> 1.3, 1.4, 1.2, 1.7, 1.7, 0.8, 1.3, 0.5, 1.0, 1.8,…
#> $ shots_conceded    <dbl> 10, 12, 16, 6, 9, 7, 12, 8, 7, 9, 9, 11, 18, 9, 1…
#> $ sot_conceded      <dbl> 5, 3, 7, 4, 4, 4, 3, 4, 3, 1, 2, 6, 5, 5, 5, 4, 4…
#> $ ppa_conceded      <dbl> 7, 13, 5, 7, 5, 3, 5, 7, 6, 12, 7, 16, 3, 7, 5, 9…
#> $ xag_conceded      <dbl> 0.5, 1.2, 0.8, 0.7, 1.7, 0.7, 1.1, 0.3, 1.0, 0.8,…
#> $ xg_xag_conceded   <dbl> 1.8, 2.6, 2.0, 2.4, 3.4, 1.5, 2.4, 0.8, 2.0, 2.6,…
#> $ pts               <int> 0, 1, 0, 0, 0, 0, 0, 3, 0, 1, 1, 0, 0, 3, 1, 3, 0…
#> $ pts_conceded      <int> 3, 1, 3, 3, 3, 3, 3, 0, 3, 1, 1, 3, 3, 0, 1, 0, 3…
Gamestate data ingestion
match_summaries <- raw_match_summaries |> 
  dplyr::group_by(MatchURL) |> 
  dplyr::mutate(
    match_summary_rn = dplyr::row_number(dplyr::desc(Event_Time)),
    match_has_no_penalties = all(Event_Type != 'Penalty')
  ) |> 
  dplyr::ungroup() |> 
  dplyr::mutate(
    match_has_no_goals = Away_Score == 0 & Home_Score == 0
  ) |> 
  ## Drop non-shot events, e.g. card and substitution events. 
  ##   Always keep the first timeline event, so that we're not accidentally dropping matches.
  dplyr::filter(
    Event_Type %in% c('Goal', 'Own Goal', 'Penalty') | 
      ## don't drop games with no goals
      (match_has_no_goals & match_has_no_penalties & match_summary_rn == 1)
  ) |> 
  dplyr::transmute(
    season = Season_End_Year,
    country = Country,
    gender = Gender,
    tier = Tier,
    match_week = stringr::str_extract(Matchweek, '[0-9]+') |> as.integer(),
    match_id = extract_fbref_match_id(MatchURL),
    date = lubridate::ymd(Match_Date),
    home_team = fix_team(Home_Team) ,
    away_team = fix_team(Away_Team),
    period = as.integer(Event_Half),
    ## ensure that minutes always has a value
    minutes = dplyr::case_when(
      period == 1L & Event_Time > 45L ~ 45L, 
      period == 2L & Event_Time > 90L ~ 90L,
      .default = Event_Time
    ) |> as.integer(),
    minutes_added = dplyr::case_when(
      period == 1L & Event_Time > 45 ~ Event_Time - 45L, 
      period == 2L & Event_Time > 90 ~ Event_Time - 90L,
      .default = NA_integer_
    ),
    home_g = as.integer(gsub('[:].*$', '', Score_Progression)), ## after event
    away_g = as.integer(gsub('^.*[:]', '', Score_Progression)),
    is_own_goal = Event_Type == 'Own Goal',
    team = Team,
    player = Event_Players
  )

deduped_match_summaries <- match_summaries |> 
  ## Some matches are recorded twice if they were rescheduled
  dplyr::semi_join(
    match_summaries |> 
      dplyr::distinct(match_id, date) |> 
      dplyr::group_by(match_id) |> 
      dplyr::slice_max(date, n = 1) |> 
      dplyr::ungroup()
  )

shots <- raw_shots |> 
  dplyr::transmute(
    season = Season_End_Year,
    country = Country,
    gender = Gender,
    tier = Tier,
    match_id = extract_fbref_match_id(MatchURL),
    period = as.integer(Match_Half),
    ## convert "45+2" to "45"
    minutes = ifelse(
      grepl('[+]', Minute),
      as.integer(gsub('(^[0-9]+)[+]([0-9]+$)', '\\1', Minute)), 
      as.integer(Minute)
    ),
    ## convert "45+2" to "2"
    minutes_added = ifelse(
      grepl('[+]', Minute), 
      as.integer(gsub('(^[0-9]+)[+]([0-9]+$)', '\\2', Minute)), 
      NA_integer_
    ),
    is_home = Home_Away == 'Home',
    team = fix_team(Squad),
    player = Player,
    is_goal = Outcome == 'Goal',
    is_on_target = !is.na(PSxG),
    xg = as.double(xG),
    xgot = as.double(PSxG)
  )

shots_with_own_goals <- dplyr::bind_rows(
  shots |> 
    dplyr::transmute(
      match_id,
      period,
      minutes,
      minutes_added,
      is_home,
      team,
      player,
      is_goal,
      xg,
      xgot,
      is_on_target,
      is_own_goal = FALSE
    ),
  ## synthetic events for own goals
  deduped_match_summaries |> 
    dplyr::filter(
      is_own_goal
    ) |> 
    dplyr::transmute(
      match_id,
      period,
      minutes,
      minutes_added,
      is_home = team == home_team,
      team,
      player,
      is_goal = TRUE,
      xg = NA_real_,
      xgot = NA_real_,
      is_on_target = NA,
      is_own_goal = TRUE
    )
)

clean_shots <- shots_with_own_goals |> 
  ## To get meta-information about the game
  dplyr::inner_join(
    deduped_match_summaries |>
      dplyr::distinct(match_id, home_team, away_team),
    by = dplyr::join_by(match_id),
    relationship = 'many-to-one'
  ) |> 
  dplyr::mutate(
    home_g = dplyr::case_when(
      ## Note that fotmob would list the away team for an own goal but fbref 
      ##   lists the home team
      (is_goal | is_own_goal) & is_home ~ 1L,
      is_own_goal & is_home ~ 1L,
      TRUE ~ 0L
    ),
    away_g = dplyr::case_when(
      (is_goal | is_own_goal) & !is_home ~ 1L,
      TRUE ~ 0L
    ),
    home_xg = dplyr::case_when(
      is_home ~ dplyr::coalesce(xg, 0),
      TRUE ~ 0 ## even for own goals
    ),
    away_xg = dplyr::case_when(
      !is_home ~ dplyr::coalesce(xg, 0),
      TRUE ~ 0
    ),
    home_xgot = dplyr::case_when(
      is_home ~ dplyr::coalesce(xgot, 0),
      TRUE ~ 0
    ),
    away_xgot = dplyr::case_when(
      !is_home ~ dplyr::coalesce(xgot, 0),
      TRUE ~ 0
    )
  ) |>
  dplyr::select(
    match_id,
    period, minutes, minutes_added,
    is_home, is_goal, is_on_target, is_own_goal,
    player,
    home_team, away_team,
    home_g, away_g,
    home_xg, away_xg,
    home_xgot, away_xgot
  ) |> 
  ## Still need to distinct because, even with fixing team names, there have been FBref MatchURL changes
  dplyr::distinct() |> 
  dplyr::group_by(match_id) |>
  ## Differentiate between shots in the same minute.
  dplyr::mutate(
    shot_idx = dplyr::row_number((minutes + dplyr::coalesce(minutes_added, 0L)))
  ) |> 
  dplyr::ungroup() |> 
  dplyr::mutate(
    shot_id = sprintf('%s-%02d', match_id, shot_idx),
    .before = 1
  )

restacked_shots <- dplyr::bind_rows(
  clean_shots |> 
    dplyr::filter(is_home) |> 
    dplyr::transmute(
      shot_id, match_id,
      period, minutes, minutes_added,
      is_home, is_goal, is_on_target, is_own_goal,
      player,
      team = home_team,
      opponent = away_team,
      g = home_g,
      g_conceded = away_g,
      xg = home_xg,
      xg_conceded = away_xg,
      xgot = home_xgot,
      xgot_conceded = away_xgot
    ),
  clean_shots |> 
    dplyr::filter(!is_home) |> 
    dplyr::transmute(
      shot_id, match_id,
      period, minutes, minutes_added,
      is_home, is_goal, is_on_target, is_own_goal,
      player,
      team = away_team,
      opponent = home_team,
      g = away_g,
      g_conceded = home_g,
      xg = away_xg,
      xg_conceded = home_xg,
      xgot = away_xgot,
      xgot_conceded = home_xgot
    )
)

doublecounted_restacked_shots <- dplyr::bind_rows(
  restacked_shots |> dplyr::mutate(pov = 'primary', .before = 1),
  restacked_shots |> 
    ## re-assign to temporary variable names first, so that way we don't accidentlaly overwrite information
    dplyr::rename(
      team1 = team,
      team2 = opponent,
      g1 = g,
      g2 = g_conceded,
      xg1 = xg,
      xg2 = xg_conceded,
      xgot1 = xgot,
      xgot2 = xgot_conceded
    ) |> 
    ## then formally re-assign columns
    dplyr::rename(
      team = team2,
      opponent = team1,
      g = g2,
      g_conceded = g1,
      xg = xg2,
      xg_conceded = xg1,
      xgot = xgot2,
      xgot_conceded = xgot1
    ) |> 
    dplyr::mutate(
      is_home = !is_home
    ) |> 
    dplyr::mutate(
      pov = 'secondary',
      .before = 1
    )
) |> 
  dplyr::arrange(shot_id, pov)

cumu_doublecounted_restacked_shots <- doublecounted_restacked_shots |> 
  dplyr::group_by(
    match_id, 
    team
  ) |> 
  dplyr::mutate(
    dplyr::across(
      c(g, g_conceded),
      list(cumu = cumsum)
    )
  ) |> 
  dplyr::ungroup() |> 
  dplyr::mutate(
    gamestate = g_cumu - g_conceded_cumu
  )

gamestate_shots <- cumu_doublecounted_restacked_shots |> 
  dplyr::inner_join(
    deduped_match_summaries |> 
      dplyr::distinct(
        match_id,
        season, country, gender, tier,
        match_week, date,
        home_team, away_team
      ),
    by = dplyr::join_by(match_id)
  ) |> 
  dplyr::transmute(
    pov,
    match_id,
    season, country, gender, tier,
    match_week, date,
    home_team, away_team,
    team, player,
    shot_id,
    period, minutes, minutes_added,
    time = minutes + dplyr::coalesce(minutes_added, 0L),
    g, g_conceded,
    xg, xg_conceded,
    xgot, xgot_conceded,
    xgd = xg - xg_conceded,
    gamestate_gd0 = dplyr::case_when(
      gamestate == 0 ~ 'neutral',
      gamestate < 0 ~ 'trailing',
      gamestate > 0 ~ 'leading'
    ),
    gamestate_abs_gd1 = dplyr::case_when(
      abs(gamestate) <= 1 ~ 'neutral',
      gamestate < 1 ~ 'trailing',
      gamestate > 1 ~ 'leading'
    )
  ) |> 
  dplyr::group_by(match_id, team) |> 
  dplyr::arrange(shot_id, .by_group = TRUE) |> 
  dplyr::mutate(
    pre_shot_gamestate_gd0 = dplyr::lag(gamestate_gd0, default = 'neutral'),
    pre_shot_gamestate_abs_gd1 = dplyr::lag(gamestate_abs_gd1, default = 'neutral')
  ) |> 
  dplyr::ungroup() |> 
  dplyr::arrange(team, season, date, shot_id)

LAST_MIN_BUFFER <- 3
last_min_pad <- gamestate_shots |>
  dplyr::select(
    match_id,
    season,
    country,
    gender,
    tier,
    match_week,
    date,
    team,
    pre_shot_gamestate_gd0,
    pre_shot_gamestate_abs_gd1,
    period,
    time
  ) |> 
  dplyr::group_by(match_id, team, period) |>
  dplyr::slice_max(time, n = 1, with_ties = FALSE) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    xg = 0,
    xg_conceded = 0,
    xgot = 0,
    xgot_conceded = 0,
    last_regular_min = ifelse(period == 1L, 45L, 90L),
    time = pmax(last_regular_min + LAST_MIN_BUFFER, time + 1)
  )

padded_gamestate_shots <- dplyr::bind_rows(
  gamestate_shots,
  last_min_pad
) |> 
  dplyr::arrange(match_id, period, time)

gamestate_shots_and_durations <- padded_gamestate_shots |> 
  dplyr::group_by(match_id, team) |> 
  dplyr::mutate(
    prev_period = dplyr::lag(period),
    prev_time = dplyr::lag(time)
  ) |> 
  dplyr::ungroup() |> 
  dplyr::mutate(
    duration = dplyr::case_when(
      period == 1L & is.na(prev_period) ~ time - 0L,
      period == 2L & period != prev_period ~ time - 45L,
      TRUE ~ time - prev_time
    )
  )

aggregate_games_by_gamestate <- function(df, gamestate_col) {
  gamestate_sym <- rlang::ensym(gamestate_col)
  agg_shots_by_game_gamestate <- df |> 
    dplyr::group_by(
      country, gender, tier,
      team, 
      season,
      match_id, date,
      !!gamestate_sym,
    ) |>
    ## need all the `na.rm = TRUE`'s since i don't fill in `pov` for the padded minute records
    dplyr::summarize(
      shots = sum(pov == 'primary', na.rm = TRUE),
      shots_conceded = sum(pov == 'secondary', na.rm = TRUE),
      sot = sum(pov == 'primary' & xgot > 0, na.rm = TRUE),
      sot_conceded = sum(pov == 'primary' & xgot_conceded > 0, na.rm = TRUE),
      dplyr::across(
        c(
          g, g_conceded, xg, xg_conceded, xgot, xgot_conceded,
          duration
        ),
        \(.x) sum(.x, na.rm = TRUE)
      )
    ) |> 
    dplyr::ungroup() |> 
    dplyr::arrange(
      country, gender, tier, season,
      team, 
      season, date,
      !!gamestate_sym
    )
  
  agg_shots_by_game_gamestate |> 
    dplyr::distinct(
      country, gender, tier, season,
      team,
      date, match_id
    ) |> 
    tidyr::crossing(
      !!gamestate_sym := c('trailing', 'neutral', 'leading')
    ) |> 
    dplyr::left_join(
      agg_shots_by_game_gamestate,
      by = dplyr::join_by(
        country, gender, tier, season,
        team, 
        !!gamestate_sym,
        date, match_id
      )
    ) |> 
    dplyr::mutate(
      dplyr::across(
        c(
          shots, shots_conceded,
          sot, sot_conceded,
          g, g_conceded, 
          xg, xg_conceded, 
          xgot, xgot_conceded,
          duration
        ),
        \(.x) dplyr::coalesce(.x, 0)
      )
    ) |> 
    tidyr::pivot_wider(
      names_from = !!gamestate_sym,
      values_from = c(
        shots, shots_conceded,
        sot, sot_conceded,
        g,  g_conceded, 
        xg, xg_conceded, 
        xgot, xgot_conceded,
        duration
      ),
      names_sort = TRUE
    )
}

## output has the same format as gamestate_abs_gd1_by_game
gamestate_gd0_by_game <- gamestate_shots_and_durations |> 
  aggregate_games_by_gamestate('pre_shot_gamestate_gd0')

gamestate_abs_gd1_by_game <- gamestate_shots_and_durations |> 
  aggregate_games_by_gamestate('pre_shot_gamestate_abs_gd1')
dplyr::glimpse(gamestate_abs_gd1_by_game)
#> Rows: 29,852
#> Columns: 40
#> $ country                 <chr> "ENG", "ENG", "ENG", "ENG", "ENG", "ENG", "…
#> $ gender                  <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M"…
#> $ tier                    <chr> "1st", "1st", "1st", "1st", "1st", "1st", "…
#> $ team                    <chr> "Arsenal", "Arsenal", "Arsenal", "Arsenal",…
#> $ season                  <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2…
#> $ date                    <date> 2017-08-11, 2017-08-19, 2017-08-27, 2017-0…
#> $ match_id                <chr> "e3c3ddf0", "37f97732", "e4374144", "2fab7d…
#> $ shots_leading           <dbl> 0, 0, 0, 11, 0, 8, 9, 0, 8, 0, 0, 10, 0, 8,…
#> $ shots_neutral           <dbl> 27, 18, 1, 6, 11, 8, 16, 9, 22, 17, 4, 4, 1…
#> $ shots_trailing          <dbl> 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2…
#> $ shots_conceded_leading  <dbl> 0, 0, 0, 7, 0, 0, 4, 0, 3, 0, 0, 10, 0, 1, …
#> $ shots_conceded_neutral  <dbl> 6, 11, 10, 0, 13, 7, 5, 11, 6, 5, 9, 4, 8, …
#> $ shots_conceded_trailing <dbl> 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2…
#> $ sot_leading             <dbl> 0, 0, 0, 5, 0, 2, 1, 0, 3, 0, 0, 3, 0, 4, 0…
#> $ sot_neutral             <dbl> 10, 6, 0, 4, 2, 4, 7, 6, 11, 5, 1, 2, 2, 2,…
#> $ sot_trailing            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1…
#> $ sot_conceded_leading    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ sot_conceded_neutral    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ sot_conceded_trailing   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ g_leading               <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 3, 0…
#> $ g_neutral               <dbl> 4, 0, 0, 2, 0, 2, 2, 1, 3, 2, 0, 2, 1, 2, 0…
#> $ g_trailing              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1…
#> $ g_conceded_leading      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
#> $ g_conceded_neutral      <dbl> 3, 1, 2, 0, 0, 0, 0, 2, 1, 1, 3, 0, 0, 0, 3…
#> $ g_conceded_trailing     <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ xg_leading              <dbl> 0.00, 0.00, 0.00, 1.57, 0.00, 0.65, 0.72, 0…
#> $ xg_neutral              <dbl> 2.57, 1.46, 0.09, 0.66, 1.53, 1.56, 1.71, 0…
#> $ xg_trailing             <dbl> 0.00, 0.00, 0.48, 0.00, 0.00, 0.00, 0.00, 0…
#> $ xg_conceded_leading     <dbl> 0.00, 0.00, 0.00, 0.57, 0.00, 0.00, 0.21, 0…
#> $ xg_conceded_neutral     <dbl> 1.50, 0.71, 1.65, 0.00, 0.79, 0.88, 0.20, 1…
#> $ xg_conceded_trailing    <dbl> 0.00, 0.00, 1.52, 0.00, 0.00, 0.00, 0.00, 0…
#> $ xgot_leading            <dbl> 0.00, 0.00, 0.00, 3.06, 0.00, 0.23, 0.07, 0…
#> $ xgot_neutral            <dbl> 3.34, 1.09, 0.00, 1.72, 0.18, 2.01, 2.67, 1…
#> $ xgot_trailing           <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
#> $ xgot_conceded_leading   <dbl> 0.00, 0.00, 0.00, 0.63, 0.00, 0.00, 0.00, 0…
#> $ xgot_conceded_neutral   <dbl> 2.11, 1.02, 2.21, 0.00, 0.53, 0.26, 0.08, 1…
#> $ xgot_conceded_trailing  <dbl> 0.0, 0.0, 2.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
#> $ duration_leading        <dbl> 0, 0, 0, 70, 0, 26, 38, 0, 22, 0, 0, 57, 0,…
#> $ duration_neutral        <dbl> 99, 99, 40, 27, 96, 70, 59, 98, 77, 96, 81,…
#> $ duration_trailing       <dbl> 0, 0, 56, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0,…

Methods and Analysis

Replicating prior art

Let’s begin with replicating Ijtsma’s visualization where he evaluated the running R-squared of the five measures of past performance with respect to two measures of future performance. The five measures of past performance are:

  1. points per game
  2. goal ratio
  3. total shots ratio
  4. shots on target ratio
  5. xG ratio

The two measures of future performance are:

  1. goals ratio
  2. points per game
Calculate rolling R-squared for season-to-date and rest-of-season performance measures
accumulate_team_summary <- function(df, op, .prefix) {
  df |> 
    dplyr::arrange(team, season, op(game_idx)) |> 
    dplyr::group_by(season, team) |> 
    dplyr::mutate(
      dplyr::across(
        c(
          shots, shots_conceded, 
          sot, sot_conceded,
          g, g_conceded, 
          xg, xg_conceded,
          xgot, xgot_conceded,
          
          g_neutral, g_conceded_neutral, 
          xg_neutral, xg_conceded_neutral, 
          xgot_neutral, xgot_conceded_neutral,
          
          ppa, ppa_conceded, 
          xag, xag_conceded,
          xg_xag, xg_xag_conceded,
          
          duration_leading,
          pts, pts_conceded
        ),
        \(.x) cumsum(dplyr::coalesce(.x, 0))
      )
    ) |> 
    dplyr::ungroup() |> 
    dplyr::mutate(
      shot_ratio = shots / (shots + shots_conceded),
      sot_ratio = sot / (sot + sot_conceded),
      g_ratio = g / (g + g_conceded),
      xg_ratio = xg / (xg + xg_conceded),
      xgot_ratio = xgot / (xgot + xgot_conceded),
      
      shot_neutral_ratio = shots_neutral / (shots_neutral + shots_conceded_neutral),
      sot_neutral_ratio = sot_neutral / (sot_neutral + sot_conceded_neutral),
      g_neutral_ratio = g_neutral / (g_neutral + g_conceded_neutral),
      xg_neutral_ratio = xg_neutral / (xg_neutral + xg_conceded_neutral),
      xgot_neutral_ratio = xgot_neutral / (xgot_neutral + xgot_conceded_neutral),

      ppa_ratio = ppa / (ppa + ppa_conceded),
      xag_ratio = xag / (xag + xag_conceded),
      xg_xag_ratio = xg_xag / (xg_xag + xg_xag_conceded),
      
      duration_leading_per_game = duration_leading / game_idx,
      ppg = pts / game_idx
    ) |> 
    dplyr::mutate(
      ## replace NaNs with NAs for the cor calculation
      dplyr::across(
        dplyr::ends_with('ratio'),
        \(.x) tidyr::replace_na(.x, 0.5)
      )
    ) |> 
    dplyr::rename_with(
      .fn = \(.x) paste0(.prefix, '_', .x),
      .cols = c(
        shots, shots_conceded, 
        sot, sot_conceded,
        g, g_conceded, 
        xg, xg_conceded,
        xgot, xgot_conceded,
        
        shots_neutral, shots_conceded_neutral, 
        sot_neutral, sot_conceded_neutral,
        g_neutral, g_conceded_neutral, 
        xg_neutral, xg_conceded_neutral, 
        xgot_neutral, xgot_conceded_neutral,
        
        ppa, ppa_conceded, 
        xag, xag_conceded,
        xg_xag, xg_xag_conceded,
        
        duration_leading,
        pts, pts_conceded,
        
        dplyr::ends_with('ratio'),
        
        duration_leading_per_game,
        ppg
      )
    )
}

calculate_nested_r2 <- function(data, col, target_col) {
  purrr::map_dbl(
    data,
    \(.x) {
      cor(
        .x[[col]],
        .x[[target_col]],
        use = 'complete.obs'
      )^2
    }
  )
}

calculate_rolling_r2s <- function(combined_df) {
  
  past_team_summary <- combined_df |> 
    accumulate_team_summary(`+`, .prefix = 'past')
  
  future_team_summary <- combined_df |> 
    accumulate_team_summary(`-`, .prefix = 'future')
  
  accumulated_df <- dplyr::inner_join(
    past_team_summary |> 
      dplyr::select(
        season,
        team,
        country,
        game_idx,
        dplyr::starts_with('past')
      ),
    future_team_summary |> 
      dplyr::select(
        season,
        team,
        game_idx,
        dplyr::starts_with('future')
      ),
    by = dplyr::join_by(season, team, game_idx)
  ) |> 
    dplyr::mutate(
      league_group = ifelse(country == 'USA', 'MLS', 'Big 5'),
      .keep = 'unused',
      .before = 1
    )
  
  accumulated_df |> 
    tidyr::nest(data = -c(league_group, game_idx)) |> 
    dplyr::mutate(
      past_shot_ratio__future_g_ratio = calculate_nested_r2(data, 'past_shot_ratio', 'future_g_ratio'),
      past_sot_ratio__future_g_ratio = calculate_nested_r2(data, 'past_sot_ratio', 'future_g_ratio'),
      past_g_ratio__future_g_ratio = calculate_nested_r2(data, 'past_g_ratio', 'future_g_ratio'),
      past_xg_ratio__future_g_ratio = calculate_nested_r2(data, 'past_xg_ratio', 'future_g_ratio'),
      past_xgot_ratio__future_g_ratio = calculate_nested_r2(data, 'past_xgot_ratio', 'future_g_ratio'),
      
      past_shot_neutral_ratio__future_g_ratio = calculate_nested_r2(data, 'past_shot_neutral_ratio', 'future_g_ratio'),
      past_sot_neutral_ratio__future_g_ratio = calculate_nested_r2(data, 'past_sot_neutral_ratio', 'future_g_ratio'),
      past_g_neutral_ratio__future_g_ratio = calculate_nested_r2(data, 'past_g_neutral_ratio', 'future_g_ratio'),
      past_xg_neutral_ratio__future_g_ratio = calculate_nested_r2(data, 'past_xg_neutral_ratio', 'future_g_ratio'),
      past_xgot_neutral_ratio__future_g_ratio = calculate_nested_r2(data, 'past_xgot_neutral_ratio', 'future_g_ratio'),
      
      ## Bonus 1: Non-gamestate, non-shot features
      past_ppa_ratio__future_g_ratio = calculate_nested_r2(data, 'past_ppa_ratio', 'future_g_ratio'),
      past_xag_ratio__future_g_ratio = calculate_nested_r2(data, 'past_xag_ratio', 'future_g_ratio'),
      past_xg_xag_ratio__future_g_ratio = calculate_nested_r2(data, 'past_xg_xag_ratio', 'future_g_ratio'),
      
      ## Bonus 2: duration leading per game
      past_duration_leading_per_game__future_g_ratio = calculate_nested_r2(data, 'past_duration_leading_per_game', 'future_g_ratio'),
      past_ppg__future_g_ratio = calculate_nested_r2(data, 'past_ppg', 'future_g_ratio'),
      
      past_shot_ratio__future_ppg = calculate_nested_r2(data, 'past_shot_ratio', 'future_ppg'),
      past_sot_ratio__future_ppg = calculate_nested_r2(data, 'past_sot_ratio', 'future_ppg'),
      past_g_ratio__future_ppg = calculate_nested_r2(data, 'past_g_ratio', 'future_ppg'),
      past_xg_ratio__future_ppg = calculate_nested_r2(data, 'past_xg_ratio', 'future_ppg'),
      past_xgot_ratio__future_ppg = calculate_nested_r2(data, 'past_xgot_ratio', 'future_ppg'),
      
      past_shot_neutral_ratio__future_ppg = calculate_nested_r2(data, 'past_shot_neutral_ratio', 'future_ppg'),
      past_sot_neutral_ratio__future_ppg = calculate_nested_r2(data, 'past_sot_neutral_ratio', 'future_ppg'),
      past_g_neutral_ratio__future_ppg = calculate_nested_r2(data, 'past_g_neutral_ratio', 'future_ppg'),
      past_xg_neutral_ratio__future_ppg = calculate_nested_r2(data, 'past_xg_neutral_ratio', 'future_ppg'),
      past_xgot_neutral_ratio__future_ppg = calculate_nested_r2(data, 'past_xgot_neutral_ratio', 'future_ppg'),
      
      past_ppa_ratio__future_ppg = calculate_nested_r2(data, 'past_ppa_ratio', 'future_ppg'),
      past_xag_ratio__future_ppg = calculate_nested_r2(data, 'past_xag_ratio', 'future_ppg'),
      past_xg_xag_ratio__future_ppg = calculate_nested_r2(data, 'past_xg_xag_ratio', 'future_ppg'),
      
      past_duration_leading_per_game__future_ppg = calculate_nested_r2(data, 'past_duration_leading_per_game', 'future_ppg'),
      past_ppg__future_ppg = calculate_nested_r2(data, 'past_ppg', 'future_ppg')
    ) |> 
    dplyr::select(-data)
}

pivot_rolling_r2s <- function(df) {
  df |> 
    tidyr::pivot_longer(
      -c(league_group, game_idx),
      names_pattern = '(^.*)__(.*$)',
      names_to = c('predictor', 'target'),
      values_to = 'r2'
    )
}

do_calculate_rolling_r2s <- purrr::compose( 
  calculate_rolling_r2s,
  pivot_rolling_r2s,
  .dir = 'forward'
)

join_team_summary_and_gamestate_dfs <- function(team_summary_df, gamestate_df) {
  team_summary_df |> 
    dplyr::left_join(
      gamestate_df |> dplyr::select(-c(country, gender, tier, date)),
      by = dplyr::join_by(season, match_id, team)
    ) |> 
    ## since we didn't have xgot at the match-level, we need to create it from the shot-level data
    dplyr::mutate(
      xgot = xgot_trailing + xgot_neutral + xgot_leading,
      xgot_conceded = xgot_conceded_trailing + xgot_conceded_neutral + xgot_conceded_leading
    )
}

combined_df_gd0 <- join_team_summary_and_gamestate_dfs(
  combined_team_summary,
  gamestate_gd0_by_game
)
combined_df_abs_gd1 <- join_team_summary_and_gamestate_dfs(
  combined_team_summary,
  gamestate_abs_gd1_by_game
)

## both dfs have the same format
rolling_r2s_gd0 <- combined_df_gd0 |> do_calculate_rolling_r2s()
rolling_r2s_abs_gd1 <- combined_df_abs_gd1 |> do_calculate_rolling_r2s()
rolling_r2s_abs_gd1
#> # A tibble: 2,088 × 5
#>    league_group game_idx predictor               target             r2
#>    <chr>           <int> <chr>                   <chr>           <dbl>
#>  1 Big 5               1 past_shot_ratio         future_g_ratio 0.187 
#>  2 Big 5               1 past_sot_ratio          future_g_ratio 0.170 
#>  3 Big 5               1 past_g_ratio            future_g_ratio 0.165 
#>  4 Big 5               1 past_xg_ratio           future_g_ratio 0.182 
#>  5 Big 5               1 past_xgot_ratio         future_g_ratio 0.158 
#>  6 Big 5               1 past_shot_neutral_ratio future_g_ratio 0.188 
#>  7 Big 5               1 past_sot_neutral_ratio  future_g_ratio 0.0206
#>  8 Big 5               1 past_g_neutral_ratio    future_g_ratio 0.166 
#>  9 Big 5               1 past_xg_neutral_ratio   future_g_ratio 0.183 
#> 10 Big 5               1 past_xgot_neutral_ratio future_g_ratio 0.155 
#> # ℹ 2,294 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Plotting the R-squared values
TAG_LABEL <- htmltools::tagList(
  htmltools::tags$span(htmltools::HTML(enc2utf8("")), style = 'font-family:fb'),
  htmltools::tags$span("@TonyElHabr"),
)
CAPTION_LABEL <- '**Data**: Opta via fbref. 2018 - 2024 seasons, excluding 2020.<br/>**Definitions**: Ratio = team value / (team value + opponent value).<br/>**Inspiration**: 11tegen, American Soccer Analysis'

PLOT_RESOLUTION <- 300
WHITISH_FOREGROUND_COLOR <- 'white'
COMPLEMENTARY_FOREGROUND_COLOR <- '#cbcbcb' # '#f1f1f1'
BLACKISH_BACKGROUND_COLOR <- '#1c1c1c'
COMPLEMENTARY_BACKGROUND_COLOR <- '#4d4d4d'
FONT <- 'Titillium Web'
sysfonts::font_add_google(FONT, FONT)
## https://github.com/tashapiro/tanya-data-viz/blob/main/chatgpt-lensa/chatgpt-lensa.R for twitter logo
sysfonts::font_add('fb', 'Font Awesome 6 Brands-Regular-400.otf')
showtext::showtext_auto()
showtext::showtext_opts(dpi = PLOT_RESOLUTION)

ggplot2::theme_set(ggplot2::theme_minimal())
ggplot2::theme_update(
  text = ggplot2::element_text(family = FONT),
  title = ggplot2::element_text(size = 16, color = WHITISH_FOREGROUND_COLOR),
  plot.title = ggtext::element_markdown(face = 'bold', size = 16, color = WHITISH_FOREGROUND_COLOR),
  plot.title.position = 'plot',
  plot.subtitle = ggtext::element_markdown(size = 16, color = COMPLEMENTARY_FOREGROUND_COLOR),
  axis.text = ggplot2::element_text(color = WHITISH_FOREGROUND_COLOR, size = 11),
  legend.text = ggplot2::element_text(size = 12, color = WHITISH_FOREGROUND_COLOR, face = 'plain'),
  legend.title = ggplot2::element_text(size = 12, color = WHITISH_FOREGROUND_COLOR, face = 'bold'),
  axis.title.x = ggtext::element_markdown(size = 14, color = WHITISH_FOREGROUND_COLOR, face = 'bold', hjust = 0.99),
  axis.title.y = ggtext::element_markdown(size = 14, color = WHITISH_FOREGROUND_COLOR, face = 'bold', hjust = 0.99),
  axis.ticks = ggplot2::element_line(color = WHITISH_FOREGROUND_COLOR),
  axis.line = ggplot2::element_blank(),
  strip.text = ggplot2::element_text(size = 14, color = WHITISH_FOREGROUND_COLOR, face = 'bold', hjust = 0),
  panel.grid.major = ggplot2::element_line(color = COMPLEMENTARY_BACKGROUND_COLOR),
  panel.grid.minor = ggplot2::element_line(color = COMPLEMENTARY_BACKGROUND_COLOR),
  panel.grid.minor.x = ggplot2::element_blank(),
  panel.grid.minor.y = ggplot2::element_blank(),
  plot.margin = ggplot2::margin(10, 20, 10, 20),
  plot.background = ggplot2::element_rect(fill = BLACKISH_BACKGROUND_COLOR, color = BLACKISH_BACKGROUND_COLOR),
  plot.caption = ggtext::element_markdown(color = WHITISH_FOREGROUND_COLOR, hjust = 0, size = 10, face = 'plain', lineheight = 1.1),
  plot.caption.position = 'plot',
  plot.tag = ggtext::element_markdown(size = 10, color = WHITISH_FOREGROUND_COLOR, hjust = 1),
  plot.tag.position = c(0.99, 0.01),
  panel.spacing.x = grid::unit(2, 'lines'),
  panel.spacing.y = grid::unit(1, 'lines'),
  # panel.background = ggplot2::element_rect(fill = BLACKISH_BACKGROUND_COLOR, color = BLACKISH_BACKGROUND_COLOR)
  panel.background = ggplot2::element_rect(fill = BLACKISH_BACKGROUND_COLOR, color = WHITISH_FOREGROUND_COLOR)
)

BASE_PRETTY_PREDICTOR_NAMES <- c(
  'past_ppg' = 'Points per Game',
  'past_g_ratio' = 'Goal Ratio',
  'past_shot_ratio' = 'Total Shots Ratio',
  'past_sot_ratio' = 'Shots on Target Ratio',
  'past_xg_ratio' = 'xG Ratio'
)
APPENDIX_PRETTY_PREDICTOR_NAMES <- c(
  'past_ppa_ratio' = 'Passes into Penalty Area Ratio',
  'past_xgot_ratio' = 'xGOT Ratio',
  # 'past_xag_ratio' = 'xAG Ratio',
  'past_xg_xag_ratio' = 'xG+xAG Ratio',
  'past_xg_ratio' = 'xG Ratio'
)
PRETTY_TARGET_NAMES <- c(
  'future_ppg' = 'Future Points per Game',
  'future_g_ratio' = 'Future Goals Ratio'
)
BASE_PREDICTOR_PALETTE <- c(
  'Points per Game' = '#cad2c5',
  'Goal Ratio' = '#8bc34a',
  'Total Shots Ratio' = '#ffc107',
  'Shots on Target Ratio' = '#448aff',
  'xG Ratio' = '#f44336'
)
APPENDIX_PREDICTOR_PALETTE <- c(
  'Passes into Penalty Area Ratio' = '#dbf679',
  'xGOT Ratio' = '#2ec4b6',
  # 'xAG Ratio' = '#e0aaff', # '#7b2cbf'
  'xG+xAG Ratio' = '#e0aaff',
  'xG Ratio' = '#f44336'
)

CROSSHAIR_LABEL <- '<i>Crosshairs mark first match week where R<sup>2</sup>>0.5.</i>'
plotting_rolling_r2s <- function(rolling_r2s, pretty_predictor_names, predictor_palette) {
  prettified_rolling_r2s <- rolling_r2s |> 
    dplyr::filter(
      predictor %in% names(pretty_predictor_names)
    ) |> 
    dplyr::mutate(
      pretty_predictor = factor(pretty_predictor_names[predictor], rev(pretty_predictor_names)),
      pretty_target = PRETTY_TARGET_NAMES[target]
    ) |> 
    ## put xG on top
    dplyr::arrange(league_group, pretty_target, dplyr::desc(pretty_predictor))
  
  min_prettified_rolling_r2s <- prettified_rolling_r2s |> 
    dplyr::filter(r2 > 0.5) |> 
    dplyr::group_by(league_group, pretty_target, pretty_predictor) |> 
    dplyr::slice_min(game_idx, n = 1) |> 
    dplyr::ungroup()
  
  prettified_rolling_r2s |> 
    ggplot2::ggplot() +
    ggplot2::aes(
      x = game_idx,
      y = r2
    ) +
    ggplot2::geom_segment(
      data = min_prettified_rolling_r2s,
      ggplot2::aes(
        color = pretty_predictor,
        x = 0,
        xend = game_idx,
        y = r2,
        yend = r2
      ),
      linetype = 2,
      linewidth = 0.5
    ) +
    ggplot2::geom_segment(
      data = min_prettified_rolling_r2s,
      ggplot2::aes(
        color = pretty_predictor,
        x = game_idx,
        xend = game_idx,
        y = 0,
        yend = r2
      ),
      linetype = 2,
      linewidth = 0.5
    ) +
    ggplot2::geom_line(
      linewidth = 0.75,
      ggplot2::aes(color = pretty_predictor)
    ) +
    ggplot2::scale_color_manual(
      values = predictor_palette
    ) +
    ggplot2::guides(
      color = ggplot2::guide_legend(
        title = NULL,
        position = 'inside',
        label.theme = ggplot2::element_text(color = WHITISH_FOREGROUND_COLOR, size = 11, FONT),
        override.aes = list(linewidth = 2)
      )
    ) +
    ggplot2::scale_y_continuous(
      expand = c(0, 0),
      limits = c(0, 1),
      breaks = seq(0, 1, by = 0.2),
      labels = scales::number_format(accuracy = 0.1)
    ) +
    ggplot2::scale_x_continuous(
      expand = c(0, 0),
      limits = c(0, 38),
      breaks = seq.int(0, 35, by = 5),
      labels = seq.int(0, 35, by = 5)
    ) +
    ggtext::geom_richtext(
      data = tibble::tibble(
        league_group = 'Big 5',
        pretty_target = sort(unname(PRETTY_TARGET_NAMES))[1]
      ),
      ggplot2::aes(
        x = 1,
        y = 0.7,
        label = CROSSHAIR_LABEL
      ),
      fill = NA, label.color = NA,
      label.padding = grid::unit(rep(0, 4), 'pt'),
      color = WHITISH_FOREGROUND_COLOR,
      family = FONT,
      size = 10 / .pt,
      hjust = 0,
      vjust = 0.5
    ) +
    ggplot2::theme(
      legend.position.inside = c(0.82, 0.35),
      legend.key.spacing.y = ggplot2::unit(-4, 'pt'),
      panel.grid.major = ggplot2::element_blank()
    ) +
    ggplot2::facet_grid(league_group~pretty_target) +
    ggplot2::labs(
      title = 'Predictiveness of Season-to-Date Metrics on Rest-of-Season Performance',
      x = 'Match Day',
      y = 'R<sup>2</sup>',
      caption = CAPTION_LABEL,
      tag = TAG_LABEL
    )
}

## Can pick either rolling_r2s_gd0 or rolling_r2s_abs_gd1 to plot since we're looking at just
##   non-gamestate features here
rolling_r2s_plot <- rolling_r2s_abs_gd1 |> 
  plotting_rolling_r2s(
    pretty_predictor_names = BASE_PRETTY_PREDICTOR_NAMES, 
    predictor_palette = BASE_PREDICTOR_PALETTE
  )

ggplot2::ggsave(
  rolling_r2s_plot,
  filename = file.path(PROJ_DIR, 'rolling-r2s.png'),
  width = 9,
  height = 9 / 1.5
)

There’s a bit of visual noise here that can make it hard to differentiate traces. This can be smoothed out by resampling.

Specifically, if we randomly reorder matchweeks before calculating cumulative in-season measures, and do that 100 times, we end up with a plot that looks like this.

Calculating bootstrapped R-squared values
calculate_resampled_rolling_r2s <- function(
    combined_df, 
    resamples = 10, 
    seed = 42,
    parallel_seed = 117, 
    cores_prop = 0.5
  ) {

  match_idx_grid <- combined_df |> dplyr::distinct(season, country, gender, tier, game_idx)
  n_cores <- parallel::detectCores()
  cores_for_parallel <- ceiling(n_cores * cores_prop)
  future::plan(
    future::multisession,
    workers = cores_for_parallel
  )
  withr::local_seed(seed)
  resampled_rolling_df <- furrr::future_map(
    1:resamples,
    .progress = TRUE,
    .options = furrr::furrr_options(
      seed = parallel_seed
    ),
    \(.i) {
      resampled_match_idx_grid <- match_idx_grid |> 
        dplyr::rename(orig_game_idx = game_idx) |> 
        dplyr::slice_sample(n = nrow(match_idx_grid), replace = FALSE) |> 
        dplyr::group_by(country, gender, tier, season) |>
        dplyr::mutate(
          game_idx = dplyr::row_number(),
        ) |> 
        dplyr::ungroup()
      
      combined_df |> 
        dplyr::rename(orig_game_idx = game_idx) |> 
        dplyr::left_join(
          resampled_match_idx_grid,
          by = dplyr::join_by(season, country, gender, tier, orig_game_idx)
        ) |> 
        calculate_rolling_r2s()
    }
  )  |> 
    purrr::list_rbind()
  future::plan(future::sequential)
  
  resampled_rolling_df |> 
    dplyr::summarize(
      .by = c(league_group, game_idx),
      dplyr::across(
        dplyr::matches('__'),
        \(.x) mean(.x, na.rm = TRUE)
      )
    )
}

do_calculate_resampled_rolling_r2s <- purrr::compose( 
  calculate_resampled_rolling_r2s,
  pivot_rolling_r2s,
  .dir = 'forward'
)

N_RESAMPLES <- 100
resampled_rolling_r2s_gd0 <- combined_df_gd0 |>
  do_calculate_resampled_rolling_r2s(resamples = N_RESAMPLES)
resampled_rolling_r2s_abs_gd1 <- combined_df_abs_gd1 |> 
  do_calculate_resampled_rolling_r2s(resamples = N_RESAMPLES)
Plotting the bootstrapped R-squared values
BOOTSTRAPPED_CAPTION_LABEL <- stringr::str_replace(
  CAPTION_LABEL, 
  '<br\\/>', 
  glue::glue(' Data resampled by match day {N_RESAMPLES} times.<br/>')
)
relabel_for_bootstrap_plot <- function(...) {
  list(
    ...,
    ggplot2::labs(
      y = 'Average Bootstrapped R<sup>2</sup>',
      caption = BOOTSTRAPPED_CAPTION_LABEL,
      x = 'Randomized Match Day'
    )
  )
}

resampled_rolling_r2s_plot <- resampled_rolling_r2s_gd0 |> 
  plotting_rolling_r2s(
    pretty_predictor_names = BASE_PRETTY_PREDICTOR_NAMES, 
    predictor_palette = BASE_PREDICTOR_PALETTE
  ) +
  relabel_for_bootstrap_plot()

ggplot2::ggsave(
  resampled_rolling_r2s_plot,
  filename = file.path(PROJ_DIR, 'bootstrapped-rolling-r2s.png'),
  width = 9,
  height = 9 / 1.5
)

Indeed, this looks like the bootstrapped plot from ASA.5 Note that, for a given measure, the bootstrapped R-squared values are slightly smaller across the board compared to the non-bootstrapped values.6 This is perhaps not surprising, as resampling tends to have a “shrinking” effect on figures. Intuitively, this noise could be associated with scheduling bias, injuries, etc.

Extending prior art

Now we incorporate gamestate-aware measures.7 In this case, we’re mostly interested game time where the score is close.

Most commonly, “close” is defined as a tied score, whether it’s 0-0, 1-1, etc. Let’s call this the gd = 0 approach. The indirect assumption is that teams start to play more conservatively when leading (and more aggressively when trailing), thereby distorting the ratio of shots, goals, and xG that we might otherwise expect given the relative quality of the teams. By excluding events when the game is not tied, we might achieve more “signal” in our measures of performance.

We might also define “close” as periods when the absolute difference in goals is just 1, so 0-1, 1-2, 3-2, 5-4, etc. Let’s call this the abs(gd) = 1 approach. This definition indirectly assumes that teams don’t start to play more complacently when ahead until they have a 2 goal lead. This approach would capture more game time (and have higher tallies of goals, shots, etc.) compared to the gd = 0 approach, although less time than just using the whole match.

In the plot below, I’ve split out each of the measures of past performance into its own facet, paired with the neutral gamestate version using the gd = 0 definition. This makes it easy to see which pairs of past and future performance measures are improved by subsetting the past performance measure to neutral gamestates. To reduce the scope and enhance the focus of the work, I’ve chosen to look at just points per game as our measure of future performance, and just the goals and xG ratios as our measures of past performance.

Plotting the bootstrapped R-squared values, including the neutral game state metrics
BASE_AND_NEUTRAL_PRETTY_PREDICTOR_LOOKUP <- list(
  'xG Ratio' = c(
    'past_xg_ratio',
    'past_xg_neutral_ratio'
  ),
  'Goal Ratio' = c(
    'past_g_ratio',
    'past_g_neutral_ratio'
  ) #,
  # 'Total Shots Ratio' = c(
  #   'past_shot_ratio',
  #   'past_shot_neutral_ratio'
  # ),
  # 'Shots on Target Ratio' = c(
  #   'past_sot_ratio',
  #   'past_sot_neutral_ratio'
  # )
)
BASE_AND_NEUTRAL_TARGET_PREDICTOR <- 'future_ppg'
unlist_and_invert <- function(x) {
  unlisted <- unlist(x)
  names(unlisted) <- rep(names(x), lengths(x))
  setNames(names(unlisted), unname(unlisted))
}
BASE_AND_NEUTRAL_PRETTY_PREDICTOR_GROUP_NAMES <- unlist_and_invert(
  BASE_AND_NEUTRAL_PRETTY_PREDICTOR_LOOKUP
)

IS_NEUTRAL_PALETTE <- c(
  `TRUE` =  '#00b4d8',
  `FALSE` = '#caf0f8'
)

plotting_rolling_r2s_with_neutral <- function(rolling_r2s, gamestate_description) {
  prettified_rolling_r2s <- rolling_r2s |> 
    dplyr::filter(
      predictor %in% names(BASE_AND_NEUTRAL_PRETTY_PREDICTOR_GROUP_NAMES),
      target == BASE_AND_NEUTRAL_TARGET_PREDICTOR
    ) |> 
    dplyr::mutate(
      is_neutral = stringr::str_detect(predictor, 'neutral'),
      pretty_predictor_group = factor(
        BASE_AND_NEUTRAL_PRETTY_PREDICTOR_GROUP_NAMES[predictor],
        levels = names(BASE_AND_NEUTRAL_PRETTY_PREDICTOR_LOOKUP)
      )
    ) |> 
    ## put xG on top
    dplyr::arrange(league_group, pretty_predictor_group, is_neutral)
  
  min_rolling_r2s <- prettified_rolling_r2s |> 
    dplyr::filter(r2 > 0.5) |> 
    dplyr::group_by(league_group, predictor) |> 
    dplyr::slice_min(game_idx, n = 1) |> 
    dplyr::ungroup()

  prettified_rolling_r2s |> 
    ggplot2::ggplot() +
    ggplot2::aes(
      x = game_idx,
      y = r2,
      group = predictor
    ) +
    ggplot2::geom_segment(
      data = min_rolling_r2s,
      ggplot2::aes(
        color = is_neutral,
        x = 0,
        xend = game_idx,
        y = r2,
        yend = r2
      ),
      linetype = 2,
      linewidth = 0.5
    ) +
    ggplot2::geom_segment(
      data = min_rolling_r2s,
      ggplot2::aes(
        color = is_neutral,
        x = game_idx,
        xend = game_idx,
        y = 0,
        yend = r2
      ),
      linetype = 2,
      linewidth = 0.5
    ) +
    ggplot2::geom_line(
      linewidth = 1,
      ggplot2::aes(color = is_neutral)
    ) +
    ggplot2::scale_color_manual(
      values = IS_NEUTRAL_PALETTE
    ) +
    ggplot2::guides(
      color = 'none'
    ) +
    ggplot2::scale_y_continuous(
      limits = c(0, 1),
      breaks = seq(0, 1, by = 0.2),
      labels = scales::number_format(accuracy = 0.1)
    ) +
    ggplot2::scale_x_continuous(
      breaks = seq.int(0, 35, by = 5),
      labels = seq.int(0, 35, by = 5)
    ) +
    ggplot2::theme(
      panel.grid.major = ggplot2::element_blank(),
      plot.subtitle = ggtext::element_markdown(size = 12)
    ) +
    ggplot2::facet_grid(league_group~pretty_predictor_group) +
    ggplot2::labs(
      title = 'Predictiveness of Season-to-Date Metrics on Rest-of-Season Points Per Game',
      subtitle = glue::glue('Are <b><span style="color:{IS_NEUTRAL_PALETTE[["FALSE"]]}">full match</span></b> metrics more predictive when subset to <b><span style="color:{IS_NEUTRAL_PALETTE[["TRUE"]]}">{gamestate_description}</span></b>?'),
      x = 'Randomized Match Day',
      y = 'Average Bootstrapped R<sup>2</sup>',
      caption = BOOTSTRAPPED_CAPTION_LABEL,
      tag = TAG_LABEL
    )
}

rolling_r2s_with_neutral_gd0_plot <- resampled_rolling_r2s_gd0 |>
  plotting_rolling_r2s_with_neutral('tied gamestates')

rolling_r2s_with_neutral_abs_gd1_plot <- resampled_rolling_r2s_abs_gd1 |> 
  plotting_rolling_r2s_with_neutral('gamestates with absolute goal difference <= 1')

ggplot2::ggsave(
  rolling_r2s_with_neutral_gd0_plot,
  filename = file.path(PROJ_DIR, 'bootstrapped-rolling-r2s-with-neutral-gd-0.png'),
  width = 8,
  height = 8 / 1.5
)

ggplot2::ggsave(
  rolling_r2s_with_neutral_abs_gd1_plot,
  filename = file.path(PROJ_DIR, 'bootstrapped-rolling-r2s-with-neutral-abs-gd-1.png'),
  width = 8,
  height = 8 / 1.5
)

So we see that subsetting xG and goals to tied gamestates reduces how predictive they are of future points per game, both for the Big 5 leagues and the MLS. Perhaps this is not surprising, as we’re disregarding a lot of data (over 50%!) that is included in full match sums.

Let’s see if the same plot using the abs(gd) = 1 neutral gamestate definition looks any different.

Ah, so using this alternative definition of “neutral” gamestate, the predictiveness of the season-to-date xG and goal ratios is much closer to the full match analogues, so perhaps the abs(gd) = 1 is superior. Nonetheless, it’s evident that neutral gamestate xG and goal ratios are no better than the full match ratios, indicating that there is no incremental value to be had with focusing on neutral gamestates. This falls in line with what Ijtsma hypothesized–that the effect of within-match complacency is minimal, and accounting for it (by reducing tallies to neutral gamestates) is more than likely to reduce predictiveness.

Discussion and Conclusion

Ok, so we didn’t find a ground-breaking result with neutral gamestates, although the value of a “null result” should not be understated.

To offer some kind of insight that might be useful, I present the game index (i.e. match day) at which my data shows that the season-to-date measures start to provide a reliable signal for forecasting future performance.8 I define the reliability “threshold” as the first week in which the running R-squared value exceeds 0.5. For those reading the footnotes or broadly aware of the literature on xG predictiveness, this is effectively my look at the common adage that xG becomes a reliable indicator of future performance somewhere between the 5 and 15 game mark.9

Calculating when past performance measures become reliable indicators of future performance
rolling_r2s_abs_gd1 |> 
  dplyr::filter(
    league_group == 'Big 5',
    target == 'future_g_ratio',
    predictor %in% names(BASE_PRETTY_PREDICTOR_NAMES)
  ) |> 
  dplyr::mutate(
    pretty_predictor = BASE_PRETTY_PREDICTOR_NAMES[predictor]
  ) |> 
  dplyr::filter(r2 > 0.5) |> 
  dplyr::group_by(pretty_predictor) |>
  dplyr::slice_min(game_idx, n = 1) |> 
  dplyr::ungroup() |> 
  tidyr::complete(
    pretty_predictor = BASE_PRETTY_PREDICTOR_NAMES
  ) |> 
  dplyr::arrange(game_idx) |> 
  dplyr::transmute(
    `Predictor` = pretty_predictor, 
    `Match Day` = tidyr::replace_na(as.character(game_idx), '-')
  ) |> 
  knitr::kable(
    align = 'lr'
  ) |> 
  clipr::write_clip()

rolling_r2s_abs_gd1 |> 
  dplyr::filter(
    league_group == 'Big 5',
    target == 'future_ppg',
    predictor %in% names(BASE_PRETTY_PREDICTOR_NAMES)
  ) |> 
  dplyr::mutate(
    pretty_predictor = BASE_PRETTY_PREDICTOR_NAMES[predictor]
  ) |> 
  dplyr::filter(r2 > 0.5) |> 
  dplyr::group_by(pretty_predictor) |>
  dplyr::slice_min(game_idx, n = 1) |> 
  dplyr::ungroup() |> 
  tidyr::complete(
    pretty_predictor = BASE_PRETTY_PREDICTOR_NAMES
  ) |> 
  dplyr::arrange(game_idx) |> 
  dplyr::transmute(
    `Predictor` = pretty_predictor, 
    `Match Day` = tidyr::replace_na(as.character(game_idx), '-')
  ) |> 
  knitr::kable(
    align = 'lr'
  ) |> 
  clipr::write_clip()
Earliest match day in which the past metric becomes a reliable measure of future goals ratio in the Big 5 leagues.
Predictor Match Day
xG Ratio 9
Goal Ratio 11
Shots on Target Ratio 12
Points per Game 13
Total Shots Ratio -
Earliest match day in which the past metric becomes a reliable measure of future points per game in the Big 5 leagues.
Predictor Match Day
xG Ratio 13
Goal Ratio 15
Points per Game 16
Shots on Target Ratio 16
Total Shots Ratio -

Appendix

I was interested in evaluating running in-season R-squared values for a few more measures of past performance:

  1. passes into the penalty area ratio
  2. expected goals on target (xGOT) ratio
  3. xG + expected assisted goals10 (xG+xAG) ratio

Below is the same running R-squared plot shown before, just with these new measures, along side our tried and true xG ratio measure.

Aha! It seems that we have identified a measure–xG+xAG ratio (purple above)–that becomes reliable for forecasting rest-of-season performance even earlier than xG ratio. Specifically, it exceeds the reliability threshold at the 7 and 10 game marks for future goal ratio and future points per game respectively. That’s a non-trivial improvement compared to the 9 and 13 match day marks for xG ratio.

Alas, xG+xAG metaphorically stands on the shoulders of xG, adding in the component of expected assists (xA) on shots. Assuming that the underlying xA model is well calibrated, perhaps we should not be surprised to see that the composite xG+xAG measure outperforms all others.

On the other hand, I suppose I am a bit surprised that xGOT ratio does not seem to do nearly as well as xG ratio in terms of forecasting rest-of-season performancet. The implication is that there is value in the xG of off-target shots for projecting the future. By comparison, shots on target ratio tend to be more predictive than just the shots ratio, meaning that including off-target shots introduces noise that reduces predictiveness. That’s an interesting difference in shots and xG!

No matching items

Footnotes

  1. If you know anything about the broader soccer analytics discourse, you’ve more than likely heard about one of the Ijtsma’s findings regarding the predictiveness of xG. As Eliot puts it: “If you’ve ever heard or read someone say that it takes 5-10 (4 in the article) games for xG to be predictive, they probably read this article or got it second hand from someone who has.”↩︎

  2. Note that there are some critics of the choice to rely on R-squared as a measure of predictiveness. For the purpose of this post, we’ll take that criticism on the chin.↩︎

  3. A “ratio” here is broadly defined in pseudo-code as team's value / (team's value + opponent's value). Ijtsma’s notes in a reply to a comment that one might arguably use a difference formula, i.e. team's value - opponent's value, as ratios are susceptible to noise when values are themselves fractional. This is most relevant for xG, and not so relevant for shots and goals, which are inherently discrete.↩︎

  4. Notably, I exclude the COVID-19 influenced 2020 season, so as to reduce a bit of noise in the data. Although bootstrapping games within-season should reduce much of this noise, in practice, FBref is missing some game-level xG data for 2020 matches, so this choice is sort of just a pragmatic one.↩︎

  5. I only perform 100 bootstraps instead of 1,000 like Eliot did since I found 100 sufficient for creating visual clarity.↩︎

  6. This may not be easy to see directly since the bootstrapped and non-bootstrapped curves are not plotted altogether. Nonetheless, one can compare the peaks of the curves against the y-axis in this plot compared to the prior one to verify the suppressing effect of the resampling.↩︎

  7. If you’ve been following the code, these measures have already been calculated.↩︎

  8. The tables that follow show non-bootstrapped R-squared values. The bootstrapped R-squared values only exceed 0.5 for a handful of measures, typically in the 15-20 game range.↩︎

  9. Note that Ijtsma says that xG ratio becomes useful as early as the fourth match week. However, his statement is based purely on a visual assessment of when the R-squared curve generally “settles down” after the expected early season up-tick. Looking at his graph, the R-squared values for xG don’t actually exceed the 0.5 threshold until some point between the 10th and 15th match days.↩︎

  10. FBref’s explanation regarding how xAG differs from expected assists (xA): “xA, or expected assists, is the likelihood that a given completed pass will become a goal assist… Players receive xA for every completed pass regardless of whether a shot occurred or not… [Expected Assisted Goals (xAG)] indicates a player’s ability to set up scoring chances… Players receive xAG only when a shot is taken after a completed pass.”↩︎

To leave a comment for the author, please follow the link and comment on their blog: Tony's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)