Site icon R-bloggers

Estimating Shooting Performance Unlikeliness

[This article was first published on Tony's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< section id="introduction" class="level1">

Introduction

Towards the end of each soccer season, we naturally start to look back at player stats, often looking to see who has performed worse compared to their past seasons. We may have different motivations for doing so–e.g. we may be trying to attribute team under-performance to individuals, we may be hypothesizing who is likely to be transferred or resigned, etc.

It’s not uncommon to ask “How unlikely was their shooting performance this season?” when looking at a player who has scored less than goals than expected.1 For instance, if a striker only scores 9 goals on 12 expected goals (xG), their “underperformance” of 3 goals jumps off the page.

The “Outperformance” () ratio–the ratio of a player ’s goals to expected goals –is a common way of evaluating a player’s shooting performance.2

An ratio of 1 indicates that a player is scoring as many goals as expected; a ratio greater than 1 indicates underperformance; and a ratio less than 1 indicates overperformance. Our hypothetical player underperformed with .

In most cases, we have prior seasons of data to use when evaluating a player’s ratio for a given season. For example, let’s say our hypothetical player scored 14 goals on 10 xG () in the season prior, and 12 goals on 8 xG () before that. A after those seasons seems fairly unlikely, especially compared to an “average” player who theoretically achieves ratio every year.

So how do we put a number on the unlikeliness of that for our hypothetical player, accounting for their prior season-long performances?

< section id="data" class="level3">

Data

I’ll be using public data from FBref for the 2018/19 – 2023/24 seasons of the the Big Five European soccer leagues, updated through April 25. Fake data is nice for examples, but ultimately we want to test our methods on real data. Our intuition about the results can be a useful caliber of the sensibility of our results.

< details> < summary>Get shot data
raw_shots <- worldfootballR::load_fb_match_shooting(
  country = COUNTRIES,
  tier = TIERS,
  gender = GENDERS,
  season_end_year = SEASON_END_YEARS
)
#> → Data last updated 2024-04-25 17:52:47 UTC

np_shots <- raw_shots |> 
  ## Drop penalties
  dplyr::filter(
    !dplyr::coalesce((Distance == '13' & round(as.double(xG), 2) == 0.79), FALSE)
  ) |> 
  dplyr::transmute(
    season_end_year = Season_End_Year,
    player_id = Player_Href |> dirname() |> basename(),
    player = Player,
    match_date = lubridate::ymd(Date),
    match_id = MatchURL |> dirname() |> basename(),
    minute = Minute,
    g = as.integer(Outcome == 'Goal'),
    xg = as.double(xG)
  ) |> 
  ## A handful of scored shots with empty xG
  dplyr::filter(!is.na(xg)) |> 
  dplyr::arrange(season_end_year, player_id, match_date, minute)

## Use the more commonly used name when a player ID is mapped to multiple names
##   (This "bug" happens because worldfootballR doesn't go back and re-scrape data
##   when fbref makes a name update.)
player_name_mapping <- np_shots |> 
  dplyr::count(player_id, player) |> 
  dplyr::group_by(player_id) |> 
  dplyr::slice_max(n, n = 1, with_ties = FALSE) |> 
  dplyr::ungroup() |> 
  dplyr::distinct(player_id, player)

player_season_np_shots <- np_shots |> 
  dplyr::summarize(
    .by = c(player_id, season_end_year), 
    shots = dplyr::n(),
    dplyr::across(c(g, xg), sum)
  ) |> 
  dplyr::mutate(
    o = g / xg
  ) |> 
  dplyr::left_join(
    player_name_mapping,
    by = dplyr::join_by(player_id)
  ) |> 
  dplyr::relocate(player, .after = player_id) |> 
  dplyr::arrange(player_id, season_end_year)
player_season_np_shots
#> # A tibble: 15,317 × 7
#>    player_id player          season_end_year shots     g    xg     o
#>    <chr>     <chr>                     <int> <int> <int> <dbl> <dbl>
#>  1 0000acda  Marco Benassi              2018    70     5  4.01 1.25 
#>  2 0000acda  Marco Benassi              2019    59     7  5.61 1.25 
#>  3 0000acda  Marco Benassi              2020    20     1  1.01 0.990
#>  4 0000acda  Marco Benassi              2022    10     0  0.99 0    
#>  5 0000acda  Marco Benassi              2023    19     0  1.35 0    
#>  6 000b3da6  Manuel Iturra              2018     2     0  0.41 0    
#>  7 00242715  Moussa Niakhate            2018    16     0  1.43 0    
#>  8 00242715  Moussa Niakhate            2019    10     1  1.5  0.667
#>  9 00242715  Moussa Niakhate            2020    11     1  1.02 0.980
#> 10 00242715  Moussa Niakhate            2021     9     2  1.56 1.28 
#> # ℹ 15,307 more rows

For illustrative purposes, we’ll focus on one player in particular–James Maddison. Maddison has had a sub-par 2023/2024 season by his own standards, underperforming his xG for the first time since he started playing in the Premier League in 2018/19.

< details open=""> < summary>Maddison’s season-by-season data
player_season_np_shots |> dplyr::filter(player == 'James Maddison')
#> # A tibble: 6 × 7
#>   player_id player         season_end_year shots     g    xg     o
#>   <chr>     <chr>                    <int> <int> <int> <dbl> <dbl>
#> 1 ee38d9c5  James Maddison            2019    81     6  5.85 1.03 
#> 2 ee38d9c5  James Maddison            2020    74     6  5.36 1.12 
#> 3 ee38d9c5  James Maddison            2021    75     8  3.86 2.07 
#> 4 ee38d9c5  James Maddison            2022    72    12  7.56 1.59 
#> 5 ee38d9c5  James Maddison            2023    83     9  7.12 1.26 
#> 6 ee38d9c5  James Maddison            2024    49     4  4.72 0.847
< details> < summary>More variables useful for the rest of the post
TARGET_SEASON_END_YEAR <- 2024

player_np_shots <- player_season_np_shots |> 
  dplyr::mutate(
    is_target = season_end_year == TARGET_SEASON_END_YEAR
  ) |> 
  dplyr::summarize(
    .by = c(is_target, player_id, player),
    dplyr::across(
      c(shots, g, xg),
      \(.x) sum(.x, na.rm = TRUE)
    )
  ) |> 
  dplyr::mutate(o = g / xg) |> 
  dplyr::arrange(player, player_id, is_target)

wide_player_np_shots <- player_np_shots |>
  dplyr::transmute(
    player_id, 
    player,
    which = ifelse(is_target, 'target', 'prior'), 
    shots, g, xg, o
  ) |> 
  tidyr::pivot_wider(
    names_from = which, 
    values_from = c(shots, g, xg, o), 
    names_glue = '{which}_{.value}'
  )

all_players_to_evaluate <- wide_player_np_shots |> 
  tidyr::drop_na(prior_o, target_o) |> 
  dplyr::filter(
    prior_shots >= 50,
    target_shots >= 10,
    prior_g > 0, 
    target_g > 0
  )
< section id="methods-and-analysis" class="level2">

Methods and Analysis

I’ll present 3 approaches to quantifying the “unlikelihood” of a player “underperforming” relative to their prior history.3 I use “prior” to refer to an aggregate of pre-2023/24 statistics, and “target” to refer to 2023/24.

I’ll discuss some of the strengths and weaknesses of each approach as we go along, then summarize the findings in the end.

< section id="approach-1-weighted-percentile-rank" class="level3">

Approach 1: Weighted Percentile Rank

The first approach I’ll present is a handcrafted “ranking” method.

  1. Calculate the proportional difference between the pre-target and target season outperformance ratios– and respectively–for all players .

  1. Weight by the player’s accumulated in prior seasons.4

  1. Calculate the the underperforming unlikeliness as a percentile rank of ascending , i.e. more negative values correspond to a lower percentile.56

This is pretty straightforward to calculate once you’ve got the data prepared in the right format.

< details> < summary>Approach 1 implementation
## `u` for "underperforming unlikelihood"
all_u_approach1 <- all_players_to_evaluate |> 
  dplyr::transmute(
    player,
    prior_o,
    target_o,
    prior_xg,
    weighted_delta_o = prior_shots * (target_o - prior_o) / prior_o,
    u = dplyr::percent_rank(weighted_delta_o)
  ) |> 
  dplyr::arrange(u)

maddison_u_approach1 <- all_u_approach1 |> 
  dplyr::filter(player == 'James Maddison')
< details open=""> < summary>Approach 1 output for Maddison
maddison_u_approach1 |> dplyr::select(player, prior_o, target_o, u)
#> # A tibble: 1 × 4
#>   player         prior_o target_o      u
#>   <chr>            <dbl>    <dbl>  <dbl>
#> 1 James Maddison    1.38    0.847 0.0321

This approach finds Maddison’s 2023/24 of 0.847 to be about a 3rd percentile outcome. Among the 593 players evaluated, Maddison’s 2023/24 ranks as the 20th most unlikely.

For context, here’s a look at the top 10 most unlikely outcomes for the 2023/24 season.

< details open=""> < summary>Approach 1 output, top 10 underperforming players
all_u_approach1 |> head(10) |> dplyr::select(player, prior_o, target_o, u)
#> # A tibble: 10 × 4
#>    player             prior_o target_o       u
#>    <chr>                <dbl>    <dbl>   <dbl>
#>  1 Ciro Immobile        1.23     0.383 0      
#>  2 Giovanni Simeone     1.03     0.306 0.00169
#>  3 Nabil Fekir          1.14     0.5   0.00338
#>  4 Wahbi Khazri         1.11     0.322 0.00507
#>  5 Téji Savanier        1.42     0.282 0.00676
#>  6 Adrien Thomasson     1.18     0.282 0.00845
#>  7 Timo Werner          0.951    0.543 0.0101 
#>  8 Gaëtan Laborde       1.02     0.546 0.0118 
#>  9 Kevin Volland        1.18     0.450 0.0135 
#> 10 Robert Lewandowski   1.01     0.759 0.0152 

Ciro Immobile tops the list, with several other notable attacking players who had less than stellar seasons.

Overall, I’d say that this methodology seems to generate fairly reasonable results, but it’s hard to pinpoint why exactly this approach is defensible other than it may lead to intuitive results. Despite the intuitively appealing outcomes, it’s important to scrutinize key factors affecting this methodology’s robustness and validity.

< section id="approach-2-resampling-from-prior-history-of-shots" class="level3">

Approach 2: Resampling from Prior History of Shots

There’s only so much you can do with player-season-level data. We need to dive into shot-level data if we want to more robustly understand the uncertainty of outcomes.

Here’s a “resampling” approach to quantify the underperforming unlikeliness of a player in the target season:

  1. Sample shots (with replacement7) from a player’s past shots . Repeat this for resamples.8
  2. Count the number of resamples in which the outperformance ratio of the sampled shots is less than or equal to the observed in the target season for the player.9 The proportion represents the unlikeness of a given player’s observed (or worse) in the target season.

Here’s how that looks in code.

< details> < summary>Approach 2 implementation
R <- 1000
resample_player_shots <- function(
    shots, 
    n_shots_to_sample, 
    n_sims = R,
    replace = TRUE,
    seed = 42
) {
  
  withr::local_seed(seed)
  purrr::map_dfr(
    1:n_sims,
    \(.sim) {
      sampled_shots <- shots |> 
        slice_sample(n = n_shots_to_sample, replace = replace)
      
      list(
        sim = .sim,
        xg = sum(sampled_shots$xg),
        g = sum(sampled_shots$g),
        o = sum(sampled_shots$g) / sum(sampled_shots$xg)
      )
    }
  )
}

resample_one_player_o <- function(shots, target_season_end_year) {
  target_shots <- shots |>
    dplyr::filter(season_end_year == target_season_end_year)
  
  prior_shots <- shots |>
    dplyr::filter(season_end_year < target_season_end_year)
  
  prior_shots |> 
    resample_player_shots(
      n_shots_to_sample = nrow(target_shots)
    )
}

resample_player_o <- function(shots, players, target_season_end_year = TARGET_SEASON_END_YEAR) {
  purrr::map_dfr(
    players,
    \(.player) {
      shots |> 
        dplyr::filter(player == .player) |> 
        resample_one_player_o(
          target_season_end_year = target_season_end_year
        ) |> 
        dplyr::mutate(
          player = .player
        )
    }
  )
}

maddison_resampled_o <- np_shots |> 
  resample_player_o(
    players = 'James Maddison'
  ) |> 
  dplyr::inner_join(
    wide_player_np_shots |> 
      dplyr::select(
        player,
        prior_o,
        target_o
      ),
    by = dplyr::join_by(player)
  ) |> 
  dplyr::arrange(player)

maddison_u_approach2 <- maddison_resampled_o |>
  dplyr::summarize(
    .by = c(player, prior_o, target_o),
    u = sum(o <= target_o) / n()
  ) |> 
  dplyr::arrange(player)
< details open=""> < summary>Approach 2 output for Maddison
maddison_u_approach2 |> dplyr::select(player, prior_o, target_o, u)
#> # A tibble: 1 × 4
#>   player         prior_o target_o     u
#>   <chr>            <dbl>    <dbl> <dbl>
#> 1 James Maddison    1.38    0.847 0.163

The plot below should provide a bit of visual intuition as to what’s going on.

These results imply that Maddison’s 2023/24 ratio of 0.847 (or worse) occurs in 16.3% of simulations, i.e. a 16th percentile outcome. That’s a bit higher than what the first approach showed.

How can we feel more confident about this approach? Well, in the first approach, we assumed that the underperforming unlikelihood percentages should be uniform across all players, hence the percentile ranking. I think that’s a good assumption, so we should see if the same bears out with this second approach.

The plot below shows a histogram of the underperforming unlikelihood across all players, where each player’s estimated unlikelihood is grouped into a decile.

< details> < summary>Approach 2 implementation for all players
all_resampled_o <- np_shots |> 
  resample_player_o(
    players = all_players_to_evaluate$player
  ) |> 
  dplyr::inner_join(
    wide_player_np_shots |> 
      ## to make sure we just one Rodri, Danilo, and Nicolás González 
      dplyr::filter(player_id %in% all_players_to_evaluate$player_id) |> 
      dplyr::select(
        player,
        prior_o,
        target_o,
        prior_shots,
        target_shots
      ),
    by = dplyr::join_by(player)
  ) |> 
  dplyr::arrange(player, player)

all_u_approach2 <- all_resampled_o |>
  dplyr::summarize(
    .by = c(player, prior_o, target_o, prior_shots, target_shots),
    u = sum(o <= target_o) / n()
  ) |> 
  dplyr::arrange(u)

Indeed, the histogram shows a fairly uniform distribution, with a bit of irregularity at the very edges.

Looking at who is in the lower end of the leftmost decile, we see some of the same names–Immobile and Savanier–among the ten underperformers. (Withholding judgment on the superiority of any methodology, we can find some solace in seeing some of the same names among the most unlikely underperformers here as we did with approach 1.)

< details open=""> < summary>Approach 2 output, top 10 underperforming players
all_u_approach2 |> head(10) |> dplyr::select(player, prior_o, target_o, u)
#> # A tibble: 10 × 4
#>    player                    prior_o target_o     u
#>    <chr>                       <dbl>    <dbl> <dbl>
#>  1 Erling Haaland               1.26    0.791 0.004
#>  2 Amine Harit                  1.27    0.262 0.01 
#>  3 Pierre-Emerick Aubameyang    1.07    0.606 0.01 
#>  4 Téji Savanier                1.42    0.282 0.01 
#>  5 Antonio Sanabria             1.05    0.386 0.014
#>  6 Alex Baena                   1.54    0.348 0.016
#>  7 Elye Wahi                    1.38    0.753 0.017
#>  8 Kevin Behrens                1.39    0.673 0.019
#>  9 Ciro Immobile                1.23    0.383 0.02 
#> 10 Ansu Fati                    1.31    0.430 0.024

One familiar face in the printout above is Manchester City’s striker Erling Haaland, whose underperformance this season has been called among fans and the media. His sub-par performance this year ranked as a 7th percentile outcome by approach 1, which is very low, but not quite as low as what this approach finds.

Here are some parting thoughts on this methodology before we look at another:

< section id="approach-3-evaluating-a-player-specific-cumulative-distribution-function-cdf" class="level3">

Approach 3: Evaluating a Player-Specific Cumulative Distribution Function (CDF)

If we assume that the set of goals-to-xG ratios come from a Gamma data-generating process, then we can leverage the properties of a player-level Gamma distribution to assess the unlikelihood of a players ratio.

To calculate the underperforming unlikeliness :

  1. Estimate a Gamma distribution to model a player’s true outperformance ratio across all prior shots, excluding those in the target season–.
  2. Calculate the probability that is less than or equal to the player’s observed in the target season using the Gamma distribution’s cumulative distribution function (CDF).

While that may sound daunting, I promise that it’s not (well, aside from a bit of “magic” in estimating a reasonable Gamma distribution per player).

< details> < summary>Approach 3 implementation
N_SIMS <- 10000

SHOT_TO_SHAPE_MAPPING <- list(
  'from' = c(50, 750),
  'to' = c(1, 25)
)
estimate_one_gamma_distributed_o <- function(
    shots,
    target_season_end_year
) {
  player_np_shots <- shots |> 
    dplyr::mutate(is_target = season_end_year == target_season_end_year)
  
  prior_player_np_shots <- player_np_shots |> 
    dplyr::filter(!is_target)
  
  target_player_np_shots <- player_np_shots |> 
    dplyr::filter(is_target)
  

  agg_player_np_shots <- player_np_shots |>
    dplyr::summarize(
      .by = c(is_target),
      shots = dplyr::n(),
      dplyr::across(c(g, xg), \(.x) sum(.x))
    ) |> 
    dplyr::mutate(o = g / xg)
  
  agg_prior_player_np_shots <- agg_player_np_shots |> 
    dplyr::filter(!is_target)
  
  agg_target_player_np_shots <- agg_player_np_shots |> 
    dplyr::filter(is_target)

  shape <- dplyr::case_when(
    agg_prior_player_np_shots$shots < SHOT_TO_SHAPE_MAPPING$from[1] ~ SHOT_TO_SHAPE_MAPPING$to[2],
    agg_prior_player_np_shots$shots > SHOT_TO_SHAPE_MAPPING$from[2] ~ SHOT_TO_SHAPE_MAPPING$to[2],
    TRUE ~ scales::rescale(
      agg_prior_player_np_shots$shots, 
      from = SHOT_TO_SHAPE_MAPPING$from, 
      to = SHOT_TO_SHAPE_MAPPING$to
    )
  )
  list(
    'shape' = shape,
    'rate' = shape / agg_prior_player_np_shots$o
  )
}

estimate_gamma_distributed_o <- function(
    shots,
    players,
    target_season_end_year
) {
  
  purrr::map_dfr(
    players,
    \(.player) {
      params <- shots |> 
        dplyr::filter(player == .player) |> 
        estimate_one_gamma_distributed_o(
          target_season_end_year = target_season_end_year
        )
      
      list(
        'player' = .player,
        'params' = list(params)
      )
    }
  )
}

select_gamma_o <- np_shots |> 
  estimate_gamma_distributed_o(
    players = 'James Maddison',
    target_season_end_year = TARGET_SEASON_END_YEAR
  ) |> 
  dplyr::inner_join(
    wide_player_np_shots |> 
      dplyr::select(
        player,
        prior_o,
        target_o
      ),
    by = dplyr::join_by(player)
  ) |> 
  dplyr::arrange(player)

maddison_u_approach3 <- select_gamma_o |> 
  dplyr::mutate(
    u = purrr::map2_dbl(
      target_o,
      params,
      \(.target_o, .params) {
        pgamma(
          .target_o, 
          shape = .params$shape, 
          rate = .params$rate,
          lower.tail = TRUE
        )
      }
    ),
    ou = 1 - u
  ) |> 
  tidyr::unnest_wider(params)
< details open=""> < summary>Approach 3 output for Maddison
maddison_u_approach3 |> dplyr::select(player, prior_o, target_o, u)
#> # A tibble: 1 × 4
#>   player         prior_o target_o     u
#>   <chr>            <dbl>    <dbl>  <dbl>
#> 1 James Maddison    1.38    0.847 0.0679

We see that Maddison’s 2023/24 ratio of 0.847 (or worse) is about a 7th percentile outcome given his prior shot history. The 7th percentile outcome estimated is about not far from the 3rd percentile outcome estimated with approach 1.

To gain some intuition around this approach, we can plot out the Gamma distributed estimate of Maddison’s . The result is a histogram that looks not all that dissimilar to the one from before with resampled shots, just much smoother (since this is a “parametric” approach).

As with approach 2, we should check to see what the distribution of underperforming unlikeliness looks like–we should expect to see a somewhat uniform distribution.

< details> < summary>Approach 3 for all players
all_gamma_o <- np_shots |> 
  estimate_gamma_distributed_o(
    players = all_players_to_evaluate$player,
    target_season_end_year = TARGET_SEASON_END_YEAR
  ) |> 
  dplyr::inner_join(
    wide_player_np_shots |> 
      dplyr::filter(
        player_id %in% all_players_to_evaluate$player_id
      ) |> 
      dplyr::select(
        player,
        prior_o,
        target_o
      ),
    by = dplyr::join_by(player)
  ) |> 
  dplyr::arrange(player)

all_u_approach3 <- all_gamma_o |> 
  dplyr::mutate(
    u = purrr::map2_dbl(
      target_o,
      params,
      \(.target_o, .params) {
        pgamma(
          .target_o, 
          shape = .params$shape, 
          rate = .params$rate,
          lower.tail = TRUE
        )
      }
    )
  ) |> 
  tidyr::unnest_wider(params) |> 
  dplyr::arrange(u)

This histogram has a bit more distortion than our resampling approach, so perhaps it’s a little less calibrated.

Looking at the top 10 strongest underperformers, 3 of the names here–Immobile, Savanier, and Sanabria–are shared with approach 2’s top 10, and 7 are shared with approach 1’s top 10.

< details open=""> < summary>Approach 3 output, top 10 underperforming players
all_u_approach3 |> head(10) |> dplyr::select(player, prior_o, target_o, u)
#> # A tibble: 10 × 4
#>    player           prior_o target_o         u
#>    <chr>              <dbl>    <dbl>      <dbl>
#>  1 Ciro Immobile       1.23    0.383 0.00000533
#>  2 Téji Savanier       1.42    0.282 0.000127  
#>  3 Giovanni Simeone    1.03    0.306 0.000248  
#>  4 Adrien Thomasson    1.18    0.282 0.000346  
#>  5 Wahbi Khazri        1.11    0.322 0.000604  
#>  6 Fabián Ruiz Peña    1.67    0.510 0.00271   
#>  7 Nabil Fekir         1.14    0.5   0.00308   
#>  8 Kevin Volland       1.18    0.450 0.00408   
#>  9 Antonio Sanabria    1.05    0.386 0.00877   
#> 10 Nicolò Barella      1.11    0.417 0.0138

We can visually check the consistency of the results from this method with the prior two with scatter plots of the estimated underperforming unlikeliness from each.

If two of the approaches were perfectly in agreement, then each point, representing one of the 593 evaluated players, would fall along the 45-degree slope 1 line.

With that in mind, we can see that approach 3 is slightly more precise in relation to approach 1, but approach 3 tends to assign slightly higher percentiles to players on the whole. The results from approaches 2 and 3 also have a fair degree of agreement, and the results are more uniformly calibrated.

Stepping back from the results, what can we say about the principles of the methodology?

< section id="conclusion" class="level1">

Conclusion

Here’s a summary of the biggest pros and cons of each approach, along with the result for Maddison.

Approach Description Biggest Pro Biggest Con Maddison 2023/24 Underperformance Unlikeliness
1 Percentile Ranking customizable sensitive to the choice of players to evaluate 3rd percentile
2 Resampling non-parametric limited by player’s shot history 16th percentile
3 Cumulative Distribution Function (CDF) flexible sensitive to choice of distribution parameters 7th percentile

I personally prefer either the second or third approach. In practice, perhaps the best thing to do is take an ensemble average of each approach, as they each have their pros and cons.

< section id="appendix" class="level1">

Appendix

< section id="approach-0-t-test" class="level3">

Approach 0: -test

If you have some background in statistics, applying a -test (using shot-weighted averages and standard deviations) may be an approach that comes to mind.

< details> < summary>Approach 0
u_approach0 <- player_season_np_shots |> 
  dplyr::semi_join(
    all_players_to_evaluate |> dplyr::select(player_id),
    by = dplyr::join_by(player_id)
  ) |> 
  dplyr::filter(season_end_year < TARGET_SEASON_END_YEAR) |> 
  dplyr::summarise(
    .by = c(player),
    mean = weighted.mean(o, w = shots),
    ## could also use a function like Hmisc::wtd.var for weighted variance
    sd = sqrt(sum(shots * (o - weighted.mean(o, w = shots))^2) / sum(shots))
  ) |> 
  dplyr::inner_join(
    wide_player_np_shots |> 
      dplyr::select(player, prior_o, target_o),
    by = dplyr::join_by(player)
  ) |> 
  dplyr::mutate(
    z_score = (target_o - mean) / sd,
    ## multiply by 2 for a two-sided t-test
    u = pnorm(-abs(z_score))
  ) |> 
  dplyr::select(-c(mean, sd)) |> 
  dplyr::arrange(player)
< details open=""> < summary>Approach 0 output
u_approach0 |> 
  dplyr::filter(player == 'James Maddison') |> 
  dplyr::select(player, prior_o, target_o, u) 
#> # A tibble: 1 × 4
#>   player         prior_o target_o      u
#>   <chr>            <dbl>    <dbl>  <dbl>
#> 1 James Maddison    1.38    0.847 0.0707

In reality, this isn’t giving us a percentage of unlikelihood of the outcome. Rather, the p-value measures the probability of underformance as extreme as the underperformance observed in 2023/24 if the null hypothesis is true. The null hypothesis in this case would be that there is no significant difference between the player’s actual ratio in the 2023/24 season and the distribution of outperformance ratios observed in previous seasons.

No matching items
< section id="footnotes" class="footnotes footnotes-end-of-document">

Footnotes

  1. I only consider non-penalty xG and goals for this post. The ability to score penalties at a high success rate is generally seen as a different skill set than the ability to score goals in open play.↩︎

  2. The raw difference between goals and xG is a reasonable measure of shooting performance, but it can “hide” shot volume. Is it fair to compare a player who takes 100 shots in a year and scores 12 goals on 10 xG with a player who takes 10 shots and scores 3 goals on 1 xG? The raw difference is +2 in both cases, indicating no difference in the shooting performance for the two players. But , as defined here, would be 1.2 and 3 respectively, hinting at the former player’s small sample size.↩︎

  3. While I focus on underperformance in this post, “overperformance” could be quantified in a similar (i.e. symmetrical) manner with each technique.↩︎

  4. The weighting here is to reflect the intuition that players who have taken a lot of shots in the past and have an uncharacteristic season are shown as more unlikely than a player with only one prior season of shots, who happens to have a very different goals-to-expected goals ratio in the latter season.↩︎

  5. An overperforming unlikelihood could be calculated by sorting in descending order instead.↩︎

  6. Percentiles greater than 50% generally correspond with players who have overperformed, so really the bottom 50% are the players we’re looking at when we’re considering underperformance unlikeliness.↩︎

  7. Sampling without replacement is also reasonable. However, some players wouldn’t have enough shots from prior seasons to match their volume of shots in 2023/24.↩︎

  8. should be set equal to the number of shots a player has taken in the target season, i.e. 2023/24 here. should be set to some fairly large number, so as to achieve stability in the results.↩︎

  9. Similarly, to estimate the unlikeness of an overperforming season, count up in how many simulations the outperformance ratio of the resampled shots is greater than and calculate the proportion .↩︎

To leave a comment for the author, please follow the link and comment on their blog: Tony's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version