Regression to the Mean
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We also see this in sports, Hollinger refers to it as the fluke rule. Bill James calls it the plexiglas principle, you might call it the Sports Illustrated Jinx. They all are different names from different people who have observed the same thing mean reversion or regression to the mean.
Regression to the mean as a concept was first introduced in 1886 by Galton who observed that on average taller parents have tall kids but not as tall as them. In otherwords their heights ‘shrunk’ towards the overall average at the time.
Regression to the mean is a powerful concept in statistics and how can we use it?
The following will be a simple goal-kicking example but if you wanted to, you could use something like supercoach scores (SC) to beat your friends or the bookie. Maybe you want to use it more at a team level? Will total scores return via some regression to the mean?
Step 1 create the dataset
Thankfully we can use fitzRoy for the data we need pretty easily. Here we will take goals (G), behinds (B) and create a variable using mutate
called shots
which just represent total shots on goals (we don’t have access to kicks that were out on the full or dropped short) so this becomes the sum of goals and behinds.
We are looking at regression to the mean for goal kicking accuracy on a year to year bases on a player level. Hence our group_by(Player, Team, Season)
library(tidyverse) ## -- Attaching packages ---------------------------------------- tidyverse 1.2.1 -- ## v ggplot2 2.2.1 v purrr 0.2.5 ## v tibble 1.4.2 v dplyr 0.7.6 ## v tidyr 0.8.1 v stringr 1.3.1 ## v readr 1.1.1 v forcats 0.3.0 ## -- Conflicts ------------------------------------------- tidyverse_conflicts() -- ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() library(fitzRoy) data<-fitzRoy::player_stats data<-data%>% group_by(Player, Team, Season)%>% summarise(TG=sum(G), TB=sum(B)) data%>% group_by(Player,Team, Season)%>% mutate(shots= TG + TB)%>% ggplot(aes(x=shots))+geom_histogram(binwidth = 5)
data<-data%>% group_by(Player,Team, Season)%>% mutate(shots= TG + TB)
Step Two Subsetting our data, finding ours means
Looking at our earlier plot we see that we have lots of players who have failed to take a single shot at goal. We don’t want this to effect what we are doing down the track so lets arbitrary cut off shots at say 20 or more.
subset = data$shots > 19 good = data$TG[subset] had_a_go = data$shots[subset] names(good) = names(had_a_go) = paste(data$Player[subset], data$Season[subset], data$Team[subset])
I know the above is tidyverse, but sometimes its nice to contrast it with base.
indl_player_shot_pert = good/had_a_go league_wide_pert = sum(good)/sum(had_a_go) print(league_wide_pert) ## [1] 0.6007096 sigma2L = league_wide_pert*(1-league_wide_pert)/had_a_go sigma2T = 0 for (i in 1:100000) { weight = 1/(2*(sigma2T + sigma2L)^2) sigma2T = sum(weight*((indl_player_shot_pert - league_wide_pert)^2 - sigma2L))/sum(weight) } print(sigma2T) ## [1] 0.001483404 sqrt(sigma2T) ## [1] 0.03851498 skill = (league_wide_pert/sigma2T + indl_player_shot_pert/sigma2L)/(1/sigma2T + 1/sigma2L) head(sort(skill, decreasing = TRUE)) ## Luke Breust 2014 Hawthorn Tory Dickson 2015 Western Bulldogs ## 0.6681194 0.6577340 ## Christopher Mayne 2012 Fremantle Drew Petrie 2012 North Melbourne ## 0.6554409 0.6526506 ## Barry Hall 2011 Western Bulldogs Cyril Rioli 2016 Hawthorn ## 0.6509844 0.6501355 tail(sort(skill, decreasing = TRUE)) ## Marcus Bontempelli 2017 Western Bulldogs ## 0.5559301 ## Dustin Martin 2013 Richmond ## 0.5543815 ## Daniel Jackson 2011 Richmond ## 0.5518519 ## Lewis Jetta 2010 Sydney ## 0.5489428 ## Jack Billings 2017 St Kilda ## 0.5443335 ## Lindsay Thomas 2011 North Melbourne ## 0.5401664 plot(indl_player_shot_pert, skill, xlim = c(.2, .5), col = 'orange') abline(0, 1, col = 'blue')
plot(density(indl_player_shot_pert), ylim = c(0, 30), col = 'orange', main = 'Distribution of Goal kicking skill in AFL', xlab = 'Goal Kicking percentage') lines(density(skill), col = 'blue') legend('topright', c('True skill', 'Observed'), col = c('blue', 'orange'), lty = 1)
How do we interpret these results?
You might first look at Luke Breust 2014 and it looks as though hes leading in accuracy you see the number below him 0.6701927
and are now thinking maybe that’s how many shots he converted. That is easy to check and we can do that using the script below.
library(tidyverse) library(fitzRoy) data<-fitzRoy::player_stats data%>%filter(Player=="Luke Breust" & Season==2014)%>% summarise(TG=sum(G), TB=sum(B))%>% mutate(shotpert=TG/(TG+TB)) ## TG TB shotpert ## 1 57 12 0.826087
We get 0.826087
which is not 0.6701927
so what is going on here?
Well our 0.6701927
is what we would call our luck adjusted percentage for Luke Breuest in 2014.
But what does this mean and how do we check it?
Our skill formula is as follow in R
skill = (league_wide_pert/sigma2T + indl_player_shot_pert/sigma2L)/(1/sigma2T + 1/sigma2L)
But what does this mean in english?
When we think about regression to the mean really we want to know these three things. * How good is the league or what is the population average * How good was the individual * After regressesing the indivual back what is left is their underlying true skill.
Lets look at the skill formula above, we have our
- league_wide_pert = 0.6009457 (League wide scoring shot percentage)
- indl_player_shot_pert= 57/(57+12) (Luke Breust shot percentage for 2014)
- sigma2T =0.001543792 (underlying true skill standard deviation of league)
- sigma2L = 0.003475507 (underlying Luke Breust standard deviation for 2014) subbing all that into our skill formula above we get
((0.6009457/0.001543792) + (57/(57+12))/0.003475507 )/((1/0.001543792) + (1/0.003475507))
Which just happens to be our skill for Luke Breust show earlier.
With an estimated 0.6701927 or 67.02% true skill for goal conversion, Luke Bruest is above average in true league average skill (0.6009457) and well outside one standard deviation of the population. His observed goal conversation percentage is 0.8261, which our estimate using regression to the mean is 0.6701927 which means he is kicking above his true talent level. It can be thought of as hes getting really lucky. Looking at his most recent completed year, he had a observed percentage of 0.66 (33/33+17) in other words as he aged and should have been getting better, he regressed towards his true mean which is still above league average. This year he is again below his best but still above league average.
fitzRoy::player_stats%>% filter(Player=="Luke Breust" &Season==2018)%>% summarise(TG=sum(G),TB=sum(B)) %>% mutate(accuracy=TG/(TG+TB)) ## TG TB accuracy ## 1 45 21 0.6818182
This shouldn’t surprise us, to put in perspective just how good Luke Breusts raw accuracy is lets create a table ranking players by accuracy over the years. Where not surprisingly he comes out as top of the list, while the mean accuracy of that group is 0.6369235. So it should be no surprise that Lukes accuracy was not sustainable
fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = Sys.Date())%>% select(First.name, Surname, Playing.for, Season, ID, Goals,Behinds)%>% group_by(First.name, Surname, Playing.for, Season, ID)%>% summarise(TG=sum(Goals),TB=sum(Behinds))%>% mutate(accuracy=(TG/(TG+TB)))%>% filter(TG>49 & Season>1964)%>% arrange(desc(accuracy)) ## Returning data from 1897-01-01 to 2018-09-04 ## Downloading data ## ## Finished downloading data. Processing XMLs ## Warning: Unknown columns: `Substitute` ## Finished getting afltables data ## Adding missing grouping variables: `Round`, `Home.team`, `Away.team` ## # A tibble: 504 x 8 ## # Groups: First.name, Surname, Playing.for, Season [504] ## First.name Surname Playing.for Season ID TG TB accuracy ## <chr> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl> ## 1 Luke Breust Hawthorn 2014 11954 57 12 0.826 ## 2 Stephen Milne St Kilda 2002 978 50 11 0.820 ## 3 Tony Lockett St Kilda 1993 990 53 12 0.815 ## 4 Tory Dickson Western Bulldo~ 2015 12043 50 12 0.806 ## 5 Matthew Lloyd Essendon 2008 336 62 16 0.795 ## 6 Barry Richardson Richmond 1971 2414 50 13 0.794 ## 7 Tony Lockett St Kilda 1985 990 79 22 0.782 ## 8 George Young St Kilda 1978 2299 70 21 0.769 ## 9 Peter Hudson Hawthorn 1970 1822 146 44 0.768 ## 10 Tony Lockett St Kilda 1989 990 78 24 0.765 ## # ... with 494 more rows
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.