Taking Home Charlie
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Build your own Brownlow Model
G’day all, your friendly neighbourhood PhD student here. I am here to help you all build your own models.
That is very different from what you might be used to which could consist of any one of the following scenarios.
- Here is a model I have build that you are not smart enough too, follow my tips blindly
- Here is a model that I have build with data you don’t have access to, therefore it is better than what you can do follow my tips
- Here is a model that I don’t explain well apart from performance, follow my tips
Hopefully you see my point and see the unfortunate pattern.
This is NOT going to be one of those things, this is a guide to help you build a competitive Brownlow prediction model. It is fully reproducible and you can use the variables created if you want, you can create your own variables but by the end of this process the model is yours.
So let me try to ease some possible thoughts.
- Where can I get the data from?
James Day and myself have worked together to produce an R package called fitzRoy which means that you and anyone else can have easy access to all the good stuff you see on afltables and footywire.
- But what is the point other people do models why build my own when I can just use someone else’s?
That is a good point, there are Brownlow Models floating around, and you could even go to a tipping service and pay for tips. But I assume you are interested in predicting the Brownlow Medallist because you are a football nuffie and as such have ideas about what you think have an effect on how umpires vote.
- So if I have the data and I can build my own model, but I don’t want to because it won’t be any good.
All models are wrong but some are useful – This model is by no means perfect, that isn’t the point. But it will provide you a predicted order that has less ranking error vs Champion Data’s official prediction using data that you don’t have access to without \($\)$
- So what?
In a nutshell, below is a fully reproducible method you can edit and tinker with that without any tinkering with produced a better top 10 Brownlow Medal prediction vs the official data provider without any of the bells and whistles!
What Packages do I need?
library(MASS) library(ordinal) library(fitzRoy) library(tidyverse)
So now you have the packages you need, first steps lets get the data ready
Getting the data
First we can use the fitzRoy package to get all the data from afltables
This will consist of basically 3 steps
Get the data from afltables, we can do this using the
fitzRoy::get_afltables_stats
Create the game by game summaries so we have the total kicks per game, total handballs per game by team for all the statistics available on afltables
Join on scores – because people feel as though margin has an effect on the order of 3,2,1 (Hard to get the 3 votes in a flogging)
We do this using the fitzRoy::match_results
and then we join the datasets together creating a final data frame that has all player statistics at an individual level, joined with their team totals and game score.
df<-fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = Sys.Date()) ## Returning data from 1897-01-01 to 2018-09-20 ## Downloading data ## ## Finished downloading data. Processing XMLs ## Finished getting afltables data df<-df%>%filter(Season>2010) names(df) ## [1] "Season" "Round" ## [3] "Date" "Local.start.time" ## [5] "Venue" "Attendance" ## [7] "Home.team" "HQ1G" ## [9] "HQ1B" "HQ2G" ## [11] "HQ2B" "HQ3G" ## [13] "HQ3B" "HQ4G" ## [15] "HQ4B" "Home.score" ## [17] "Away.team" "AQ1G" ## [19] "AQ1B" "AQ2G" ## [21] "AQ2B" "AQ3G" ## [23] "AQ3B" "AQ4G" ## [25] "AQ4B" "Away.score" ## [27] "First.name" "Surname" ## [29] "ID" "Jumper.No." ## [31] "Playing.for" "Kicks" ## [33] "Marks" "Handballs" ## [35] "Goals" "Behinds" ## [37] "Hit.Outs" "Tackles" ## [39] "Rebounds" "Inside.50s" ## [41] "Clearances" "Clangers" ## [43] "Frees.For" "Frees.Against" ## [45] "Brownlow.Votes" "Contested.Possessions" ## [47] "Uncontested.Possessions" "Contested.Marks" ## [49] "Marks.Inside.50" "One.Percenters" ## [51] "Bounces" "Goal.Assists" ## [53] "Time.on.Ground.." "Substitute" ## [55] "Umpire.1" "Umpire.2" ## [57] "Umpire.3" "Umpire.4" ## [59] "group_id" team_stats<-df%>% dplyr::select(Date, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists)%>% group_by(Date, Season, Round, Playing.for)%>% summarise_if(is.numeric,funs(sum=c(sum(.)))) ## Adding missing grouping variables: `Home.team`, `Away.team` player_stats<-df%>% dplyr::select(Date, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists) ## Adding missing grouping variables: `Home.team`, `Away.team` complete_df<-left_join(player_stats,team_stats, by=c("Date"="Date", "Season"="Season", "Playing.for"="Playing.for")) dataset_scores<-fitzRoy::match_results names(dataset_scores) ## [1] "Game" "Date" "Round" "Home.Team" ## [5] "Home.Goals" "Home.Behinds" "Home.Points" "Away.Team" ## [9] "Away.Goals" "Away.Behinds" "Away.Points" "Venue" ## [13] "Margin" "Season" "Round.Type" "Round.Number" dataset_scores1<-dataset_scores%>%dplyr::select (Date, Round, Home.Team, Home.Points,Game) dataset_scores2<-dplyr::select(dataset_scores, Date, Round, Away.Team, Away.Points,Game) #Sometimes when joining datasets together it helps to rename things for consistency colnames(dataset_scores1)[3]<-"Team" colnames(dataset_scores1)[4]<-"Points" colnames(dataset_scores2)[3]<-"Team" colnames(dataset_scores2)[4]<-"Points" df5<-rbind(dataset_scores1,dataset_scores2) dataset_margins<-df5%>%group_by(Game)%>% arrange(Game)%>% mutate(margin=c(-diff(Points),diff(Points))) # View(dataset_margins) # I have commented this out, but always good to view dataset_margins$Date<-as.Date(dataset_margins$Date) complete_df$Date<-as.Date(complete_df$Date) complete_df<-left_join(complete_df,dataset_margins,by=c("Date"="Date", "Playing.for"="Team"))
Create the Ratios
Create ratios is important for the model, remember we are interested in their relative performance to their peers. Is it impressive kicking 3 goals if 25 have been kicked? Or is it more impressive getting 3 if only 5 goals have been kicked?
Intuition like this for kicks, handballs, clearances etc would lead us to believe that ratios are better predictors of polling votes over raw statistics.
complete_df_ratio<-complete_df%>% mutate(kick.ratio=Kicks/Kicks_sum, Marks.ratio=Marks/Marks_sum, handball.ratio=Handballs/Handballs_sum, Goals.ratio=Goals/Goals_sum, behinds.ratio=Behinds/Behinds_sum, hitouts.ratio=Hit.Outs/Hit.Outs_sum, tackles.ratio=Tackles/Tackles_sum, rebounds.ratio=Rebounds/Rebounds_sum, inside50s.ratio=Inside.50s/Inside.50s_sum, clearances.ratio=Clearances/Clearances_sum, clangers.ratio=Clangers/Clangers_sum, freefors.ratio=Frees.For/Frees.For_sum, freesagainst.ratio=Frees.Against/Frees.Against_sum, Contested.Possessions.ratio=Contested.Possessions/Contested.Possessions_sum, Uncontested.Possessions.ratio=Uncontested.Possessions/Uncontested.Possessions_sum, contested.marks.ratio=Contested.Marks/Contested.Marks_sum, marksinside50.ratio=Marks.Inside.50/Marks.Inside.50_sum, one.percenters.ratio=One.Percenters/One.Percenters_sum, bounces.ratio=Bounces/Bounces_sum, goal.assists.ratio=Goal.Assists/Goal.Assists_sum, disposals.ratio=(Kicks+Handballs)/(Kicks_sum+Handballs_sum)) df<-complete_df_ratio%>%dplyr::select(Date, First.name, Surname, Season, Round.x, Playing.for,-Brownlow.Votes, Brownlow.Votes_sum,everything()) df<-df%>%dplyr::select(-Brownlow.Votes,everything()) df[is.na(df)] <- 0 # we have to replace NA with 0
Now we have all that we need for the dataset now lets do some modelling!
Fit a proportional odds model
Basically all we are doing is using the dataset we have and fitting a proportional odds model. What this means is that we will end up with a probability that each player will poll 3 votes, 2 votes, 1 vote and 0 votes for each and every single game.
This will involve a few steps
- use
subset
to get our training data - use
filter
to make sure we only have home and away games –this is a data knowledge step, we know that you can not get votes in finals, but the column instead of being left empty is filled with 0s so an easy way to make sure we only have home and away games is using Round as a filter.
in.sample <- subset(df, Season %in% c(2013:2016)) # our training data in.sample$Brownlow.Votes <- factor(in.sample$Brownlow.Votes) in.sample<-in.sample%>%filter(Round.x %in% c("1","2","3","4","5","6","7","8", "9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24")) names(in.sample) ## [1] "Date" "First.name" ## [3] "Surname" "Season" ## [5] "Round.x" "Playing.for" ## [7] "Brownlow.Votes_sum" "Home.team" ## [9] "Away.team" "Kicks" ## [11] "Marks" "Handballs" ## [13] "Goals" "Behinds" ## [15] "Hit.Outs" "Tackles" ## [17] "Rebounds" "Inside.50s" ## [19] "Clearances" "Clangers" ## [21] "Frees.For" "Frees.Against" ## [23] "Contested.Possessions" "Uncontested.Possessions" ## [25] "Contested.Marks" "Marks.Inside.50" ## [27] "One.Percenters" "Bounces" ## [29] "Goal.Assists" "Round.y" ## [31] "Kicks_sum" "Marks_sum" ## [33] "Handballs_sum" "Goals_sum" ## [35] "Behinds_sum" "Hit.Outs_sum" ## [37] "Tackles_sum" "Rebounds_sum" ## [39] "Inside.50s_sum" "Clearances_sum" ## [41] "Clangers_sum" "Frees.For_sum" ## [43] "Frees.Against_sum" "Contested.Possessions_sum" ## [45] "Uncontested.Possessions_sum" "Contested.Marks_sum" ## [47] "Marks.Inside.50_sum" "One.Percenters_sum" ## [49] "Bounces_sum" "Goal.Assists_sum" ## [51] "Round" "Points" ## [53] "Game" "margin" ## [55] "kick.ratio" "Marks.ratio" ## [57] "handball.ratio" "Goals.ratio" ## [59] "behinds.ratio" "hitouts.ratio" ## [61] "tackles.ratio" "rebounds.ratio" ## [63] "inside50s.ratio" "clearances.ratio" ## [65] "clangers.ratio" "freefors.ratio" ## [67] "freesagainst.ratio" "Contested.Possessions.ratio" ## [69] "Uncontested.Possessions.ratio" "contested.marks.ratio" ## [71] "marksinside50.ratio" "one.percenters.ratio" ## [73] "bounces.ratio" "goal.assists.ratio" ## [75] "disposals.ratio" "Brownlow.Votes" in.sample$Player<-paste(in.sample$First.name,in.sample$Surname) in.sample<-in.sample%>%dplyr::select(Player, Date, Season, Round.x, Playing.for, margin:Brownlow.Votes) ## Adding missing grouping variables: `Home.team`, `Away.team` in.sample<-in.sample[-c(1,2)] #removed for 'cleaner look' fm1<-clm(Brownlow.Votes~ handball.ratio + Marks.ratio + disposals.ratio+ hitouts.ratio+ freefors.ratio + freesagainst.ratio + tackles.ratio + Goals.ratio + behinds.ratio + Contested.Possessions.ratio+ Uncontested.Possessions.ratio + clangers.ratio + contested.marks.ratio + marksinside50.ratio + clearances.ratio + rebounds.ratio + inside50s.ratio + one.percenters.ratio + bounces.ratio+ goal.assists.ratio +margin, data = in.sample) ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met
Variable selection
Nothing fancy here, remember the point is that you can pick and choose what variables you want as your predictors, here I am just going to use stepwise selection and go backwards.
But by all means, I am not saying this is the best way, I’m just getting to a result as quickly as I can with little thinking about the variables.
Ideally you might think about the variables and test them out for fit but moving on.
fm2<- stepAIC(fm1, direction='backward',type=AIC) ## Start: AIC=13631.84 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + behinds.ratio + Contested.Possessions.ratio + ## Uncontested.Possessions.ratio + clangers.ratio + contested.marks.ratio + ## marksinside50.ratio + clearances.ratio + rebounds.ratio + ## inside50s.ratio + one.percenters.ratio + bounces.ratio + ## goal.assists.ratio + margin ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio ## - Rescale variables? ## In addition: Absolute and relative convergence criteria were met ## Df AIC ## - Uncontested.Possessions.ratio 1 13630 ## - rebounds.ratio 1 13630 ## - behinds.ratio 1 13631 ## - goal.assists.ratio 1 13631 ## - inside50s.ratio 1 13631 ## <none> 13632 ## - freesagainst.ratio 1 13638 ## - Contested.Possessions.ratio 1 13639 ## - clearances.ratio 1 13639 ## - freefors.ratio 1 13641 ## - bounces.ratio 1 13647 ## - clangers.ratio 1 13650 ## - marksinside50.ratio 1 13650 ## - Marks.ratio 1 13650 ## - handball.ratio 1 13653 ## - contested.marks.ratio 1 13656 ## - one.percenters.ratio 1 13669 ## - tackles.ratio 1 13690 ## - hitouts.ratio 1 13790 ## - disposals.ratio 1 13863 ## - Goals.ratio 1 14583 ## - margin 1 15649 ## ## Step: AIC=13630.1 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + behinds.ratio + Contested.Possessions.ratio + ## clangers.ratio + contested.marks.ratio + marksinside50.ratio + ## clearances.ratio + rebounds.ratio + inside50s.ratio + one.percenters.ratio + ## bounces.ratio + goal.assists.ratio + margin ## ## Df AIC ## - rebounds.ratio 1 13628 ## - behinds.ratio 1 13629 ## - goal.assists.ratio 1 13629 ## - inside50s.ratio 1 13629 ## <none> 13630 ## - freesagainst.ratio 1 13636 ## - clearances.ratio 1 13638 ## - freefors.ratio 1 13639 ## - bounces.ratio 1 13645 ## - clangers.ratio 1 13648 ## - Marks.ratio 1 13648 ## - marksinside50.ratio 1 13648 ## - handball.ratio 1 13652 ## - contested.marks.ratio 1 13655 ## - Contested.Possessions.ratio 1 13661 ## - one.percenters.ratio 1 13667 ## - tackles.ratio 1 13688 ## - hitouts.ratio 1 13789 ## - Goals.ratio 1 14581 ## - disposals.ratio 1 14601 ## - margin 1 15647 ## ## Step: AIC=13628.54 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + behinds.ratio + Contested.Possessions.ratio + ## clangers.ratio + contested.marks.ratio + marksinside50.ratio + ## clearances.ratio + inside50s.ratio + one.percenters.ratio + ## bounces.ratio + goal.assists.ratio + margin ## ## Df AIC ## - behinds.ratio 1 13627 ## - goal.assists.ratio 1 13627 ## - inside50s.ratio 1 13628 ## <none> 13628 ## - freesagainst.ratio 1 13635 ## - clearances.ratio 1 13636 ## - freefors.ratio 1 13638 ## - bounces.ratio 1 13644 ## - marksinside50.ratio 1 13646 ## - clangers.ratio 1 13646 ## - Marks.ratio 1 13647 ## - contested.marks.ratio 1 13654 ## - handball.ratio 1 13655 ## - Contested.Possessions.ratio 1 13660 ## - one.percenters.ratio 1 13668 ## - tackles.ratio 1 13686 ## - hitouts.ratio 1 13787 ## - Goals.ratio 1 14591 ## - disposals.ratio 1 14799 ## - margin 1 15645 ## ## Step: AIC=13627.36 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + Contested.Possessions.ratio + clangers.ratio + ## contested.marks.ratio + marksinside50.ratio + clearances.ratio + ## inside50s.ratio + one.percenters.ratio + bounces.ratio + ## goal.assists.ratio + margin ## ## Df AIC ## - goal.assists.ratio 1 13626 ## - inside50s.ratio 1 13627 ## <none> 13627 ## - freesagainst.ratio 1 13634 ## - clearances.ratio 1 13634 ## - freefors.ratio 1 13637 ## - bounces.ratio 1 13643 ## - Marks.ratio 1 13646 ## - clangers.ratio 1 13646 ## - marksinside50.ratio 1 13651 ## - contested.marks.ratio 1 13652 ## - handball.ratio 1 13656 ## - Contested.Possessions.ratio 1 13660 ## - one.percenters.ratio 1 13666 ## - tackles.ratio 1 13685 ## - hitouts.ratio 1 13786 ## - Goals.ratio 1 14591 ## - disposals.ratio 1 14806 ## - margin 1 15646 ## ## Step: AIC=13626.24 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + Contested.Possessions.ratio + clangers.ratio + ## contested.marks.ratio + marksinside50.ratio + clearances.ratio + ## inside50s.ratio + one.percenters.ratio + bounces.ratio + ## margin ## ## Df AIC ## - inside50s.ratio 1 13626 ## <none> 13626 ## - clearances.ratio 1 13633 ## - freesagainst.ratio 1 13633 ## - freefors.ratio 1 13636 ## - bounces.ratio 1 13642 ## - Marks.ratio 1 13644 ## - clangers.ratio 1 13645 ## - marksinside50.ratio 1 13651 ## - contested.marks.ratio 1 13651 ## - handball.ratio 1 13654 ## - Contested.Possessions.ratio 1 13659 ## - one.percenters.ratio 1 13664 ## - tackles.ratio 1 13684 ## - hitouts.ratio 1 13784 ## - Goals.ratio 1 14590 ## - disposals.ratio 1 14804 ## - margin 1 15645 ## ## Step: AIC=13625.93 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + Contested.Possessions.ratio + clangers.ratio + ## contested.marks.ratio + marksinside50.ratio + clearances.ratio + ## one.percenters.ratio + bounces.ratio + margin ## ## Df AIC ## <none> 13626 ## - freesagainst.ratio 1 13633 ## - clearances.ratio 1 13634 ## - freefors.ratio 1 13635 ## - bounces.ratio 1 13642 ## - Marks.ratio 1 13643 ## - clangers.ratio 1 13644 ## - marksinside50.ratio 1 13650 ## - contested.marks.ratio 1 13650 ## - Contested.Possessions.ratio 1 13658 ## - handball.ratio 1 13662 ## - one.percenters.ratio 1 13663 ## - tackles.ratio 1 13683 ## - hitouts.ratio 1 13784 ## - Goals.ratio 1 14600 ## - disposals.ratio 1 15042 ## - margin 1 15645
Lets do the prediction
Instead of just taking the data from afltables in 2017, lets do something just a little bit different. Lets use footywire as our out.sample
This might seem like a bit of a weird step, but it is really valuable. Footywire has more data than afltables for games, it contains stats like tackles inside 50 as an example. Also as an example within fitzRoy we essentially have 2 data sources for you to use. Either afltables or footywire. There are benefits to both for example afltables data goes further back, but doesn’t have extra columns that footywire has that you might want to use for example supercoach scores.
names(fitzRoy::player_stats) ## [1] "Date" "Season" "Round" "Venue" ## [5] "Player" "Team" "Opposition" "Status" ## [9] "Match_id" "CP" "UP" "ED" ## [13] "DE" "CM" "GA" "MI5" ## [17] "One.Percenters" "BO" "TOG" "K" ## [21] "HB" "D" "M" "G" ## [25] "B" "T" "HO" "GA1" ## [29] "I50" "CL" "CG" "R50" ## [33] "FF" "FA" "AF" "SC" ## [37] "CCL" "SCL" "SI" "MG" ## [41] "TO" "ITC" "T5" df_2017<-fitzRoy::player_stats%>% filter(Season==2017) team_stats_out<-df_2017%>% dplyr::select(Date, Player,Season, Round, Team, CP:T5)%>% group_by(Date,Season, Round, Team)%>% summarise_if(is.numeric,funs(sum=c(sum(.)))) player_stats_out<-df_2017%>% dplyr::select(Date, Player,Season, Round, Team, CP:T5) complete_df_out<-left_join(player_stats_out,team_stats_out, by=c("Date"="Date", "Season"="Season", "Team"="Team")) dataset_scores<-fitzRoy::match_results names(dataset_scores) ## [1] "Game" "Date" "Round" "Home.Team" ## [5] "Home.Goals" "Home.Behinds" "Home.Points" "Away.Team" ## [9] "Away.Goals" "Away.Behinds" "Away.Points" "Venue" ## [13] "Margin" "Season" "Round.Type" "Round.Number" dataset_scores1<-dataset_scores%>%dplyr::select (Date, Round, Home.Team, Home.Points,Game) dataset_scores2<-dplyr::select(dataset_scores, Date, Round, Away.Team, Away.Points,Game) colnames(dataset_scores1)[3]<-"Team" colnames(dataset_scores1)[4]<-"Points" colnames(dataset_scores2)[3]<-"Team" colnames(dataset_scores2)[4]<-"Points" df5<-rbind(dataset_scores1,dataset_scores2) dataset_margins<-df5%>%group_by(Game)%>% arrange(Game)%>% mutate(margin=c(-diff(Points),diff(Points))) dataset_margins$Date<-as.Date(dataset_margins$Date) complete_df_out$Date<-as.Date(complete_df_out$Date) dataset_margins<-dataset_margins %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane")) dataset_margins<-dataset_margins %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs")) complete_df_out<-left_join(complete_df_out,dataset_margins,by=c("Date"="Date", "Team"="Team")) names(complete_df_out) ## [1] "Date" "Player" "Season" ## [4] "Round.x" "Team" "CP" ## [7] "UP" "ED" "DE" ## [10] "CM" "GA" "MI5" ## [13] "One.Percenters" "BO" "TOG" ## [16] "K" "HB" "D" ## [19] "M" "G" "B" ## [22] "T" "HO" "GA1" ## [25] "I50" "CL" "CG" ## [28] "R50" "FF" "FA" ## [31] "AF" "SC" "CCL" ## [34] "SCL" "SI" "MG" ## [37] "TO" "ITC" "T5" ## [40] "Round.y" "CP_sum" "UP_sum" ## [43] "ED_sum" "DE_sum" "CM_sum" ## [46] "GA_sum" "MI5_sum" "One.Percenters_sum" ## [49] "BO_sum" "TOG_sum" "K_sum" ## [52] "HB_sum" "D_sum" "M_sum" ## [55] "G_sum" "B_sum" "T_sum" ## [58] "HO_sum" "GA1_sum" "I50_sum" ## [61] "CL_sum" "CG_sum" "R50_sum" ## [64] "FF_sum" "FA_sum" "AF_sum" ## [67] "SC_sum" "CCL_sum" "SCL_sum" ## [70] "SI_sum" "MG_sum" "TO_sum" ## [73] "ITC_sum" "T5_sum" "Round" ## [76] "Points" "Game" "margin" ####create the new ratios complete_df_ratio_out<-complete_df_out%>% mutate(kick.ratio=K/K_sum, Marks.ratio=M/M_sum, handball.ratio=HB/HB_sum, Goals.ratio=G/G_sum, behinds.ratio=B/B_sum, hitouts.ratio=HO/HO_sum, tackles.ratio=T/T_sum, rebounds.ratio=R50/R50_sum, inside50s.ratio=I50/I50_sum, clearances.ratio=(CCL+SCL)/(CCL_sum+SCL_sum), clangers.ratio=CL/CL_sum, freefors.ratio=FF/FF_sum, freesagainst.ratio=FA/FA_sum, Contested.Possessions.ratio=CP/CP_sum, Uncontested.Possessions.ratio=UP/UP_sum, contested.marks.ratio=CM/CM_sum, marksinside50.ratio=MI5/MI5_sum, one.percenters.ratio=One.Percenters/One.Percenters_sum, bounces.ratio=BO/BO_sum, goal.assists.ratio=GA/GA_sum, disposals.ratio=D/D_sum) conforming<-complete_df_ratio_out%>% dplyr::select(Player, Date, Season, Round.x, Team, margin, kick.ratio:disposals.ratio) conforming$Brownlow.Votes<-0 out.sample=conforming newdata <- out.sample[ , -ncol(out.sample)] pre.dict <- predict(fm2,newdata=newdata, type='prob') pre.dict.m <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata))) colnames(pre.dict.m) <- c("vote.0", "vote.1", "vote.2", "vote.3") newdata.pred <- cbind.data.frame(newdata, pre.dict.m)
Expected Votes
A proportional odds model makes the players total probabilities for each voting category sum to 1 for every player that the player plays in.
That is for Dustin Martin in round 1 2017, \(1= \sum_{k=0}^3 p_{k}\) which makes sense.
But we know that there is another harder restriction. That is only one player can get 3 votes, only one player can get 2 votes and only one player can get the 1 vote.
Now lets add in our much tougher restriction. This will mean the probability of polling 3 votes will sum to 1 across all players. We will call this the std (standardised) votes.
#### Step 1: Get expected value on Votes newdata.pred$expected.votes <- newdata.pred$vote.1 + 2*newdata.pred$vote.2 + 3*newdata.pred$vote.3 ####Join back on matchID whoops! get_match_ID<-fitzRoy::player_stats xx<-get_match_ID%>%dplyr::select(Date, Player, Match_id) newdata.pred<-left_join(newdata.pred, xx, by=c("Date"="Date", "Player"="Player")) newdata.pred<-filter(newdata.pred, Date<"2017-09-01") sum1 <- aggregate(vote.1~Match_id, data = newdata.pred, FUN = sum ); names(sum1) <- c("Match_id", "sum.vote.1"); sum2 <- aggregate(vote.2~Match_id, data = newdata.pred, FUN = sum ); names(sum2) <- c("Match_id", "sum.vote.2"); sum3 <- aggregate(vote.3~Match_id, data = newdata.pred, FUN = sum ); names(sum3) <- c("Match_id", "sum.vote.3"); #### Step 3: Add sum of each vote by matchId to big table newdata.pred <- merge(newdata.pred, sum1, by = "Match_id") newdata.pred <- merge(newdata.pred, sum2, by = "Match_id") newdata.pred <- merge(newdata.pred, sum3, by = "Match_id") #### Step 4: Add std1/2/3 newdata.pred$std.1 <- (newdata.pred$sum.vote.1/newdata.pred$vote.1)^-1 newdata.pred$std.2 <- (newdata.pred$sum.vote.2/newdata.pred$vote.2)^-1 newdata.pred$std.3 <- (newdata.pred$sum.vote.3/newdata.pred$vote.3)^-1 #### Step 5: Expected standard game vote newdata.pred$exp_std_game_vote <- newdata.pred$std.1 + 2*newdata.pred$std.2 + 3*newdata.pred$std.3 #### Step 6: List of winners newdata.pred$PlayerName<-paste(newdata.pred$Player," ",newdata.pred$Team) winners.stdgame <- aggregate(exp_std_game_vote~PlayerName, data = newdata.pred, FUN = sum ); winners.stdgame <- winners.stdgame[order(-winners.stdgame$exp_std_game_vote), ] winners.stdgame[1:10, ] ## PlayerName exp_std_game_vote ## 168 Dustin Martin Richmond 34.26070 ## 487 Patrick Dangerfield Geelong 31.19050 ## 618 Tom Mitchell Hawthorn 27.40018 ## 157 Dayne Zorko Brisbane 17.68230 ## 344 Joshua Kelly GWS 16.82974 ## 510 Rory Laird Adelaide 15.98579 ## 652 Zachary Merrett Essendon 14.80721 ## 512 Rory Sloane Adelaide 14.53580 ## 375 Lance Franklin Sydney 14.02063 ## 464 Nathan Fyfe Fremantle 13.81357
But how good is this prediction?
We want our predicted order to be as close to correct as possible, especially for the top 10.
So lets use as a measure of accuracy ranking error which I will define as difference between actual place vs predicted place. For example, we have predicted Dustin Martin to finish first and he did so error is 0, if we predicted Dustin Martin to be runner up (like Champion Data did) then we would have a ranking error of 1.
library(tidyverse) data.frame(stringsAsFactors=FALSE, Predictor = c("anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "anoafl", "CD", "CD", "CD", "CD", "CD", "CD", "CD", "CD", "CD", "CD", "CD", "CD", "CD", "CD"), Player = c("Dustin Martin", "Patrick DangerField", "Tom Mitchell", "Dayne Zorko", "Josh Kelly", "Rory Laird", "Zach Merrett", "Rory Sloan", "Lance Franklin", "Nat Fyfe", "Trent Cotchin", "Clayton Oliver", "Callan Ward", "Marcus Bontempelli", "Patrick DangerField", "Dustin Martin", "Clayton Oliver", "Tom Mitchell", "Rory Sloan", "Callan Ward", "Zach Merrett", "Marcus Bontempelli", "Josh Kelly", "Trent Cotchin", "Rory Laird", "Lance Franklin", "Nat Fyfe", "Dayne Zorko"), Ranking.Error = c(0L, 0L, 0L, 12L, 1L, 38L, 5L, 1L, 4L, 2L, 11L, 28L, 8L, 5L, 1L, 1L, 23L, 1L, 2L, 22L, 5L, 0L, 3L, 41L, 30L, 6L, 4L, 10L) )%>%ggplot(aes(x=Ranking.Error, y=Player))+geom_point(aes(colour=Predictor), size=5)+ theme(axis.text.y=element_text(size=12)) +ylab("")
So what we can see here is that by taking just the basic data and making them into game ratios and thinking a bit more about probability and with little thought into variable selection (backwards) we can come up with a prediction that does just as well (if not a little better) as champion data (although obviously biased).
So hopefully now you get the idea, this isn’t meant to be the very best method, but hopefully it does encourage you to instead of using just backwards aic to select your variables you can do something else instead.
What changes to make?
Probably the most obvious one is the variable selection step.
Now you could simply enter in the variables you thought might have an impact on polling brownlow votes and to be honest this is a good way to do things. As a football fan you would hope that your domain knowledge does improve things.
So how to do that?
Lets say you think that only contested possessions, inside 50s and clearances matter after all it is a midfielders award.
fm2<-clm(Brownlow.Votes~ Contested.Possessions.ratio+ clearances.ratio + inside50s.ratio, data = in.sample)
Then you can run everything through as normal and see how your model goes at predicting.
Why use AIC, why not BIC?
That’s a good point and exactly the point I would like to make lots of these choices can be changed and depends on use. The whole point if this exercise was to show that relatively quickly and with only freely available data you can do well in comparison to say CD neural net.
But if you favour interpretable models, you might want to use BIC, it penalises model complexity more so you will end up with models that have less variables.
That step is also quick to change
fm2<- stepAIC(fm1, direction='backward',type=BIC)
Lets do 2018
library(MASS) library(ordinal) library(fitzRoy) library(tidyverse) df<-fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = Sys.Date()) ## Returning data from 1897-01-01 to 2018-09-20 ## Downloading data ## ## Finished downloading data. Processing XMLs ## Finished getting afltables data df<-df%>%filter(Season>2010) names(df) ## [1] "Season" "Round" ## [3] "Date" "Local.start.time" ## [5] "Venue" "Attendance" ## [7] "Home.team" "HQ1G" ## [9] "HQ1B" "HQ2G" ## [11] "HQ2B" "HQ3G" ## [13] "HQ3B" "HQ4G" ## [15] "HQ4B" "Home.score" ## [17] "Away.team" "AQ1G" ## [19] "AQ1B" "AQ2G" ## [21] "AQ2B" "AQ3G" ## [23] "AQ3B" "AQ4G" ## [25] "AQ4B" "Away.score" ## [27] "First.name" "Surname" ## [29] "ID" "Jumper.No." ## [31] "Playing.for" "Kicks" ## [33] "Marks" "Handballs" ## [35] "Goals" "Behinds" ## [37] "Hit.Outs" "Tackles" ## [39] "Rebounds" "Inside.50s" ## [41] "Clearances" "Clangers" ## [43] "Frees.For" "Frees.Against" ## [45] "Brownlow.Votes" "Contested.Possessions" ## [47] "Uncontested.Possessions" "Contested.Marks" ## [49] "Marks.Inside.50" "One.Percenters" ## [51] "Bounces" "Goal.Assists" ## [53] "Time.on.Ground.." "Substitute" ## [55] "Umpire.1" "Umpire.2" ## [57] "Umpire.3" "Umpire.4" ## [59] "group_id" team_stats<-df%>% dplyr::select(Date, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists)%>% group_by(Date, Season, Round, Playing.for)%>% summarise_if(is.numeric,funs(sum=c(sum(.)))) ## Adding missing grouping variables: `Home.team`, `Away.team` player_stats<-df%>% dplyr::select(Date, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists) ## Adding missing grouping variables: `Home.team`, `Away.team` complete_df<-left_join(player_stats,team_stats, by=c("Date"="Date", "Season"="Season", "Playing.for"="Playing.for")) dataset_scores<-fitzRoy::match_results names(dataset_scores) ## [1] "Game" "Date" "Round" "Home.Team" ## [5] "Home.Goals" "Home.Behinds" "Home.Points" "Away.Team" ## [9] "Away.Goals" "Away.Behinds" "Away.Points" "Venue" ## [13] "Margin" "Season" "Round.Type" "Round.Number" dataset_scores1<-dataset_scores%>%dplyr::select (Date, Round, Home.Team, Home.Points,Game) dataset_scores2<-dplyr::select(dataset_scores, Date, Round, Away.Team, Away.Points,Game) colnames(dataset_scores1)[3]<-"Team" colnames(dataset_scores1)[4]<-"Points" colnames(dataset_scores2)[3]<-"Team" colnames(dataset_scores2)[4]<-"Points" df5<-rbind(dataset_scores1,dataset_scores2) dataset_margins<-df5%>%group_by(Game)%>% arrange(Game)%>% mutate(margin=c(-diff(Points),diff(Points))) # View(dataset_margins) dataset_margins$Date<-as.Date(dataset_margins$Date) complete_df$Date<-as.Date(complete_df$Date) complete_df<-left_join(complete_df,dataset_margins,by=c("Date"="Date", "Playing.for"="Team")) complete_df_ratio<-complete_df%>% mutate(kick.ratio=Kicks/Kicks_sum, Marks.ratio=Marks/Marks_sum, handball.ratio=Handballs/Handballs_sum, Goals.ratio=Goals/Goals_sum, behinds.ratio=Behinds/Behinds_sum, hitouts.ratio=Hit.Outs/Hit.Outs_sum, tackles.ratio=Tackles/Tackles_sum, rebounds.ratio=Rebounds/Rebounds_sum, inside50s.ratio=Inside.50s/Inside.50s_sum, clearances.ratio=Clearances/Clearances_sum, clangers.ratio=Clangers/Clangers_sum, freefors.ratio=Frees.For/Frees.For_sum, freesagainst.ratio=Frees.Against/Frees.Against_sum, Contested.Possessions.ratio=Contested.Possessions/Contested.Possessions_sum, Uncontested.Possessions.ratio=Uncontested.Possessions/Uncontested.Possessions_sum, contested.marks.ratio=Contested.Marks/Contested.Marks_sum, marksinside50.ratio=Marks.Inside.50/Marks.Inside.50_sum, one.percenters.ratio=One.Percenters/One.Percenters_sum, bounces.ratio=Bounces/Bounces_sum, goal.assists.ratio=Goal.Assists/Goal.Assists_sum, disposals.ratio=(Kicks+Handballs)/(Kicks_sum+Handballs_sum)) df<-complete_df_ratio%>%dplyr::select(Date,Game, First.name, Surname, Season, Round.x, Playing.for,-Brownlow.Votes, Brownlow.Votes_sum,everything()) df<-df%>%dplyr::select(-Brownlow.Votes,everything()) df[is.na(df)] <- 0 in.sample <- subset(df, Season %in% c(2014:2017)) # out.sample <- subset(df, Season == 2014) in.sample$Brownlow.Votes <- factor(in.sample$Brownlow.Votes) in.sample<-in.sample%>%filter(Round.x %in% c("1","2","3","4","5","6","7","8", "9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24")) names(in.sample) ## [1] "Date" "Game" ## [3] "First.name" "Surname" ## [5] "Season" "Round.x" ## [7] "Playing.for" "Brownlow.Votes_sum" ## [9] "Home.team" "Away.team" ## [11] "Kicks" "Marks" ## [13] "Handballs" "Goals" ## [15] "Behinds" "Hit.Outs" ## [17] "Tackles" "Rebounds" ## [19] "Inside.50s" "Clearances" ## [21] "Clangers" "Frees.For" ## [23] "Frees.Against" "Contested.Possessions" ## [25] "Uncontested.Possessions" "Contested.Marks" ## [27] "Marks.Inside.50" "One.Percenters" ## [29] "Bounces" "Goal.Assists" ## [31] "Round.y" "Kicks_sum" ## [33] "Marks_sum" "Handballs_sum" ## [35] "Goals_sum" "Behinds_sum" ## [37] "Hit.Outs_sum" "Tackles_sum" ## [39] "Rebounds_sum" "Inside.50s_sum" ## [41] "Clearances_sum" "Clangers_sum" ## [43] "Frees.For_sum" "Frees.Against_sum" ## [45] "Contested.Possessions_sum" "Uncontested.Possessions_sum" ## [47] "Contested.Marks_sum" "Marks.Inside.50_sum" ## [49] "One.Percenters_sum" "Bounces_sum" ## [51] "Goal.Assists_sum" "Round" ## [53] "Points" "margin" ## [55] "kick.ratio" "Marks.ratio" ## [57] "handball.ratio" "Goals.ratio" ## [59] "behinds.ratio" "hitouts.ratio" ## [61] "tackles.ratio" "rebounds.ratio" ## [63] "inside50s.ratio" "clearances.ratio" ## [65] "clangers.ratio" "freefors.ratio" ## [67] "freesagainst.ratio" "Contested.Possessions.ratio" ## [69] "Uncontested.Possessions.ratio" "contested.marks.ratio" ## [71] "marksinside50.ratio" "one.percenters.ratio" ## [73] "bounces.ratio" "goal.assists.ratio" ## [75] "disposals.ratio" "Brownlow.Votes" in.sample$Player<-paste(in.sample$First.name,in.sample$Surname) in.sample<-in.sample%>%dplyr::select(Player, Date, Season, Round.x, Playing.for, margin:Brownlow.Votes) ## Adding missing grouping variables: `Home.team`, `Away.team` # in.sample<-in.sample[-c(1,2)] summary(in.sample) ## Home.team Away.team Player ## Length:34804 Length:34804 Length:34804 ## Class :character Class :character Class :character ## Mode :character Mode :character Mode :character ## ## ## ## Date Season Round.x Playing.for ## Min. :2014-03-14 Min. :2014 Length:34804 Length:34804 ## 1st Qu.:2014-08-31 1st Qu.:2014 Class :character Class :character ## Median :2016-03-24 Median :2016 Mode :character Mode :character ## Mean :2015-12-12 Mean :2016 ## 3rd Qu.:2017-03-23 3rd Qu.:2017 ## Max. :2017-08-27 Max. :2017 ## margin kick.ratio Marks.ratio handball.ratio ## Min. :-145.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000 ## 1st Qu.: -26.0000 1st Qu.:0.02985 1st Qu.:0.02597 1st Qu.:0.02646 ## Median : 0.0000 Median :0.04367 Median :0.04274 Median :0.04132 ## Mean : -0.3173 Mean :0.04545 Mean :0.04545 Mean :0.04545 ## 3rd Qu.: 26.0000 3rd Qu.:0.05914 3rd Qu.:0.06173 3rd Qu.:0.06015 ## Max. : 145.0000 Max. :0.15976 Max. :0.19444 Max. :0.19820 ## Goals.ratio behinds.ratio hitouts.ratio tackles.ratio ## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000 ## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.01923 ## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.03947 ## Mean :0.04545 Mean :0.04545 Mean :0.04545 Mean :0.04545 ## 3rd Qu.:0.07692 3rd Qu.:0.08333 3rd Qu.:0.00000 3rd Qu.:0.06383 ## Max. :0.71429 Max. :1.00000 Max. :1.00000 Max. :0.27273 ## rebounds.ratio inside50s.ratio clearances.ratio clangers.ratio ## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000 ## 1st Qu.:0.00000 1st Qu.:0.01887 1st Qu.:0.00000 1st Qu.:0.02128 ## Median :0.03226 Median :0.04000 Median :0.02632 Median :0.04167 ## Mean :0.04545 Mean :0.04545 Mean :0.04545 Mean :0.04545 ## 3rd Qu.:0.07143 3rd Qu.:0.06667 3rd Qu.:0.06818 3rd Qu.:0.06522 ## Max. :0.40909 Max. :0.29412 Max. :0.46154 Max. :0.25000 ## freefors.ratio freesagainst.ratio Contested.Possessions.ratio ## Min. :0.00000 Min. :0.00000 Min. :0.00000 ## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.02703 ## Median :0.04167 Median :0.04167 Median :0.04110 ## Mean :0.04545 Mean :0.04545 Mean :0.04545 ## 3rd Qu.:0.07143 3rd Qu.:0.07143 3rd Qu.:0.05932 ## Max. :0.50000 Max. :0.50000 Max. :0.20800 ## Uncontested.Possessions.ratio contested.marks.ratio marksinside50.ratio ## Min. :0.00000 Min. :0.00000 Min. :0.00000 ## 1st Qu.:0.02899 1st Qu.:0.00000 1st Qu.:0.00000 ## Median :0.04348 Median :0.00000 Median :0.00000 ## Mean :0.04545 Mean :0.04543 Mean :0.04545 ## 3rd Qu.:0.06000 3rd Qu.:0.08333 3rd Qu.:0.07143 ## Max. :0.15287 Max. :1.00000 Max. :1.00000 ## one.percenters.ratio bounces.ratio goal.assists.ratio ## Min. :0.00000 Min. :0.00000 Min. :0.00000 ## 1st Qu.:0.01471 1st Qu.:0.00000 1st Qu.:0.00000 ## Median :0.03125 Median :0.00000 Median :0.00000 ## Mean :0.04545 Mean :0.04454 Mean :0.04545 ## 3rd Qu.:0.06452 3rd Qu.:0.00000 3rd Qu.:0.08333 ## Max. :0.40541 Max. :1.00000 Max. :1.00000 ## disposals.ratio Brownlow.Votes ## Min. :0.00000 0:32431 ## 1st Qu.:0.03161 1: 791 ## Median :0.04323 2: 791 ## Mean :0.04545 3: 791 ## 3rd Qu.:0.05788 ## Max. :0.14196 fm1<-clm(Brownlow.Votes~ handball.ratio + Marks.ratio + disposals.ratio+ hitouts.ratio+ freefors.ratio + freesagainst.ratio + tackles.ratio + Goals.ratio + behinds.ratio + Contested.Possessions.ratio+ Uncontested.Possessions.ratio + clangers.ratio + contested.marks.ratio + marksinside50.ratio + clearances.ratio + rebounds.ratio + inside50s.ratio + one.percenters.ratio + bounces.ratio+ goal.assists.ratio +margin, data = in.sample) fm2<- stepAIC(fm1, direction='backward',type=AIC) ## Start: AIC=13509.69 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + behinds.ratio + Contested.Possessions.ratio + ## Uncontested.Possessions.ratio + clangers.ratio + contested.marks.ratio + ## marksinside50.ratio + clearances.ratio + rebounds.ratio + ## inside50s.ratio + one.percenters.ratio + bounces.ratio + ## goal.assists.ratio + margin ## ## Df AIC ## - Uncontested.Possessions.ratio 1 13508 ## <none> 13510 ## - goal.assists.ratio 1 13510 ## - inside50s.ratio 1 13510 ## - behinds.ratio 1 13510 ## - rebounds.ratio 1 13511 ## - freesagainst.ratio 1 13514 ## - Contested.Possessions.ratio 1 13514 ## - clangers.ratio 1 13518 ## - clearances.ratio 1 13520 ## - bounces.ratio 1 13522 ## - marksinside50.ratio 1 13525 ## - freefors.ratio 1 13525 ## - handball.ratio 1 13529 ## - contested.marks.ratio 1 13538 ## - Marks.ratio 1 13538 ## - one.percenters.ratio 1 13548 ## - tackles.ratio 1 13585 ## - hitouts.ratio 1 13687 ## - disposals.ratio 1 13738 ## - Goals.ratio 1 14490 ## - margin 1 15524 ## ## Step: AIC=13508.46 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + behinds.ratio + Contested.Possessions.ratio + ## clangers.ratio + contested.marks.ratio + marksinside50.ratio + ## clearances.ratio + rebounds.ratio + inside50s.ratio + one.percenters.ratio + ## bounces.ratio + goal.assists.ratio + margin ## ## Df AIC ## <none> 13508 ## - inside50s.ratio 1 13509 ## - goal.assists.ratio 1 13509 ## - behinds.ratio 1 13509 ## - rebounds.ratio 1 13510 ## - freesagainst.ratio 1 13512 ## - clangers.ratio 1 13517 ## - clearances.ratio 1 13520 ## - bounces.ratio 1 13520 ## - freefors.ratio 1 13524 ## - marksinside50.ratio 1 13524 ## - handball.ratio 1 13529 ## - Contested.Possessions.ratio 1 13535 ## - Marks.ratio 1 13536 ## - contested.marks.ratio 1 13538 ## - one.percenters.ratio 1 13546 ## - tackles.ratio 1 13584 ## - hitouts.ratio 1 13687 ## - disposals.ratio 1 14403 ## - Goals.ratio 1 14489 ## - margin 1 15523 names(fitzRoy::player_stats) ## [1] "Date" "Season" "Round" "Venue" ## [5] "Player" "Team" "Opposition" "Status" ## [9] "Match_id" "CP" "UP" "ED" ## [13] "DE" "CM" "GA" "MI5" ## [17] "One.Percenters" "BO" "TOG" "K" ## [21] "HB" "D" "M" "G" ## [25] "B" "T" "HO" "GA1" ## [29] "I50" "CL" "CG" "R50" ## [33] "FF" "FA" "AF" "SC" ## [37] "CCL" "SCL" "SI" "MG" ## [41] "TO" "ITC" "T5" df_2017<-fitzRoy::get_footywire_stats(9514:9709) ## Getting data from footywire.com ## Finished getting data team_stats_out<-df_2017%>% dplyr::select(Date, Player,Season, Round, Team, CP:SC)%>% group_by(Date,Season, Round, Team)%>% summarise_if(is.numeric,funs(sum=c(sum(.)))) player_stats_out<-df_2017%>% dplyr::select(Date, Player,Season, Round, Team, CP:SC) complete_df_out<-left_join(player_stats_out,team_stats_out, by=c("Date"="Date", "Season"="Season", "Team"="Team")) dataset_scores<-fitzRoy::match_results names(dataset_scores) ## [1] "Game" "Date" "Round" "Home.Team" ## [5] "Home.Goals" "Home.Behinds" "Home.Points" "Away.Team" ## [9] "Away.Goals" "Away.Behinds" "Away.Points" "Venue" ## [13] "Margin" "Season" "Round.Type" "Round.Number" dataset_scores1<-dataset_scores%>%dplyr::select (Date, Round, Home.Team, Home.Points,Game) dataset_scores2<-dplyr::select(dataset_scores, Date, Round, Away.Team, Away.Points,Game) colnames(dataset_scores1)[3]<-"Team" colnames(dataset_scores1)[4]<-"Points" colnames(dataset_scores2)[3]<-"Team" colnames(dataset_scores2)[4]<-"Points" df5<-rbind(dataset_scores1,dataset_scores2) dataset_margins<-df5%>%group_by(Game)%>% arrange(Game)%>% mutate(margin=c(-diff(Points),diff(Points))) dataset_margins$Date<-as.Date(dataset_margins$Date) complete_df_out$Date<-as.Date(complete_df_out$Date) dataset_margins<-dataset_margins %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane")) dataset_margins<-dataset_margins %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs")) complete_df_out<-left_join(complete_df_out,dataset_margins,by=c("Date"="Date", "Team"="Team")) names(complete_df_out) ## [1] "Date" "Player" "Season" ## [4] "Round.x" "Team" "CP" ## [7] "UP" "ED" "DE" ## [10] "CM" "GA" "MI5" ## [13] "One.Percenters" "BO" "CCL" ## [16] "SCL" "SI" "MG" ## [19] "TO" "ITC" "T5" ## [22] "TOG" "K" "HB" ## [25] "D" "M" "G" ## [28] "B" "T" "HO" ## [31] "GA1" "I50" "CL" ## [34] "CG" "R50" "FF" ## [37] "FA" "AF" "SC" ## [40] "Round.y" "CP_sum" "UP_sum" ## [43] "ED_sum" "DE_sum" "CM_sum" ## [46] "GA_sum" "MI5_sum" "One.Percenters_sum" ## [49] "BO_sum" "CCL_sum" "SCL_sum" ## [52] "SI_sum" "MG_sum" "TO_sum" ## [55] "ITC_sum" "T5_sum" "TOG_sum" ## [58] "K_sum" "HB_sum" "D_sum" ## [61] "M_sum" "G_sum" "B_sum" ## [64] "T_sum" "HO_sum" "GA1_sum" ## [67] "I50_sum" "CL_sum" "CG_sum" ## [70] "R50_sum" "FF_sum" "FA_sum" ## [73] "AF_sum" "SC_sum" "Round" ## [76] "Points" "Game" "margin" ####create the new ratios complete_df_ratio_out<-complete_df_out%>% mutate(kick.ratio=K/K_sum, Marks.ratio=M/M_sum, handball.ratio=HB/HB_sum, Goals.ratio=G/G_sum, behinds.ratio=B/B_sum, hitouts.ratio=HO/HO_sum, tackles.ratio=T/T_sum, rebounds.ratio=R50/R50_sum, inside50s.ratio=I50/I50_sum, clearances.ratio=(CCL+SCL)/(CCL_sum+SCL_sum), clangers.ratio=CL/CL_sum, freefors.ratio=FF/FF_sum, freesagainst.ratio=FA/FA_sum, Contested.Possessions.ratio=CP/CP_sum, Uncontested.Possessions.ratio=UP/UP_sum, contested.marks.ratio=CM/CM_sum, marksinside50.ratio=MI5/MI5_sum, one.percenters.ratio=One.Percenters/One.Percenters_sum, bounces.ratio=BO/BO_sum, goal.assists.ratio=GA/GA_sum, disposals.ratio=D/D_sum) conforming<-complete_df_ratio_out%>% dplyr::select(Player, Date, Season, Round.x, Team, margin, kick.ratio:disposals.ratio) conforming$Brownlow.Votes<-0 out.sample=conforming out.sample[is.na(out.sample)] <- 0 newdata <- out.sample[ , -ncol(out.sample)] pre.dict <- predict(fm2,newdata=newdata, type='prob') pre.dict.m <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata))) colnames(pre.dict.m) <- c("vote.0", "vote.1", "vote.2", "vote.3") newdata.pred <- cbind.data.frame(newdata, pre.dict.m) #### Step 1: Get expected value on Votes newdata.pred$expected.votes <- newdata.pred$vote.1 + 2*newdata.pred$vote.2 + 3*newdata.pred$vote.3 ####Join back on matchID whoops! get_match_ID<-fitzRoy::player_stats xx<-get_match_ID%>%dplyr::select(Date, Player, Match_id) newdata.pred<-left_join(newdata.pred, xx, by=c("Date"="Date", "Player"="Player")) # newdata.pred<-filter(newdata.pred, Date<"2018-09-01") sum1 <- aggregate(vote.1~Match_id, data = newdata.pred, FUN = sum ); names(sum1) <- c("Match_id", "sum.vote.1"); sum2 <- aggregate(vote.2~Match_id, data = newdata.pred, FUN = sum ); names(sum2) <- c("Match_id", "sum.vote.2"); sum3 <- aggregate(vote.3~Match_id, data = newdata.pred, FUN = sum ); names(sum3) <- c("Match_id", "sum.vote.3"); #### Step 3: Add sum of each vote by matchId to big table newdata.pred <- merge(newdata.pred, sum1, by = "Match_id") newdata.pred <- merge(newdata.pred, sum2, by = "Match_id") newdata.pred <- merge(newdata.pred, sum3, by = "Match_id") #### Step 4: Add std1/2/3 newdata.pred$std.1 <- (newdata.pred$sum.vote.1/newdata.pred$vote.1)^-1 newdata.pred$std.2 <- (newdata.pred$sum.vote.2/newdata.pred$vote.2)^-1 newdata.pred$std.3 <- (newdata.pred$sum.vote.3/newdata.pred$vote.3)^-1 #### Step 5: Expected standard game vote newdata.pred$exp_std_game_vote <- newdata.pred$std.1 + 2*newdata.pred$std.2 + 3*newdata.pred$std.3 #### Step 6: List of winners newdata.pred$PlayerName<-paste(newdata.pred$Player," ",newdata.pred$Team) winners.stdgame <- aggregate(exp_std_game_vote~PlayerName, data = newdata.pred, FUN = sum ); winners.stdgame <- winners.stdgame[order(-winners.stdgame$exp_std_game_vote), ] winners.stdgame[1:10, ] ## PlayerName exp_std_game_vote ## 623 Tom Mitchell Hawthorn 30.71782 ## 179 Dustin Martin Richmond 19.29637 ## 495 Patrick Dangerfield Geelong 18.33302 ## 30 Andrew Gaff West Coast 16.92758 ## 169 Dayne Beams Brisbane 16.86057 ## 119 Clayton Oliver Melbourne 15.68180 ## 193 Elliot Yeo West Coast 15.65408 ## 467 Nathan Fyfe Fremantle 15.36249 ## 254 Jackson Macrae Western Bulldogs 15.15357 ## 382 Lance Franklin Sydney 14.14637
There you go, using the same method that gave a better predicted order vs Champion Data last year is now yours free and fully reproducible. (Find me another blog that does that for you!)
Betting tips
To evaluate the model or a model first lets build a quick one using the above code for 2017 and evaluate it according to the rough metrics that the Brownlow Professor uses.
library(fitzRoy) library(tidyverse) df<-fitzRoy::get_afltables_stats(start_date = "1897-01-01", end_date = Sys.Date()) ## Returning data from 1897-01-01 to 2018-09-20 ## Downloading data ## ## Finished downloading data. Processing XMLs ## Finished getting afltables data df<-df%>%filter(Season>2010) names(df) ## [1] "Season" "Round" ## [3] "Date" "Local.start.time" ## [5] "Venue" "Attendance" ## [7] "Home.team" "HQ1G" ## [9] "HQ1B" "HQ2G" ## [11] "HQ2B" "HQ3G" ## [13] "HQ3B" "HQ4G" ## [15] "HQ4B" "Home.score" ## [17] "Away.team" "AQ1G" ## [19] "AQ1B" "AQ2G" ## [21] "AQ2B" "AQ3G" ## [23] "AQ3B" "AQ4G" ## [25] "AQ4B" "Away.score" ## [27] "First.name" "Surname" ## [29] "ID" "Jumper.No." ## [31] "Playing.for" "Kicks" ## [33] "Marks" "Handballs" ## [35] "Goals" "Behinds" ## [37] "Hit.Outs" "Tackles" ## [39] "Rebounds" "Inside.50s" ## [41] "Clearances" "Clangers" ## [43] "Frees.For" "Frees.Against" ## [45] "Brownlow.Votes" "Contested.Possessions" ## [47] "Uncontested.Possessions" "Contested.Marks" ## [49] "Marks.Inside.50" "One.Percenters" ## [51] "Bounces" "Goal.Assists" ## [53] "Time.on.Ground.." "Substitute" ## [55] "Umpire.1" "Umpire.2" ## [57] "Umpire.3" "Umpire.4" ## [59] "group_id" team_stats<-df%>% dplyr::select(Date,Home.team, Away.team, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists)%>% group_by(Date, Season, Round, Playing.for)%>% summarise_if(is.numeric,funs(sum=c(sum(.)))) player_stats<-df%>% dplyr::select(Date, First.name,Surname,Season, Round, Playing.for, Kicks:Goal.Assists) ## Adding missing grouping variables: `Home.team`, `Away.team` complete_df<-left_join(player_stats,team_stats, by=c("Date"="Date", "Season"="Season", "Playing.for"="Playing.for")) #but we also need margins as per honours stuff dataset_scores<-fitzRoy::match_results names(dataset_scores) ## [1] "Game" "Date" "Round" "Home.Team" ## [5] "Home.Goals" "Home.Behinds" "Home.Points" "Away.Team" ## [9] "Away.Goals" "Away.Behinds" "Away.Points" "Venue" ## [13] "Margin" "Season" "Round.Type" "Round.Number" dataset_scores1<-dataset_scores%>%dplyr::select (Date, Round, Home.Team, Home.Points,Game) dataset_scores2<-dplyr::select(dataset_scores, Date, Round, Away.Team, Away.Points,Game) colnames(dataset_scores1)[3]<-"Team" colnames(dataset_scores1)[4]<-"Points" colnames(dataset_scores2)[3]<-"Team" colnames(dataset_scores2)[4]<-"Points" df5<-rbind(dataset_scores1,dataset_scores2) dataset_margins<-df5%>%group_by(Game)%>% arrange(Game)%>% mutate(margin=c(-diff(Points),diff(Points))) # View(dataset_margins) dataset_margins$Date<-as.Date(dataset_margins$Date) complete_df$Date<-as.Date(complete_df$Date) complete_df<-left_join(complete_df,dataset_margins,by=c("Date"="Date", "Playing.for"="Team")) complete_df_ratio<-complete_df%>% mutate(kick.ratio=Kicks/Kicks_sum, Marks.ratio=Marks/Marks_sum, handball.ratio=Handballs/Handballs_sum, Goals.ratio=Goals/Goals_sum, behinds.ratio=Behinds/Behinds_sum, hitouts.ratio=Hit.Outs/Hit.Outs_sum, tackles.ratio=Tackles/Tackles_sum, rebounds.ratio=Rebounds/Rebounds_sum, inside50s.ratio=Inside.50s/Inside.50s_sum, clearances.ratio=Clearances/Clearances_sum, clangers.ratio=Clangers/Clangers_sum, freefors.ratio=Frees.For/Frees.For_sum, freesagainst.ratio=Frees.Against/Frees.Against_sum, Contested.Possessions.ratio=Contested.Possessions/Contested.Possessions_sum, Uncontested.Possessions.ratio=Uncontested.Possessions/Uncontested.Possessions_sum, contested.marks.ratio=Contested.Marks/Contested.Marks_sum, marksinside50.ratio=Marks.Inside.50/Marks.Inside.50_sum, one.percenters.ratio=One.Percenters/One.Percenters_sum, bounces.ratio=Bounces/Bounces_sum, goal.assists.ratio=Goal.Assists/Goal.Assists_sum, disposals.ratio=(Kicks+Handballs)/(Kicks_sum+Handballs_sum)) df<-complete_df_ratio%>%dplyr::select(Date, First.name, Surname, Season, Round.x, Playing.for,-Brownlow.Votes, Brownlow.Votes_sum,everything()) df<-df%>%dplyr::select(-Brownlow.Votes,everything()) df[is.na(df)] <- 0 in.sample <- subset(df, Season %in% c(2013:2016)) # out.sample <- subset(df, Season == 2014) in.sample$Brownlow.Votes <- factor(in.sample$Brownlow.Votes) in.sample<-in.sample%>%filter(Round.x %in% c("1","2","3","4","5","6","7","8", "9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24")) in.sample$Player<-paste(in.sample$First.name,in.sample$Surname) in.sample<-in.sample%>%dplyr::select(Player, Date, Season, Round.x, Playing.for, Home.team, Away.team, margin:Brownlow.Votes) temp1<-scale(in.sample[,8:29]) in.sample[,8:29]<-temp1 #attributes(temp1) temp1.center<-attr(temp1,"scaled:center") temp1.scale<-attr(temp1,"scaled:scale") library(ordinal) fm1<-clm(Brownlow.Votes~ handball.ratio + Marks.ratio + disposals.ratio+ hitouts.ratio+ freefors.ratio + freesagainst.ratio + tackles.ratio + Goals.ratio + behinds.ratio + Contested.Possessions.ratio+ Uncontested.Possessions.ratio + clangers.ratio + contested.marks.ratio + marksinside50.ratio + clearances.ratio + rebounds.ratio + inside50s.ratio + one.percenters.ratio + bounces.ratio+ goal.assists.ratio +margin, data = in.sample) library(MASS) fm2<- stepAIC(fm1, direction='backward',type=AIC) ## Start: AIC=13631.84 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + behinds.ratio + Contested.Possessions.ratio + ## Uncontested.Possessions.ratio + clangers.ratio + contested.marks.ratio + ## marksinside50.ratio + clearances.ratio + rebounds.ratio + ## inside50s.ratio + one.percenters.ratio + bounces.ratio + ## goal.assists.ratio + margin ## ## Df AIC ## - Uncontested.Possessions.ratio 1 13630 ## - rebounds.ratio 1 13630 ## - behinds.ratio 1 13631 ## - goal.assists.ratio 1 13631 ## - inside50s.ratio 1 13631 ## <none> 13632 ## - freesagainst.ratio 1 13638 ## - Contested.Possessions.ratio 1 13639 ## - clearances.ratio 1 13639 ## - freefors.ratio 1 13641 ## - bounces.ratio 1 13647 ## - clangers.ratio 1 13650 ## - marksinside50.ratio 1 13650 ## - Marks.ratio 1 13650 ## - handball.ratio 1 13653 ## - contested.marks.ratio 1 13656 ## - one.percenters.ratio 1 13669 ## - tackles.ratio 1 13690 ## - hitouts.ratio 1 13790 ## - disposals.ratio 1 13863 ## - Goals.ratio 1 14583 ## - margin 1 15649 ## ## Step: AIC=13630.1 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + behinds.ratio + Contested.Possessions.ratio + ## clangers.ratio + contested.marks.ratio + marksinside50.ratio + ## clearances.ratio + rebounds.ratio + inside50s.ratio + one.percenters.ratio + ## bounces.ratio + goal.assists.ratio + margin ## ## Df AIC ## - rebounds.ratio 1 13628 ## - behinds.ratio 1 13629 ## - goal.assists.ratio 1 13629 ## - inside50s.ratio 1 13629 ## <none> 13630 ## - freesagainst.ratio 1 13636 ## - clearances.ratio 1 13638 ## - freefors.ratio 1 13639 ## - bounces.ratio 1 13645 ## - clangers.ratio 1 13648 ## - Marks.ratio 1 13648 ## - marksinside50.ratio 1 13648 ## - handball.ratio 1 13652 ## - contested.marks.ratio 1 13655 ## - Contested.Possessions.ratio 1 13661 ## - one.percenters.ratio 1 13667 ## - tackles.ratio 1 13688 ## - hitouts.ratio 1 13789 ## - Goals.ratio 1 14581 ## - disposals.ratio 1 14601 ## - margin 1 15647 ## ## Step: AIC=13628.54 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + behinds.ratio + Contested.Possessions.ratio + ## clangers.ratio + contested.marks.ratio + marksinside50.ratio + ## clearances.ratio + inside50s.ratio + one.percenters.ratio + ## bounces.ratio + goal.assists.ratio + margin ## ## Df AIC ## - behinds.ratio 1 13627 ## - goal.assists.ratio 1 13627 ## - inside50s.ratio 1 13628 ## <none> 13628 ## - freesagainst.ratio 1 13635 ## - clearances.ratio 1 13636 ## - freefors.ratio 1 13638 ## - bounces.ratio 1 13644 ## - marksinside50.ratio 1 13646 ## - clangers.ratio 1 13646 ## - Marks.ratio 1 13647 ## - contested.marks.ratio 1 13654 ## - handball.ratio 1 13655 ## - Contested.Possessions.ratio 1 13660 ## - one.percenters.ratio 1 13668 ## - tackles.ratio 1 13686 ## - hitouts.ratio 1 13787 ## - Goals.ratio 1 14591 ## - disposals.ratio 1 14799 ## - margin 1 15645 ## ## Step: AIC=13627.36 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + Contested.Possessions.ratio + clangers.ratio + ## contested.marks.ratio + marksinside50.ratio + clearances.ratio + ## inside50s.ratio + one.percenters.ratio + bounces.ratio + ## goal.assists.ratio + margin ## ## Df AIC ## - goal.assists.ratio 1 13626 ## - inside50s.ratio 1 13627 ## <none> 13627 ## - freesagainst.ratio 1 13634 ## - clearances.ratio 1 13634 ## - freefors.ratio 1 13637 ## - bounces.ratio 1 13643 ## - Marks.ratio 1 13646 ## - clangers.ratio 1 13646 ## - marksinside50.ratio 1 13651 ## - contested.marks.ratio 1 13652 ## - handball.ratio 1 13656 ## - Contested.Possessions.ratio 1 13660 ## - one.percenters.ratio 1 13666 ## - tackles.ratio 1 13685 ## - hitouts.ratio 1 13786 ## - Goals.ratio 1 14591 ## - disposals.ratio 1 14806 ## - margin 1 15646 ## ## Step: AIC=13626.24 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + Contested.Possessions.ratio + clangers.ratio + ## contested.marks.ratio + marksinside50.ratio + clearances.ratio + ## inside50s.ratio + one.percenters.ratio + bounces.ratio + ## margin ## ## Df AIC ## - inside50s.ratio 1 13626 ## <none> 13626 ## - clearances.ratio 1 13633 ## - freesagainst.ratio 1 13633 ## - freefors.ratio 1 13636 ## - bounces.ratio 1 13642 ## - Marks.ratio 1 13644 ## - clangers.ratio 1 13645 ## - marksinside50.ratio 1 13651 ## - contested.marks.ratio 1 13651 ## - handball.ratio 1 13654 ## - Contested.Possessions.ratio 1 13659 ## - one.percenters.ratio 1 13664 ## - tackles.ratio 1 13684 ## - hitouts.ratio 1 13784 ## - Goals.ratio 1 14590 ## - disposals.ratio 1 14804 ## - margin 1 15645 ## ## Step: AIC=13625.93 ## Brownlow.Votes ~ handball.ratio + Marks.ratio + disposals.ratio + ## hitouts.ratio + freefors.ratio + freesagainst.ratio + tackles.ratio + ## Goals.ratio + Contested.Possessions.ratio + clangers.ratio + ## contested.marks.ratio + marksinside50.ratio + clearances.ratio + ## one.percenters.ratio + bounces.ratio + margin ## ## Df AIC ## <none> 13626 ## - freesagainst.ratio 1 13633 ## - clearances.ratio 1 13634 ## - freefors.ratio 1 13635 ## - bounces.ratio 1 13642 ## - Marks.ratio 1 13643 ## - clangers.ratio 1 13644 ## - marksinside50.ratio 1 13650 ## - contested.marks.ratio 1 13650 ## - Contested.Possessions.ratio 1 13658 ## - handball.ratio 1 13662 ## - one.percenters.ratio 1 13663 ## - tackles.ratio 1 13683 ## - hitouts.ratio 1 13784 ## - Goals.ratio 1 14600 ## - disposals.ratio 1 15042 ## - margin 1 15645 out.sample <- subset(df, Season %in% c(2017)) out.sample$Brownlow.Votes <- factor(out.sample$Brownlow.Votes) out.sample<-out.sample%>%filter(Round.x %in% c("1","2","3","4","5","6","7","8", "9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24")) names(out.sample) ## [1] "Date" "First.name" ## [3] "Surname" "Season" ## [5] "Round.x" "Playing.for" ## [7] "Brownlow.Votes_sum" "Home.team" ## [9] "Away.team" "Kicks" ## [11] "Marks" "Handballs" ## [13] "Goals" "Behinds" ## [15] "Hit.Outs" "Tackles" ## [17] "Rebounds" "Inside.50s" ## [19] "Clearances" "Clangers" ## [21] "Frees.For" "Frees.Against" ## [23] "Contested.Possessions" "Uncontested.Possessions" ## [25] "Contested.Marks" "Marks.Inside.50" ## [27] "One.Percenters" "Bounces" ## [29] "Goal.Assists" "Round.y" ## [31] "Kicks_sum" "Marks_sum" ## [33] "Handballs_sum" "Goals_sum" ## [35] "Behinds_sum" "Hit.Outs_sum" ## [37] "Tackles_sum" "Rebounds_sum" ## [39] "Inside.50s_sum" "Clearances_sum" ## [41] "Clangers_sum" "Frees.For_sum" ## [43] "Frees.Against_sum" "Contested.Possessions_sum" ## [45] "Uncontested.Possessions_sum" "Contested.Marks_sum" ## [47] "Marks.Inside.50_sum" "One.Percenters_sum" ## [49] "Bounces_sum" "Goal.Assists_sum" ## [51] "Round" "Points" ## [53] "Game" "margin" ## [55] "kick.ratio" "Marks.ratio" ## [57] "handball.ratio" "Goals.ratio" ## [59] "behinds.ratio" "hitouts.ratio" ## [61] "tackles.ratio" "rebounds.ratio" ## [63] "inside50s.ratio" "clearances.ratio" ## [65] "clangers.ratio" "freefors.ratio" ## [67] "freesagainst.ratio" "Contested.Possessions.ratio" ## [69] "Uncontested.Possessions.ratio" "contested.marks.ratio" ## [71] "marksinside50.ratio" "one.percenters.ratio" ## [73] "bounces.ratio" "goal.assists.ratio" ## [75] "disposals.ratio" "Brownlow.Votes" out.sample$Player<-paste(out.sample$First.name,out.sample$Surname) out.sample<-out.sample%>%dplyr::select(Player, Date, Season, Round.x, Playing.for, Home.team, Away.team, margin:Brownlow.Votes) out.sample[is.na(out.sample)] <- 0 newdata <- out.sample[ , -ncol(out.sample)] newdata[,8:29]<-scale(newdata[,8:29],center=temp1.center,scale=temp1.scale) pre.dict <- predict(fm2,newdata=newdata, type='prob') pre.dict.m <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata))) colnames(pre.dict.m) <- c("vote.0", "vote.1", "vote.2", "vote.3") newdata.pred <- cbind.data.frame(newdata, pre.dict.m) newdata.pred$expected.votes <- newdata.pred$vote.1 + 2*newdata.pred$vote.2 + 3*newdata.pred$vote.3 newdata.pred$Match_id<-paste(newdata.pred$Season, newdata.pred$Round.x, newdata.pred$Home.team) sum1 <- aggregate(vote.1~Match_id, data = newdata.pred, FUN = sum ); names(sum1) <- c("Match_id", "sum.vote.1"); sum2 <- aggregate(vote.2~Match_id, data = newdata.pred, FUN = sum ); names(sum2) <- c("Match_id", "sum.vote.2"); sum3 <- aggregate(vote.3~Match_id, data = newdata.pred, FUN = sum ); names(sum3) <- c("Match_id", "sum.vote.3"); #### Step 3: Add sum of each vote by matchId to big table newdata.pred <- merge(newdata.pred, sum1, by = "Match_id") newdata.pred <- merge(newdata.pred, sum2, by = "Match_id") newdata.pred <- merge(newdata.pred, sum3, by = "Match_id") #### Step 4: Add std1/2/3 newdata.pred$std.1 <- (newdata.pred$sum.vote.1/newdata.pred$vote.1)^-1 newdata.pred$std.2 <- (newdata.pred$sum.vote.2/newdata.pred$vote.2)^-1 newdata.pred$std.3 <- (newdata.pred$sum.vote.3/newdata.pred$vote.3)^-1 #### Step 5: Expected standard game vote newdata.pred$exp_std_game_vote <- newdata.pred$std.1 + 2*newdata.pred$std.2 + 3*newdata.pred$std.3 #### Step 6: List of winners newdata.pred$PlayerName<-paste(newdata.pred$Player," ",newdata.pred$Playing.for) winners.stdgame <- aggregate(exp_std_game_vote~PlayerName, data = newdata.pred, FUN = sum ); winners.stdgame <- winners.stdgame[order(-winners.stdgame$exp_std_game_vote), ] winners.stdgame[1:10, ] ## PlayerName exp_std_game_vote ## 168 Dustin Martin Richmond 33.78545 ## 490 Patrick Dangerfield Geelong 31.31288 ## 617 Tom Mitchell Hawthorn 28.73881 ## 157 Dayne Zorko Brisbane Lions 18.52044 ## 513 Rory Sloane Adelaide 16.85977 ## 340 Josh Kelly Greater Western Sydney 16.70946 ## 653 Zach Merrett Essendon 16.42835 ## 424 Matt Crouch Adelaide 16.00151 ## 341 Josh Kennedy Sydney 15.23837 ## 483 Ollie Wines Port Adelaide 15.08383
What you might notice about this is that the order is slightly different to what is presented earlier. But what is the different? The out.sample uses the afltables data only no footywire data is used at all. In other words the above script just showed that there are differences between the data on AFLtables and footywire!!!!!
Recall earlier, one of the ways that Brownlow Professor makes an assessment of his model is he looks at home many times that the player ranked within the top 5 get the 3 votes 95% of the time, 84% of the time for the 2 votes and 74% of the time for the one vote.
Well using the above method, we could check it for the 2017 AFL season.
To do this we need to join on the actual votes that happened in 2018 to the predicted probabilities based on our model
out.sample <- subset(df, Season %in% c(2017)) out.sample$Brownlow.Votes <- factor(out.sample$Brownlow.Votes) out.sample<-out.sample%>%filter(Round.x %in% c("1","2","3","4","5","6","7","8", "9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24")) names(out.sample) ## [1] "Date" "First.name" ## [3] "Surname" "Season" ## [5] "Round.x" "Playing.for" ## [7] "Brownlow.Votes_sum" "Home.team" ## [9] "Away.team" "Kicks" ## [11] "Marks" "Handballs" ## [13] "Goals" "Behinds" ## [15] "Hit.Outs" "Tackles" ## [17] "Rebounds" "Inside.50s" ## [19] "Clearances" "Clangers" ## [21] "Frees.For" "Frees.Against" ## [23] "Contested.Possessions" "Uncontested.Possessions" ## [25] "Contested.Marks" "Marks.Inside.50" ## [27] "One.Percenters" "Bounces" ## [29] "Goal.Assists" "Round.y" ## [31] "Kicks_sum" "Marks_sum" ## [33] "Handballs_sum" "Goals_sum" ## [35] "Behinds_sum" "Hit.Outs_sum" ## [37] "Tackles_sum" "Rebounds_sum" ## [39] "Inside.50s_sum" "Clearances_sum" ## [41] "Clangers_sum" "Frees.For_sum" ## [43] "Frees.Against_sum" "Contested.Possessions_sum" ## [45] "Uncontested.Possessions_sum" "Contested.Marks_sum" ## [47] "Marks.Inside.50_sum" "One.Percenters_sum" ## [49] "Bounces_sum" "Goal.Assists_sum" ## [51] "Round" "Points" ## [53] "Game" "margin" ## [55] "kick.ratio" "Marks.ratio" ## [57] "handball.ratio" "Goals.ratio" ## [59] "behinds.ratio" "hitouts.ratio" ## [61] "tackles.ratio" "rebounds.ratio" ## [63] "inside50s.ratio" "clearances.ratio" ## [65] "clangers.ratio" "freefors.ratio" ## [67] "freesagainst.ratio" "Contested.Possessions.ratio" ## [69] "Uncontested.Possessions.ratio" "contested.marks.ratio" ## [71] "marksinside50.ratio" "one.percenters.ratio" ## [73] "bounces.ratio" "goal.assists.ratio" ## [75] "disposals.ratio" "Brownlow.Votes" out.sample$Player<-paste(out.sample$First.name,out.sample$Surname) out.sample<-out.sample%>%dplyr::select(Player, Date, Season, Round.x, Playing.for, Home.team, Away.team, margin:Brownlow.Votes) fulldataframe<-left_join(newdata.pred,out.sample,by=c("Date"="Date", "Playing.for"="Playing.for", "Player"="Player"))
From there we might want to create a pivot table where we can see how often the highest ranked person by say expected standardised votes gets the 3 votes and so on.
To do that we can use this bit of script below, it requires a handy package called janitor
xx<-fulldataframe %>% group_by(Match_id) %>% mutate(my_ranks = order(order(exp_std_game_vote, decreasing=TRUE))) table(xx$my_ranks, xx$Brownlow.Votes) ## ## 0 1 2 3 ## 1 32 22 35 109 ## 2 63 33 59 43 ## 3 112 31 33 22 ## 4 133 36 24 5 ## 5 170 17 8 3 ## 6 171 11 9 7 ## 7 183 5 6 4 ## 8 181 11 5 1 ## 9 185 8 4 1 ## 10 190 6 2 0 ## 11 193 2 2 1 ## 12 195 2 1 0 ## 13 194 3 1 0 ## 14 193 2 2 1 ## 15 196 0 2 0 ## 16 196 1 1 0 ## 17 196 1 1 0 ## 18 197 0 1 0 ## 19 196 1 0 1 ## 20 196 1 1 0 ## 21 195 2 1 0 ## 22 198 0 0 0 ## 23 198 0 0 0 ## 24 197 1 0 0 ## 25 197 1 0 0 ## 26 197 1 0 0 ## 27 198 0 0 0 ## 28 198 0 0 0 ## 29 198 0 0 0 ## 30 198 0 0 0 ## 31...
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.