But We Won Everywhere but the Scoreboard
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Something that gets to many a footy fan, is the feeling that your team has won the game in most areas expect on the scoreboard.
Thinking about this statement a little bit deeper has the following implication. That there are some areas of the game, that if you win, you tend to win the game.
We can think about this by putting it into a model.
The data we will use here is the game differentials of the various game statistics that are available using fitzRoy
The general idea is its fairly obvious if you score more than the opposition you win. Shocking I know! What we are trying to do here, is come up with a concept of we won other things but not the scoreboard, usually when we win these things we tend to win a game.
So we know thanks to fitzRoy we have access to the data on footywire, which has a few extra variables that are not on afltables.
So imagine this scenario, you don’t have access to the scores, only the in game statistics as per these pages, what variables/differentials would you look at to decide to will win?
We know that for example, looking only at free kick differential isn’t very predictive. Is there a combination of differentials that when a team wins they are more likely to win? Lets build a model to find out.
The model I will use here, is a logistic regression model with the binary outcome being win/loss.
Step One Build the dataset
library(tidyverse) ## -- Attaching packages -------------------------------- tidyverse 1.2.1 -- ## v ggplot2 2.2.1 v purrr 0.2.5 ## v tibble 1.4.2 v dplyr 0.7.5 ## v tidyr 0.8.1 v stringr 1.3.1 ## v readr 1.1.1 v forcats 0.3.0 ## -- Conflicts ----------------------------------- tidyverse_conflicts() -- ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() fitzRoy::player_stats%>% filter(Season>2014)%>% select(-Player)%>% group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>% summarise_all(.funs = sum)%>% group_by(Match_id)%>% arrange(Match_id)%>% mutate(diff_cp=c(-diff(CP), diff(CP))) ## # A tibble: 1,462 x 43 ## # Groups: Match_id [731] ## Date Season Round Team Status Opposition Venue Match_id CP ## <date> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <int> ## 1 2015-04-02 2015 Round~ Carlton Home Richmond MCG 5964 127 ## 2 2015-04-02 2015 Round~ Richmo~ Away Carlton MCG 5964 121 ## 3 2015-04-04 2015 Round~ Gold C~ Away Melbourne MCG 5965 136 ## 4 2015-04-04 2015 Round~ Melbou~ Home Gold Coast MCG 5965 150 ## 5 2015-04-04 2015 Round~ Essend~ Away Sydney ANZ ~ 5966 150 ## 6 2015-04-04 2015 Round~ Sydney Home Essendon ANZ ~ 5966 176 ## 7 2015-04-04 2015 Round~ Brisba~ Home Collingwo~ Gabba 5967 152 ## 8 2015-04-04 2015 Round~ Collin~ Away Brisbane Gabba 5967 165 ## 9 2015-04-04 2015 Round~ West C~ Away Western B~ Etih~ 5968 137 ## 10 2015-04-04 2015 Round~ Wester~ Home West Coast Etih~ 5968 156 ## # ... with 1,452 more rows, and 34 more variables: UP <int>, ED <int>, ## # DE <dbl>, CM <int>, GA <int>, MI5 <int>, One.Percenters <int>, ## # BO <int>, TOG <int>, K <int>, HB <int>, D <int>, M <int>, G <int>, ## # B <int>, T <int>, HO <int>, GA1 <int>, I50 <int>, CL <int>, CG <int>, ## # R50 <int>, FF <int>, FA <int>, AF <int>, SC <int>, CCL <int>, ## # SCL <int>, SI <int>, MG <int>, TO <int>, ITC <int>, T5 <int>, ## # diff_cp <int>
Something to remember with fitzRoy is that footywire doesn’t have tackles inside 50, meters gained etc for games previous to 2015, so this becomes our first filter
The next thing we do, is we select
all the columns except for Player, the reason being is that we don’t need it and also its nice to see how to “deselect” columns as well as select them.
Remember we want data at a game level for each team and we want to be able to come up with the in game differential, this is where our next group_by
comes in handy and its related to the summarise_all
.
You might be thinking jeez mate that’s a lot to group by can’t you just use Match_id as they are already unique. Yes that is true, I could have, but one thing about summarise_all
is that is summarises every column that is not in the group_by
. You can test this out by running below, it should spit out an error message. Error in summarise_impl(.data, dots) : Evaluation error: invalid 'type' (character) of argument.
fitzRoy::player_stats%>% filter(Season>2014)%>% select(-Player)%>% group_by( Match_id)%>% summarise_all(.funs = sum)
After the summarise_all
, we group by Match_id
so we can find the differentials by Match_id
so we can come up with as an example the contested possession differential diff_cp
. I like to arrange
the dataset so I can do sanity checks vs footywire.
So the next thing you might be thinking, is jeez doing that for all variables finding their differentials seems a bit tedious.
That is where we can use mutate_if
, we can just mutate
all columns that is.numeric
and is not in the group_by
which is another way we could have done original summarise_all
instead we could have used summarise_if
BUT match_id is numeric!
You can check this by using
str(fitzRoy::player_stats) ## 'data.frame': 76296 obs. of 43 variables: ## $ Date : Date, format: "2010-03-25" "2010-03-25" ... ## $ Season : num 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ... ## $ Round : chr "Round 1" "Round 1" "Round 1" "Round 1" ... ## $ Venue : chr "MCG" "MCG" "MCG" "MCG" ... ## $ Player : chr "Daniel Connors" "Daniel Jackson" "Brett Deledio" "Ben Cousins" ... ## $ Team : chr "Richmond" "Richmond" "Richmond" "Richmond" ... ## $ Opposition : chr "Carlton" "Carlton" "Carlton" "Carlton" ... ## $ Status : chr "Home" "Home" "Home" "Home" ... ## $ Match_id : num 5089 5089 5089 5089 5089 ... ## $ CP : int 8 11 7 9 8 6 7 7 6 7 ... ## $ UP : int 15 10 14 10 10 12 10 6 7 5 ... ## $ ED : int 16 14 16 11 13 16 13 7 10 7 ... ## $ DE : num 66.7 60.9 76.2 57.9 68.4 88.9 76.5 50 76.9 53.8 ... ## $ CM : int 0 1 0 0 1 0 0 0 1 0 ... ## $ GA : int 0 0 0 1 0 0 0 1 0 0 ... ## $ MI5 : int 0 0 0 0 0 0 0 2 0 0 ... ## $ One.Percenters: int 1 0 0 0 0 1 5 2 5 1 ... ## $ BO : int 0 0 0 0 1 0 0 0 0 0 ... ## $ TOG : int 69 80 89 69 77 81 84 80 100 88 ... ## $ K : int 14 11 12 13 11 5 7 9 6 7 ... ## $ HB : int 10 12 9 6 8 13 10 5 7 6 ... ## $ D : int 24 23 21 19 19 18 17 14 13 13 ... ## $ M : int 3 2 5 1 6 4 2 3 4 2 ... ## $ G : int 0 0 1 1 0 0 0 1 0 0 ... ## $ B : int 0 0 0 0 0 0 0 1 0 0 ... ## $ T : int 1 5 6 1 1 3 2 5 4 4 ... ## $ HO : int 0 0 0 0 0 0 0 0 0 0 ... ## $ GA1 : int 0 0 0 1 0 0 0 1 0 0 ... ## $ I50 : int 2 8 4 1 2 2 1 5 0 3 ... ## $ CL : int 2 5 3 2 3 3 4 4 1 1 ... ## $ CG : int 4 4 4 3 3 1 2 0 2 0 ... ## $ R50 : int 6 1 3 4 2 0 2 0 3 1 ... ## $ FF : int 2 2 1 1 0 0 1 4 1 1 ... ## $ FA : int 0 0 2 0 2 1 0 0 0 0 ... ## $ AF : int 77 85 94 65 65 62 56 77 61 56 ... ## $ SC : int 85 89 93 70 63 72 79 73 68 59 ... ## $ CCL : int NA NA NA NA NA NA NA NA NA NA ... ## $ SCL : int NA NA NA NA NA NA NA NA NA NA ... ## $ SI : int NA NA NA NA NA NA NA NA NA NA ... ## $ MG : int NA NA NA NA NA NA NA NA NA NA ... ## $ TO : int NA NA NA NA NA NA NA NA NA NA ... ## $ ITC : int NA NA NA NA NA NA NA NA NA NA ... ## $ T5 : int NA NA NA NA NA NA NA NA NA NA ...
So hopefully you have seen a couple of short cuts that come from knowing the dataset and this is an example of why its important to do your checks!
Here is a great a guide to bad data which provides a nice series of steps in how to check things that might commonly go wrong.
fitzRoy::player_stats%>% filter(Season>2014)%>% select(-Player,-Date)%>% group_by(Season, Round, Team, Status, Opposition, Venue, Match_id)%>% summarise_all(.funs = sum)%>% group_by(Match_id)%>% arrange(Match_id)%>% mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.)))) ## # A tibble: 1,462 x 76 ## # Groups: Match_id [731] ## Season Round Team Status Opposition Venue Match_id CP UP ED ## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <int> <int> ## 1 2015 Round~ Carlt~ Home Richmond MCG 5964 127 199 233 ## 2 2015 Round~ Richm~ Away Carlton MCG 5964 121 227 273 ## 3 2015 Round~ Gold ~ Away Melbourne MCG 5965 136 165 203 ## 4 2015 Round~ Melbo~ Home Gold Coast MCG 5965 150 198 246 ## 5 2015 Round~ Essen~ Away Sydney ANZ ~ 5966 150 171 226 ## 6 2015 Round~ Sydney Home Essendon ANZ ~ 5966 176 192 248 ## 7 2015 Round~ Brisb~ Home Collingwo~ Gabba 5967 152 211 246 ## 8 2015 Round~ Colli~ Away Brisbane Gabba 5967 165 179 231 ## 9 2015 Round~ West ~ Away Western B~ Etih~ 5968 137 236 278 ## 10 2015 Round~ Weste~ Home West Coast Etih~ 5968 156 214 275 ## # ... with 1,452 more rows, and 66 more variables: DE <dbl>, CM <int>, ## # GA <int>, MI5 <int>, One.Percenters <int>, BO <int>, TOG <int>, ## # K <int>, HB <int>, D <int>, M <int>, G <int>, B <int>, T <int>, ## # HO <int>, GA1 <int>, I50 <int>, CL <int>, CG <int>, R50 <int>, ## # FF <int>, FA <int>, AF <int>, SC <int>, CCL <int>, SCL <int>, ## # SI <int>, MG <int>, TO <int>, ITC <int>, T5 <int>, ## # Season_difference <dbl>, CP_difference <int>, UP_difference <int>, ## # ED_difference <int>, DE_difference <dbl>, CM_difference <int>, ## # GA_difference <int>, MI5_difference <int>, ## # One.Percenters_difference <int>, BO_difference <int>, ## # TOG_difference <int>, K_difference <int>, HB_difference <int>, ## # D_difference <int>, M_difference <int>, G_difference <int>, ## # B_difference <int>, T_difference <int>, HO_difference <int>, ## # GA1_difference <int>, I50_difference <int>, CL_difference <int>, ## # CG_difference <int>, R50_difference <int>, FF_difference <int>, ## # FA_difference <int>, AF_difference <int>, SC_difference <int>, ## # CCL_difference <int>, SCL_difference <int>, SI_difference <int>, ## # MG_difference <int>, TO_difference <int>, ITC_difference <int>, ## # T5_difference <int>
Step Two – Joining Datasets
Now we could just sum up the goals by each player and their behinds, but this would miss out on rushed behinds. So lets join our footywire dataset with the afltables match results
df<-fitzRoy::player_stats%>% filter(Season>2014)%>% select(-Player)%>% group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>% summarise_all(.funs = sum)%>% group_by(Match_id)%>% arrange(Match_id)%>% mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.)))) df2<-fitzRoy::match_results df2<-df2%>%filter(Season>2014) df3<-select(df2, Date, Round, Home.Team, Home.Points) df4<-select(df2, Date, Round, Away.Team, Away.Points) colnames(df3)[3]<-"Team" colnames(df3)[4]<-"Points" colnames(df4)[3]<-"Team" colnames(df4)[4]<-"Points" df5<-rbind(df4,df3) df5<-df5 %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane")) df5<-df5 %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs")) df6<-inner_join(df,df5, by=c("Team","Date")) dataset_columns <- c(1,2,4,6,7,8,44:80,81) dataset<-df6%>%group_by(Match_id)%>% arrange(Match_id)%>% mutate(Margin=c(-diff(Points), diff(Points)))%>% mutate(Win_loss=if_else(Margin>0,1,0,NULL))%>% select(dataset_columns)
Taking Submissions.
I asked twitter, what areas do you think are important to win, to win a game. I got a response and here we go!
So from our above script, all we need to be able to add in is goal accuracy as a predictor or goal accuracy differential?
df<-fitzRoy::player_stats%>% filter(Season>2014)%>% select(-Player)%>% group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>% summarise_all(.funs = sum)%>% group_by(Match_id)%>% arrange(Match_id)%>% mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.)))) df2<-fitzRoy::match_results df2<-df2%>%filter(Season>2014) df3<-select(df2, Date, Round, Home.Team, Home.Points,Home.Goals,Home.Behinds) df4<-select(df2, Date, Round, Away.Team, Away.Points,Away.Goals,Away.Behinds) df3$Accuracy<-(df3$Home.Goals/(df3$Home.Goals+df3$Home.Behinds)) colnames(df3)[3]<-"Team" colnames(df3)[4]<-"Points" colnames(df3)[5]<-"Goals" colnames(df3)[6]<-"Behinds" df4$Accuracy<-(df4$Away.Goals/(df4$Away.Goals+df4$Away.Behinds)) colnames(df4)[3]<-"Team" colnames(df4)[4]<-"Points" colnames(df4)[5]<-"Goals" colnames(df4)[6]<-"Behinds" df5<-rbind(df4,df3) df5<-df5 %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane")) df5<-df5 %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs")) df6<-inner_join(df,df5, by=c("Team","Date")) dataset_columns <- c(1,2,4,6,7,8,44:77,82:84) dataset<-df6%>%group_by(Match_id)%>% arrange(Match_id)%>% mutate(Margin=c(-diff(Points), diff(Points)))%>% mutate(Win_loss=if_else(Margin>0,1,0,NULL))%>% select(dataset_columns)
Building the Logistic Regression Model
library(aod) library(ordinal) ## ## Attaching package: 'ordinal' ## The following object is masked from 'package:dplyr': ## ## slice library(lme4) ## Loading required package: Matrix ## ## Attaching package: 'Matrix' ## The following object is masked from 'package:tidyr': ## ## expand ## ## Attaching package: 'lme4' ## The following objects are masked from 'package:ordinal': ## ## ranef, VarCorr in.sample <- subset(dataset, Season %in% c(2015:2017)) #in.sample <- subset(mydata, year ==2008) out.sample <- subset(dataset, Season == 2018) in.sample$Win_loss <- factor(in.sample$Win_loss) out.sample$Win_loss<-factor(out.sample$Win_loss)
To know which columsn we want to scale an easy way is to go names(dataset) which should print out the column names.
temp1<-scale(in.sample[,7:40]) in.sample[,7:40]<-temp1 #attributes(temp1) temp1.center<-attr(temp1,"scaled:center") temp1.scale<-attr(temp1,"scaled:scale") m <- glm(Win_loss ~I50_difference+ Accuracy+ R50_difference+ CCL_difference+ SCL_difference+ MI5_difference , data = in.sample, family =binomial) summary(m) ## ## Call: ## glm(formula = Win_loss ~ I50_difference + Accuracy + R50_difference + ## CCL_difference + SCL_difference + MI5_difference, family = binomial, ## data = in.sample) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.43728 -0.28649 -0.00212 0.26964 2.63087 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.45340 0.58793 -5.874 4.26e-09 *** ## I50_difference 7.17314 0.51782 13.853 < 2e-16 *** ## Accuracy 6.43892 1.09304 5.891 3.84e-09 *** ## R50_difference 5.11442 0.38536 13.272 < 2e-16 *** ## CCL_difference 0.16421 0.11256 1.459 0.145 ## SCL_difference 0.04502 0.10581 0.425 0.671 ## MI5_difference 0.87950 0.16978 5.180 2.22e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 1718.92 on 1239 degrees of freedom ## Residual deviance: 620.62 on 1233 degrees of freedom ## AIC: 634.62 ## ## Number of Fisher Scoring iterations: 7 newdata <- out.sample[ , -ncol(out.sample)] newdata[,7:40]<-scale(newdata[,7:40],center=temp1.center,scale=temp1.scale) pre.dict <- predict(m,newdata=newdata, type="response") pre.dict.m <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata))) colnames(pre.dict.m) <- c("prob.win") newdata.pred <- cbind.data.frame(newdata, pre.dict.m) newdata.pred%>% select(Team, Opposition, Venue, Margin, prob.win)%>% filter(Margin<0)%>% arrange(desc(prob.win))%>% top_n(10) ## Selecting by prob.win ## Team Opposition Venue Margin prob.win ## 1 Sydney North Melbourne SCG -2 0.9368923 ## 2 Sydney Adelaide SCG -10 0.8578722 ## 3 Essendon Carlton MCG -13 0.8239973 ## 4 Gold Coast St Kilda Metricon Stadium -2 0.7796465 ## 5 Essendon Fremantle Optus Stadium -16 0.5778875 ## 6 St Kilda West Coast Optus Stadium -13 0.5625316 ## 7 North Melbourne Richmond Etihad Stadium -10 0.4065246 ## 8 Port Adelaide Hawthorn UTAS Stadium -3 0.4063489 ## 9 Brisbane Collingwood Gabba -7 0.3765427 ## 10 Adelaide Port Adelaide Adelaide Oval -5 0.3601077
The good thing about having a template sorted out, is that you can make quick changes as you think of other variables you want to test.
For example JT asked about kicks, inside 50s, marks in 50, tackles in 50 and meters gained.
Well lets look at his list of games
m <- glm(Win_loss ~I50_difference+ K_difference+ I50_difference+ MI5_difference+ T5_difference+ MG_difference, data = in.sample, family =binomial) summary(m) ## ## Call: ## glm(formula = Win_loss ~ I50_difference + K_difference + I50_difference + ## MI5_difference + T5_difference + MG_difference, family = binomial, ## data = in.sample) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.68051 -0.29449 -0.00427 0.29706 2.71729 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.05097 0.10101 -0.505 0.613808 ## I50_difference -0.50237 0.17403 -2.887 0.003892 ** ## K_difference 0.47670 0.15356 3.104 0.001907 ** ## MI5_difference 0.63284 0.17320 3.654 0.000258 *** ## T5_difference -0.03405 0.11939 -0.285 0.775503 ## MG_difference 4.19630 0.31214 13.444 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 1718.92 on 1239 degrees of freedom ## Residual deviance: 631.05 on 1234 degrees of freedom ## AIC: 643.05 ## ## Number of Fisher Scoring iterations: 7 newdata <- out.sample[ , -ncol(out.sample)] newdata[,7:40]<-scale(newdata[,7:40],center=temp1.center,scale=temp1.scale) pre.dict <- predict(m,newdata=newdata, type="response") pre.dict.m <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata))) colnames(pre.dict.m) <- c("prob.win") newdata.pred <- cbind.data.frame(newdata, pre.dict.m) newdata.pred%>% select(Team, Opposition, Venue, Margin, prob.win)%>% filter(Margin<0)%>% arrange(desc(prob.win))%>% top_n(10) ## Selecting by prob.win ## Team Opposition Venue Margin prob.win ## 1 Sydney Adelaide SCG -10 0.9441431 ## 2 Essendon Carlton MCG -13 0.9373812 ## 3 Brisbane Gold Coast Gabba -5 0.8991898 ## 4 Western Bulldogs Sydney Etihad Stadium -7 0.7757990 ## 5 Collingwood Geelong MCG -21 0.7610453 ## 6 Port Adelaide Geelong Adelaide Oval -34 0.6678467 ## 7 Sydney North Melbourne SCG -2 0.5908486 ## 8 Geelong West Coast Optus Stadium -15 0.5313548 ## 9 Hawthorn West Coast Etihad Stadium -15 0.4912265 ## 10 GWS Sydney SCG -16 0.3831652
Another example Troy Wheatley.
m <- glm(Win_loss ~I50_difference+ MG_difference+ CP_difference+ CM_difference+ CCL_difference+ CL_difference+ HO_difference+ ITC_difference+ D_difference+ CG_difference+ R50_difference+ One.Percenters_difference, data = in.sample, family =binomial) summary(m) ## ## Call: ## glm(formula = Win_loss ~ I50_difference + MG_difference + CP_difference + ## CM_difference + CCL_difference + CL_difference + HO_difference + ## ITC_difference + D_difference + CG_difference + R50_difference + ## One.Percenters_difference, family = binomial, data = in.sample) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -3.09976 -0.24111 -0.00167 0.23880 3.13712 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.05870 0.10841 -0.541 0.58818 ## I50_difference 3.63170 0.68743 5.283 1.27e-07 *** ## MG_difference 3.06213 0.41218 7.429 1.09e-13 *** ## CP_difference -0.13434 0.16628 -0.808 0.41915 ## CM_difference 0.18235 0.13033 1.399 0.16176 ## CCL_difference 0.40136 0.14838 2.705 0.00683 ** ## CL_difference 0.06447 0.19133 0.337 0.73613 ## HO_difference 0.15043 0.11475 1.311 0.18988 ## ITC_difference 0.40376 0.18444 2.189 0.02859 * ## D_difference -0.12950 0.17347 -0.747 0.45534 ## CG_difference -0.54681 0.12910 -4.235 2.28e-05 *** ## R50_difference 3.20051 0.48150 6.647 2.99e-11 *** ## One.Percenters_difference -0.07831 0.12351 -0.634 0.52606 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 1718.92 on 1239 degrees of freedom ## Residual deviance: 556.62 on 1227 degrees of freedom ## AIC: 582.62 ## ## Number of Fisher Scoring iterations: 7 newdata <- out.sample[ , -ncol(out.sample)] newdata[,7:40]<-scale(newdata[,7:40],center=temp1.center,scale=temp1.scale) pre.dict <- predict(m,newdata=newdata, type="response") pre.dict.m <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata))) colnames(pre.dict.m) <- c("prob.win") newdata.pred <- cbind.data.frame(newdata, pre.dict.m) newdata.pred%>% select(Team, Opposition, Venue, Margin, prob.win)%>% filter(Margin<0)%>% arrange(desc(prob.win))%>% top_n(10) ## Selecting by prob.win ## Team Opposition Venue Margin prob.win ## 1 Essendon Carlton MCG -13 0.9225119 ## 2 Sydney Adelaide SCG -10 0.8943902 ## 3 Sydney North Melbourne SCG -2 0.8911554 ## 4 Adelaide Fremantle Optus Stadium -3 0.5566460 ## 5 Collingwood Geelong MCG -21 0.5081333 ## 6 GWS Sydney SCG -16 0.4832603 ## 7 Gold Coast St Kilda Metricon Stadium -2 0.4386997 ## 8 North Melbourne Richmond Etihad Stadium -10 0.4137897 ## 9 Hawthorn West Coast Etihad Stadium -15 0.3802405 ## 10 Brisbane Port Adelaide Adelaide Oval -5 0.3625098
We have an idea from insightlane
df<-fitzRoy::player_stats%>% filter(Season>2014)%>% select(-Player)%>% group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>% summarise_all(.funs = sum)%>% mutate(disposaltoturnover=D/TO)%>% group_by(Match_id)%>% arrange(Match_id)%>% mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.)))) df2<-fitzRoy::match_results df2<-df2%>%filter(Season>2014) df3<-select(df2, Date, Round, Home.Team, Home.Points,Home.Goals,Home.Behinds) df4<-select(df2, Date, Round, Away.Team, Away.Points,Away.Goals,Away.Behinds) df3$Accuracy<-(df3$Home.Goals/(df3$Home.Goals+df3$Home.Behinds)) colnames(df3)[3]<-"Team" colnames(df3)[4]<-"Points" colnames(df3)[5]<-"Goals" colnames(df3)[6]<-"Behinds" df4$Accuracy<-(df4$Away.Goals/(df4$Away.Goals+df4$Away.Behinds)) colnames(df4)[3]<-"Team" colnames(df4)[4]<-"Points" colnames(df4)[5]<-"Goals" colnames(df4)[6]<-"Behinds" df5<-rbind(df4,df3) df5<-df5 %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane")) df5<-df5 %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs")) df6<-inner_join(df,df5, by=c("Team","Date")) dataset_columns <- c(1,2,4,6,7,8,45:79,81:86) dataset<-df6%>%group_by(Match_id)%>% arrange(Match_id)%>% mutate(Margin=c(-diff(Points), diff(Points)))%>% mutate(Win_loss=if_else(Margin>0,1,0,NULL))%>% select(dataset_columns) in.sample <- subset(dataset, Season %in% c(2015:2017)) #in.sample <- subset(mydata, year ==2008) out.sample <- subset(dataset, Season == 2018) in.sample$Win_loss <- factor(in.sample$Win_loss) out.sample$Win_loss<-factor(out.sample$Win_loss) temp1<-scale(in.sample[,7:41]) in.sample[,7:41]<-temp1 #attributes(temp1) temp1.center<-attr(temp1,"scaled:center") temp1.scale<-attr(temp1,"scaled:scale") m <- glm(Win_loss ~K_difference+ MG_difference+ disposaltoturnover_difference, data = in.sample, family =binomial) summary(m) ## ## Call: ## glm(formula = Win_loss ~ K_difference + MG_difference + disposaltoturnover_difference, ## family = binomial, data = in.sample) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.63495 -0.30591 -0.00522 0.30866 2.67106 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.04937 0.09940 -0.497 0.619 ## K_difference 0.70737 0.17504 4.041 5.32e-05 *** ## MG_difference 4.05987 0.29071 13.965 < 2e-16 *** ## disposaltoturnover_difference -0.13159 0.18617 -0.707 0.480 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 1716.15 on 1237 degrees of freedom ## Residual deviance: 651.09 on 1234 degrees of freedom ## (2 observations deleted due to missingness) ## AIC: 659.09 ## ## Number of Fisher Scoring iterations: 7 newdata <- out.sample[ , -ncol(out.sample)] newdata[,7:41]<-scale(newdata[,7:41],center=temp1.center,scale=temp1.scale) pre.dict <- predict(m,newdata=newdata, type="response") pre.dict.m <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata))) colnames(pre.dict.m) <- c("prob.win") newdata.pred <- cbind.data.frame(newdata, pre.dict.m) newdata.pred%>% select(Team, Opposition, Venue, Margin, prob.win)%>% filter(Margin<0)%>% arrange(desc(prob.win))%>% top_n(10) ## Selecting by prob.win ## Team Opposition Venue Margin prob.win ## 1 Sydney Adelaide SCG -10 0.9249691 ## 2 Essendon Carlton MCG -13 0.8865205 ## 3 Western Bulldogs Sydney Etihad Stadium -7 0.8724064 ## 4 Brisbane Gold Coast Gabba -5 0.8617977 ## 5 Port Adelaide Geelong Adelaide Oval -34 0.8094733 ## 6 Collingwood Geelong MCG -21 0.7254149 ## 7 Sydney North Melbourne SCG -2 0.6606187 ## 8 Hawthorn West Coast Etihad Stadium -15 0.5825785 ## 9 GWS Sydney SCG -16 0.5081239 ## 10 Brisbane Collingwood Gabba -7 0.3885346
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.