[This article was first published on Wiekvoet, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
After reading data, making a predictions display and building a football data model it is time to put this to validate a bit more (regression plots) and put to usage. It appears that the regression plots in the car package were not very informative for this model, except to find unexpected results. The predictions are not good at all.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Roda JC-FC Utrecht
First of all, when I started this post on Friday, Roda JC played against FC Utrecht, 0-1. FC Utrecht goes sub top. Frankly, I am not very surprised. While I would expect Roda JC to win this game (p=0.46), chance for FC Utrecht to win is 0.33, so that is not so strange but is disappointing for my model.fbpredict(model3,”Roda JC”,”FC Utrecht”)
$details
Roda JC in rows against FC Utrecht in columns
0 1 2 3 4 5 6 7 8 9
0 0.0241 0.0489 0.0494 0.0334 0.0169 0.0068 0.0023 0.0007 0.0002 0
1 0.0410 0.0830 0.0841 0.0567 0.0287 0.0116 0.0039 0.0011 0.0003 0.0001
2 0.0349 0.0706 0.0714 0.0482 0.0244 0.0099 0.0033 0.0010 0.0002 0.0001
3 0.0198 0.0400 0.0405 0.0273 0.0138 0.0056 0.0019 0.0005 0.0001 0
4 0.0084 0.0170 0.0172 0.0116 0.0059 0.0024 0.0008 0.0002 0.0001 0
5 0.0029 0.0058 0.0058 0.0039 0.0020 0.0008 0.0003 0.0001 0 0
6 0.0008 0.0016 0.0017 0.0011 0.0006 0.0002 0.0001 0 0 0
7 0.0002 0.0004 0.0004 0.0003 0.0001 0.0001 0 0 0 0
8 0 0.0001 0.0001 0.0001 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
$`summary chances`
Roda JC equal FC Utrecht
0.4580659 0.2126926 0.3291782
Diagnostic plots
But, I should have looked at some of the diagnostic plots first. Luckily between the stats and the car package we have a nice collection of tools.library(car)
infIndexPlot(model3)
A number of things are interesting. Just under index 100 is the most influential observation. We can track this to row 97. This represents Groningen making six ! goals against Feyenoord. That’s the best FC Groningen ever did against Feyenoord. It was also found to be a remarkable result at that time, so if this were not influential or outlyingI should be worried.
outlierTest(model3)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferonni p
97 3.634049 0.00027901 0.17075
OffenseClub DefenseClub Goals OffThuis
97 FC Groningen Feyenoord 6 1
A second ‘feature’in the plots is the higher hat value in the first part of the series. This has to do with the variable OffThuis. Removing OffThuis makes the effect go away. Note that the diagonal of the hat matrix is everywhere the same (within numerical accuracy):
mm <- model.matrix(model3)
table(diag(mm %*% solve(t(mm) %*% mm) %*% t(mm)))
0.0588235294117647 0.0588235294117648 0.0588235294117649
179 427 6
mm <- model.matrix(model3)
table(diag(mm %*% solve(t(mm) %*% mm) %*% t(mm)))
0.0588235294117647 0.0588235294117648 0.0588235294117649
179 427 6
What would be different are the weights of the data. In the non-gaussian GLM framework these are dependent on the expected values. The slightly higher expected values when the attacking club plays home make the hat matrix in general a bit higher. Interpreting these plots is clearly more difficult in the non linear regression case.
ResidualPlot
Residual plot does not reveal much more than the previous plot. Record 97 has a relatively high residual. The advantage of this plot is that the outliers are more clearly numbered. In the sub-plot bottom right the bunching of the residuals in lines is because the observed data is non-continuous; from bottom to top these are the values for 0 to 7 goals. No goals gives a negative Pearson residual. No surprise there.
residualPlots(model3)
Predicting
The teams have been playing in the weekend. The model can be used to predict the outcomes. To start, here are the results. The data in gray represent two teams new in Eredivisie in the season, so I got no predictions.
Roda JC – Utrecht 0-1
PEC Zwolle – Groningen 1-2
RKC Waalwijk – VVV 1-1
Vitesse – Heracles 1-1
NEC – Willem II 0-0
ADO Den Haag – Ajax 1-1
Twente – Heerenveen 1-0
NAC Breda – AZ 2-1
PSV – Feyenoord 3-0
These are the expectations:
club1 club2 win1 equal win2
1 Roda JC FC Utrecht 0.4580659 0.2126926 0.3291782
2 RKC Waalwijk VVV-Venlo 0.6076020 0.2180298 0.1743364
3 Vitesse Heracles Almelo 0.5723334 0.2275537 0.2000907
4 ADO Den Haag Ajax 0.1037534 0.1511710 0.7446496
5 FC Twente SC Heerenveen 0.6605607 0.1558135 0.1822923
6 NAC Breda AZ 0.2539698 0.2627759 0.4832506
7 PSV Feyenoord 0.5055082 0.2147899 0.2796468
Needless to say, I am not impressed. Choosing the most probable outcome as result, I have two correct Twente – Heerenveen and PSV – Feyenoord.
Last week flo2speak commented that (s)he tried to do the same in German football and had 30% correct. I am in the same ballpark. Maybe I should try the ordinal regression too. On top of that; I never predict ties, this weekend had four games tied, and season 2011-2012 had 20% of the games tied. Clearly improvement is needed.
Last week flo2speak commented that (s)he tried to do the same in German football and had 30% correct. I am in the same ballpark. Maybe I should try the ordinal regression too. On top of that; I never predict ties, this weekend had four games tied, and season 2011-2012 had 20% of the games tied. Clearly improvement is needed.
Prediction code
To predict for a weekend I made a wrapper around fbpredict in which a data frame can be used. Not the most efficient code, but for this size of calculation that is not a problem. Promoted teams which are lacking data are also removed.
topred <- read.table(textConnection(“
‘Roda JC’ ‘FC Utrecht’
‘PEC Zwolle’ ‘FC Groningen’
‘RKC Waalwijk’ ‘VVV-Venlo’
‘Vitesse’ ‘Heracles Almelo’
‘NEC’ ‘Willem II’
‘ADO Den Haag’ ‘Ajax’
‘FC Twente’ ‘SC Heerenveen’
‘NAC Breda’ ‘AZ’
‘PSV’ ‘Feyenoord'”
),col.names=c(‘Off’,’Def’))
morepred <- function(topred) {
topred <- topred[topred[,1] %in% levels(model3$data$OffenseClub) &
topred[,2] %in% levels(model3$data$OffenseClub) ,]
ap <- lapply(1:nrow(topred),function(irow) {
fbp <- fbpredict(model3,as.character(topred[irow,1]),
as.character(topred[irow,2]))
sec2 <- fbp[[2]]
mydf <- data.frame(club1=topred[irow,1],
club2=topred[irow,2],
win1=sec2[1],
equal=sec2[2],
win2=sec2[3])
})
dc <- do.call(rbind,ap)
rownames(dc) <- 1:nrow(dc)
dc
}
morepred(topred)
To leave a comment for the author, please follow the link and comment on their blog: Wiekvoet.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.