[This article was first published on Wiekvoet, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The football season has started in Netherlands, so I went and had a look at last year’s scores. I did not find downloadable data, at http://www.eredivisiestats.nl/wedstrijden.php I could copy last season’s data and paste into a spreadsheet. The default layout did not look good to me for statistical analysis (first lines shown below), so first part is reformatting to get how many goals a club did against another. This means that each game becomes two lines and an indicator variable is used for who was playing at home.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Datum | Thuisclub (home) | Uitclub (away) | Thuisscore | Uitscore |
2011-08-05 | Excelsior | Feyenoord | 0 | 2 |
2011-08-06 | RKC Waalwijk | Heracles Almelo | 2 | 2 |
2011-08-06 | Roda JC | FC Groningen | 2 | 1 |
# read data
old <- read.csv(‘seizoen1112.csv’,stringsAsFactors=FALSE)
# reformat lengthwise
StartData <- with(old,data.frame(
OffenseClub = c(Thuisclub,Uitclub),
DefenseClub = c(Uitclub,Thuisclub),
Goals = c(Thuisscore,Uitscore),
OffThuis = rep(c(1,0),each=length(Datum))
))
#remove space as first character in names
levels(StartData$OffenseClub) <- sub(‘^ ‘,”,levels(StartData$OffenseClub))
levels(StartData$DefenseClub) <- sub(‘^ ‘,”,levels(StartData$DefenseClub))
# check clubs have same name and ordening in both factors.
all(levels(StartData$OffenseClub)==levels(StartData$DefenseClub))
Distribution of goals
To get to know the data the distribution of goals is most interesting:
xt <- xtabs(~Goals ,data=StartData)
plot(xt)
The first question is; is this approximately Poisson distributed? MASS has some functions.
library(MASS)
(fd <- fitdistr(StartData$Goals, densfun=”Poisson”))
lambda
1.62908497
(0.05159364)
round(dpois(0:8,coef(fd))*sum(xt))
plot(xt,ylim=c(0,200),ylab=’Frequency of goals’)
points(x=0:8,y=dpois(0:8,coef(fd))*sum(xt),col=’green’)
It is also possible to do the same calculations with the fitdistrplus package and get some extra information. Two different algorithms give essentially the same information.
library(fitdistrplus)
(fd2 <- fitdist(StartData$Goals, distr=’pois’,method=’mle’))
Fitting of the distribution ‘ pois ‘ by maximum likelihood
Parameters:
estimate Std. Error
lambda 1.629085 0.05159362
fitdist(StartData$Goals, distr=’pois’,method=’mme’)
Fitting of the distribution ‘ pois ‘ by matching moments
Parameters:
estimate
lambda 1.629085
plot(fd2)
This is nice. I get the same information as my own plot, but also a CDF and have to do less. If only I knew all packages in R, I would not have to program anything. But there are more extra’s in fitdistrplus. Goodness of fit and bootstrap confidence interval.
gofstat(fd2,print.test=TRUE)
Chi-squared statistic: 18.46365
Degree of freedom of the Chi-squared distribution: 4
Chi-squared p-value: 0.001001433
bd <- bootdist(fd2)
summary(bd)
Parametric bootstrap medians and 95% percentile CI
Median 2.5% 97.5%
1.627451 1.527803 1.728736
plot(density(bd$estim$lambda))
Finally, if desired the same can be done the Bayesian way via Jags
library(R2jags)
calcData <- with(StartData,list(N=length(Goals),Goals=Goals))
mijnmodel <- function() {
for (i in 1:N) {
Goals[i] ~ dpois(mu)
}
mu ~ dunif(0,3)
}
write.model(mijnmodel,’model.file’)
jags.inits <- function() {
list(
mu = runif(1,0,3)
)
}
jags.params=c(‘mu’)
jags(data=calcData, inits=jags.inits, jags.params,model.file=’model.file’)
Inference for Bugs model at “model.file”, fit using jags,
3 chains, each with 2000 iterations (first 1000 discarded)
n.sims = 3000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
mu 1.633 0.051 1.533 1.598 1.632 1.667 1.732 1.001 2100
devianc 2048.287 1.338 2047.308 2047.403 2047.757 2048.671 2052.118 1.006 1500
For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
DIC info (using the rule, pD = var(deviance)/2)
pD = 0.9 and DIC = 2049.2
DIC is an estimate of expected predictive error (lower deviance is better).
This leads to essentially the same result as before.
Conclusion
Three ways to get parameters of a distribution are used. The most obvious way, via MASS gives parameters and standard deviation. A dedicated package gives all that plus significance and distribution. The Bayesian way gives the distribution too and uses most code by far. The estimates do not differ in any relevant way.
To leave a comment for the author, please follow the link and comment on their blog: Wiekvoet.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.