Visualising theoretical distributions of GLMs
Two weeks ago I discussed various linear and generalised linear models in R using ice cream sales statistics. The data showed, not surprisingly, that more ice cream was sold at higher temperatures.
icecream <- data.frame(
  temp=c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1,
         18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
  units=c(185L, 215L, 332L, 325L, 408L, 421L,
          406L, 412L, 522L, 445L, 544L, 614L)
)

I used a linear model, a log-transformed linear model, a Poisson and a binomial generalised linear model to predict sales within and outside the range of the available data.
I came to the conclusion that I preferred the binomial model for two reasons: the model simulates only whole numbers, just like the observational data, and it has natural boundaries defined by zero sales and a market saturation level. Hence, it will predict neither negative nor infinite sales.
However, I didn't review any residual plots, or indeed anything beyond the mean predictions. So, let's do that in this post.
Arthur Charpentier reminded me via Twitter of an earlier post of his, where he looked at GLMs as well. Arthur used a nice 3D plot to visualise the models. I took his code and wrapped it into a function (it is not pretty, but it will do for now):
glmModelPlot <- function(x, y, xlim, ylim, meanPred, LwPred, UpPred,
                         plotData, main=NULL){
  ## Based on code by Arthur Charpentier:
  ## http://freakonometrics.hypotheses.org/9593
  par(mfrow=c(1,1))
  n <- 2
  N <- length(meanPred)
  zMax <- max(unlist(sapply(plotData, "[[", "z")))*1.5
  ## Set up an empty 3D canvas with the xy-plane at z = 0
  mat <- persp(xlim, ylim, matrix(0, n, n), main=main,
               zlim=c(0, zMax), theta=-30,
               ticktype="detailed", box=FALSE)
  ## Theoretical 5th-95th quantile band on the xy-plane
  C <- trans3d(x, UpPred, rep(0, N), mat)
  lines(C, lty=2)
  C <- trans3d(x, LwPred, rep(0, N), mat)
  lines(C, lty=2)
  C <- trans3d(c(x, rev(x)), c(UpPred, rev(LwPred)), rep(0, 2*N), mat)
  polygon(C, border=NA, col=adjustcolor("yellow", alpha.f = 0.5))
  ## Mean predictions and observations
  C <- trans3d(x, meanPred, rep(0, N), mat)
  lines(C, lwd=2, col="grey")
  C <- trans3d(x, y, rep(0, N), mat)
  points(C, lwd=2, col="#00526D")
  ## Density of the model distribution at each observation
  for(j in N:1){
    xp <- plotData[[j]]$x
    yp <- plotData[[j]]$y
    z0 <- plotData[[j]]$z0
    zp <- plotData[[j]]$z
    C <- trans3d(c(xp, xp), c(yp, rev(yp)), c(zp, z0), mat)
    polygon(C, border=NA, col="light blue", density=40)
    C <- trans3d(xp, yp, z0, mat)
    lines(C, lty=2)
    C <- trans3d(xp, yp, zp, mat)
    lines(C, col=adjustcolor("blue", alpha.f = 0.5))
  }
}
Linear model
I start by plotting the output of the linear model, a GLM assuming a Normal distribution with an identity link function. The 3D plot shows the observations as circles on the xy-plane, together with the mean predictions as a solid line. In yellow I show the theoretical residual interval between the 5th and 95th quantiles of the Normal distributions parametrised by the model output. Additionally, I present the density distribution of the model at each observation.
This is conceptually similar to a Q-Q plot, which compares the standardised residuals with the theoretical quantiles. In the case of a linear model, the prediction interval would shrink to the theoretical interval as the number of data points increases.
The 3D plot looks pretty, but I think it is genuinely helpful as well. Let me put it in context with the traditional residual plots below.
The residual plots identify observations 2, 5 and 10 as extreme points. Indeed, those points lie either on the edge of, or outside (point 10), the yellow area in the plot above; a quick numerical cross-check follows the code below.
xlim <- c(min(icecream$temp)*0.95, max(icecream$temp)*1.05)
ylim <- c(floor(min(icecream$units)*0.95),
          ceiling(max(icecream$units)*1.05))
lin.mod <- glm(units ~ temp, data=icecream,
               family=gaussian(link="identity"))
par(mfrow=c(2,2))
plot(lin.mod)
title(outer=TRUE, line = -1,
      main = list("Linear regression", cex=1.25,
                  col="black", font=2))
meanPred <- predict(lin.mod, type="response")
sdgig <- sqrt(summary(lin.mod)$dispersion)
## 5th and 95th quantiles of the fitted Normal distributions
UpPred <- qnorm(.95, meanPred, sdgig)
LwPred <- qnorm(.05, meanPred, sdgig)
plotData <- lapply(
  seq(along=icecream$temp),
  function(i){
    stp <- 251
    x = rep(icecream$temp[i], stp)
    y = seq(ylim[1], ylim[2], length=stp)
    z0 = rep(0, stp)
    z = dnorm(y, meanPred[i], sdgig)
    return(list(x=x, y=y, z0=z0, z=z))
  })
glmModelPlot(x = icecream$temp, y=icecream$units,
             xlim=xlim, ylim=ylim,
             meanPred = meanPred, LwPred = LwPred,
             UpPred = UpPred, plotData = plotData,
             main = "Linear regression")
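As a numerical cross-check of those extreme points, here is a small sketch of my own (not part of the original analysis), reusing the quantile vectors created above:

## Which observations fall outside the theoretical
## 5th-95th quantile band?
which(icecream$units < LwPred | icecream$units > UpPred)

If the band in the plot is read correctly, observation 10 should be the only one flagged here; observations 2 and 5 sit near the edge but inside the band.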
Log-transformed linear model
The log-transformed model shows an increasing variance around the mean as the temperature increases. This shouldn't come as a surprise, as it is a property of the log-normal distribution. The variance is given as \( \text{Var}(X) = \exp(2\mu + \sigma^2)\left(\exp(\sigma^2) - 1\right) = \text{E}[X]^2 \left(\exp(\sigma^2) - 1\right) \).

Again, the residual plots highlight points 2, 5 and 10 as extremes. I note also from the first residual plot that the log-transformed model over-predicts at colder temperatures.
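As an aside, a quick sanity check of that variance formula via simulation (a little sketch of my own, with arbitrary example parameters):

## Compare the empirical variance of log-normal draws
## with the theoretical formula above
set.seed(42)
mu <- 5; sigma <- 0.2
x <- rlnorm(1e6, meanlog = mu, sdlog = sigma)
var(x)                                     # empirical variance
exp(2*mu + sigma^2) * (exp(sigma^2) - 1)   # theoretical variance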
log.lin.mod <- glm(log(units) ~ temp, data=icecream,
                   family=gaussian(link="identity"))
par(mfrow=c(2,2))
plot(log.lin.mod)
title(outer=TRUE, line = -1,
      main = list("Log-transformed LM", cex=1.25,
                  col="black", font=2))
meanLogPred <- predict(log.lin.mod, type="response")
sdgig <- sqrt(summary(log.lin.mod)$dispersion)
## Back-transform to the original scale, including the
## log-normal bias correction exp(sigma^2/2)
meanPred <- exp(meanLogPred + 0.5 * sdgig^2)
UpPred <- qlnorm(.95, meanlog = meanLogPred, sdlog = sdgig)
LwPred <- qlnorm(.05, meanlog = meanLogPred, sdlog = sdgig)
plotData <- lapply(
  seq(along=icecream$temp),
  function(i){
    stp <- 251
    x = rep(icecream$temp[i], stp)
    y = seq(ylim[1], ylim[2], length=stp)
    z0 = rep(0, stp)
    z = dlnorm(y, meanlog = meanLogPred[i], sdlog = sdgig)
    return(list(x=x, y=y, z0=z0, z=z))
  })
glmModelPlot(x = icecream$temp, y=icecream$units,
             xlim=xlim, ylim=ylim,
             meanPred = meanPred, LwPred = LwPred,
             UpPred = UpPred, plotData = plotData,
             main = "Log-transformed LM")
Poisson (log) GLM
The Poisson model shows a narrower range between the 5th and 95th quantile than the previous models. My Poisson model didn't allow for over-dispersion; the variance equals the mean, so the width of the yellow band grows only with the square root of the predicted sales and looks nearly constant across the temperatures (a quick check of the dispersion assumption follows the code below). The same points 2, 5 and 10 are highlighted again as extreme, but now they are well outside the yellow area.
pois.mod <- glm(units ~ temp, data=icecream,
                family=poisson(link="log"))
par(mfrow=c(2,2))
plot(pois.mod)
title(outer=TRUE, line = -1,
      main = list("Poisson (log) GLM", cex=1.25,
                  col="black", font=2))
meanPred <- predict(pois.mod, type="response")
UpPred <- qpois(.95, meanPred)
LwPred <- qpois(.05, meanPred)
plotData <- lapply(
  seq(along=icecream$temp),
  function(i){
    ## The Poisson density is only defined on whole numbers
    y = seq(ylim[1], ylim[2])
    x = rep(icecream$temp[i], length(y))
    z0 = rep(0, length(y))
    z = dpois(y, meanPred[i])
    return(list(x=x, y=y, z0=z0, z=z))
  })
glmModelPlot(x = icecream$temp, y=icecream$units,
             xlim=xlim, ylim=ylim,
             meanPred = meanPred, LwPred = LwPred,
             UpPred = UpPred, plotData = plotData,
             main = "Poisson (log) GLM")
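To gauge whether the no-over-dispersion assumption is reasonable, one could re-fit the model with a quasi-Poisson family. This is a quick sketch of my own, not part of the original post; a dispersion estimate well above 1 would point towards over-dispersion:

## Re-fit with a quasi-Poisson family, which estimates a free
## dispersion parameter instead of fixing it at 1
quasi.mod <- glm(units ~ temp, data=icecream,
                 family=quasipoisson(link="log"))
summary(quasi.mod)$dispersion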
Binomial (logit) GLM
Here is my favourite model from my previous post again. The theoretical range between the 5th and 95th quantile has narrowed further. Data item 10 now looks really suspicious. Observations 2 and 5 are once again highlighted. Now I wonder, is someone eating my ice cream?
market.size <- 800
icecream$opportunity <- market.size - icecream$units
bin.glm <- glm(cbind(units, opportunity) ~ temp, data=icecream,
               family=binomial(link = "logit"))
par(mfrow=c(2,2))
plot(bin.glm)
title(outer=TRUE, line = -1,
      main = list("Binomial (logit) GLM", cex=1.25,
                  col="black", font=2))
meanProb <- predict(bin.glm, type="response")
## Scale the fitted probabilities back to units sold
meanPred <- meanProb*market.size
UpPred <- qbinom(.95, market.size, meanProb)
LwPred <- qbinom(.05, market.size, meanProb)
plotData <- lapply(
  seq(along=icecream$temp),
  function(i){
    y = ylim[1]:ylim[2]
    x = rep(icecream$temp[i], length(y))
    z0 = rep(0, length(y))
    z = dbinom(y, market.size, meanProb[i])
    return(list(x=x, y=y, z0=z0, z=z))
  })
glmModelPlot(x = icecream$temp, y=icecream$units,
             xlim=xlim, ylim=ylim,
             meanPred = meanPred, LwPred = LwPred,
             UpPred = UpPred, plotData = plotData,
             main = "Binomial (logit) GLM")
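To illustrate the point from the introduction, that the binomial model produces only whole numbers with natural boundaries, here is a small simulation sketch of my own using the fitted probabilities from above:

## Simulate sales at the observed temperatures; by construction
## the draws are integers between 0 and market.size
set.seed(1234)
simSales <- rbinom(length(meanProb), size = market.size,
                   prob = meanProb)
simSales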
Conclusions
I think Arthur's 3D plots help to visualise what GLMs are conceptually about. They illustrate the theoretical distribution around the predictions. Let me say this again: the yellow areas in the charts above show not a 90% prediction interval, but the region between the 5th and 95th quantiles of the theoretical residual distribution, assuming the models were true.

Session Info
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats  graphics  grDevices  utils  datasets  methods  base

loaded via a namespace (and not attached):
[1] tools_3.2.2  Rcpp_0.12.0