Visualising the predictive distribution of a log-transformed linear model
[This article was first published on mages' blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last week I presented visualisations of theoretical distributions that predict ice cream sales statistics based on linear and generalised linear models, which I introduced in an earlier post.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
![]() |
Theoretical distributions |
The posterior predictive distribution is what I am most interested in. From the simulations I can get the 95% prediction interval, which will be slightly wider than the theoretical 95% interval, as it takes into account the parameter uncertainty as well.
Ok, first I take my log-transformed linear model of my earlier post and turn it into a Stan model, including a section to generate output from the posterior predictive distribution.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
stanLogTransformed <-" | |
data { | |
int N; | |
vector[N] units; | |
vector[N] temp; | |
} | |
transformed data { | |
vector[N] log_units; | |
log_units = log(units); | |
} | |
parameters { | |
real alpha; | |
real beta; | |
real tau; | |
} | |
transformed parameters { | |
real sigma; | |
sigma = 1.0 / sqrt(tau); | |
} | |
model{ | |
// Model | |
log_units ~ normal(alpha + beta * temp, sigma); | |
// Priors | |
alpha ~ normal(0.0, 1000.0); | |
beta ~ normal(0.0, 1000.0); | |
tau ~ gamma(0.001, 0.001); | |
} | |
generated quantities{ | |
vector[N] units_pred; | |
for(i in 1:N) | |
units_pred[i] = exp(normal_rng(alpha + beta * temp[i], sigma)); | |
} | |
" |
After I have complied and run the model, I can extract the simulations and calculate various summary statistics. Furthermore, I use my parameters also to predict the median and mean, so that I can compare them against the sample statistics. Note again, that for the mean calculation of the log-normal distribution I have to take into account the variance as well.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
temp <- c(11.9,14.2,15.2,16.4,17.2,18.1,18.5,19.4,22.1,22.6,23.4,25.1) | |
units <- c(185L,215L,332L,325L,408L,421L,406L,412L,522L,445L,544L,614L) | |
library(rstan) | |
stanmodel <- stan_model(model_code = stanLogTransformed) | |
fit <- sampling(stanmodel, | |
data = list(N=length(units), | |
units=units, | |
temp=temp), | |
iter = 1000, warmup=200) | |
stanoutput <- extract(fit) | |
## Extract generated posterior predictive quantities | |
Sims <- data.frame(stanoutput[["units_pred"]]) | |
## Calculate summary statistics | |
SummarySims <- apply(Sims, 2, summary) | |
colnames(SummarySims) <- paste(icecream$temp,"ºC") | |
## Extract estimated parameters | |
(parms <- sapply(stanoutput[c("alpha", "beta", "sigma")], mean)) | |
## alpha beta sigma | |
## 4.41543439 0.08186099 0.14909969 | |
## Use parameters to predict median and mean | |
PredMedian <- exp(parms['alpha'] + parms['beta']*temp) | |
PredMean <- exp(parms['alpha'] + parms['beta']*temp + 0.5*parms['sigma']^2) | |
## Compare predictions based on parameters with simulation statistics | |
round(rbind(SummarySims, PredMedian, PredMean),1) | |
## 11.9 ºC 14.2 ºC 15.2 ºC 16.4 ºC 17.2 ºC 18.1 ºC | |
## Min. 116.1 101.6 123.6 140.6 164.0 209.5 | |
## 1st Qu. 196.2 237.6 259.1 285.7 307.5 329.7 | |
## Median 218.5 263.7 286.1 318.3 338.8 364.1 | |
## Mean 221.9 267.2 290.2 321.0 342.6 369.3 | |
## 3rd Qu. 243.5 292.2 317.5 350.4 372.7 403.4 | |
## Max. 418.7 542.4 740.9 688.0 736.0 778.5 | |
## PredMedian 218.5 264.1 286.8 316.5 338.1 364.1 | |
## PredMean 220.9 267.0 289.9 320.0 341.8 368.1 | |
## 18.5 ºC 19.4 ºC 22.1 ºC 22.6 ºC 23.4 ºC 25.1 ºC | |
## Min. 198.9 177.2 241.9 271.6 249.1 337.0 | |
## 1st Qu. 339.6 368.7 456.5 476.9 508.7 577.9 | |
## Median 375.2 405.2 506.2 527.4 562.9 645.2 | |
## Mean 379.3 410.4 512.2 534.3 572.1 653.7 | |
## 3rd Qu. 412.2 446.0 557.5 582.6 625.4 718.2 | |
## Max. 678.1 984.9 944.6 1054.0 1282.0 1358.0 | |
## PredMedian 376.3 405.3 506.2 527.4 563.4 648.0 | |
## PredMean 380.4 409.6 511.6 533.1 569.5 655.0 |
Ok, that looks pretty reasonable, and also quite similar to my earlier output with
glm
. Using my plotting function of last week I can also create a nice 3D plot again.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
meanPred <- apply(Sims, 2, mean) | |
LwPred <- apply(Sims, 2, quantile, 0.05) | |
UpPred <- apply(Sims, 2, quantile, 0.95) | |
xlim <- c(min(icecream$temp)*0.95, max(temp)*1.05) | |
ylim <- c(floor(min(units)*0.95), | |
ceiling(max(units)*1.05)) | |
plotStan <- lapply( | |
seq(along=temp), | |
function(i){ | |
stp = 251 | |
d = density(Sims[, i, ], n=stp) | |
y = seq(ylim[1], ylim[2], length=stp) | |
z = approx(d$x, d$y, y)$y | |
z[is.na(z)] <- 0 | |
x = rep(icecream$temp[i], stp) | |
z0 = rep(0, stp) | |
return(list(x=x, y=y, z0=z0, z=z)) | |
} | |
) | |
# https://gist.github.com/mages/dedfb0d97082f0f0e0ab | |
glmModelPlot(x=temp, y=units, | |
xlim=xlim, ylim=ylim, | |
meanPred = meanPred, LwPred = LwPred, | |
UpPred = UpPred, plotData = plotStan, | |
main = "Log-transformed LM prediction") |
![]() |
Posterior predictive distributions |
Just as expected, I note a slightly wider 95% interval range in the posterior predictive distributions compared to the theoretical distributions at the top.
Session Info
R version 3.2.2 (2015-08-14) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.10.5 (Yosemite) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets [6] methods base other attached packages: [1] rstan_2.7.0-1 inline_0.3.14 Rcpp_0.12.0 loaded via a namespace (and not attached): [1] tools_3.2.2 codetools_0.2-14 stats4_3.2.2
To leave a comment for the author, please follow the link and comment on their blog: mages' blog.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.