Visualising the predictive distribution of a log-transformed linear model

[This article was first published on mages' blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Last week I presented visualisations of theoretical distributions that predict ice cream sales statistics based on linear and generalised linear models, which I introduced in an earlier post.
Theoretical distributions
Today I will take a closer look at the log-transformed linear model and use Stan/rstan, not only to model the sales statistics, but also to generate samples from the posterior predictive distribution.

The posterior predictive distribution is what I am most interested in. From the simulations I can get the 95% prediction interval, which will be slightly wider than the theoretical 95% interval, as it takes into account the parameter uncertainty as well.

Ok, first I take my log-transformed linear model of my earlier post and turn it into a Stan model, including a section to generate output from the posterior predictive distribution.

stanLogTransformed <-"
data {
int N;
vector[N] units;
vector[N] temp;
}
transformed data {
vector[N] log_units;
log_units = log(units);
}
parameters {
real alpha;
real beta;
real tau;
}
transformed parameters {
real sigma;
sigma = 1.0 / sqrt(tau);
}
model{
// Model
log_units ~ normal(alpha + beta * temp, sigma);
// Priors
alpha ~ normal(0.0, 1000.0);
beta ~ normal(0.0, 1000.0);
tau ~ gamma(0.001, 0.001);
}
generated quantities{
vector[N] units_pred;
for(i in 1:N)
units_pred[i] = exp(normal_rng(alpha + beta * temp[i], sigma));
}
"

After I have complied and run the model, I can extract the simulations and calculate various summary statistics. Furthermore, I use my parameters also to predict the median and mean, so that I can compare them against the sample statistics. Note again, that for the mean calculation of the log-normal distribution I have to take into account the variance as well.

temp <- c(11.9,14.2,15.2,16.4,17.2,18.1,18.5,19.4,22.1,22.6,23.4,25.1)
units <- c(185L,215L,332L,325L,408L,421L,406L,412L,522L,445L,544L,614L)
library(rstan)
stanmodel <- stan_model(model_code = stanLogTransformed)
fit <- sampling(stanmodel,
data = list(N=length(units),
units=units,
temp=temp),
iter = 1000, warmup=200)
stanoutput <- extract(fit)
## Extract generated posterior predictive quantities
Sims <- data.frame(stanoutput[["units_pred"]])
## Calculate summary statistics
SummarySims <- apply(Sims, 2, summary)
colnames(SummarySims) <- paste(icecream$temp,"ºC")
## Extract estimated parameters
(parms <- sapply(stanoutput[c("alpha", "beta", "sigma")], mean))
## alpha beta sigma
## 4.41543439 0.08186099 0.14909969
## Use parameters to predict median and mean
PredMedian <- exp(parms['alpha'] + parms['beta']*temp)
PredMean <- exp(parms['alpha'] + parms['beta']*temp + 0.5*parms['sigma']^2)
## Compare predictions based on parameters with simulation statistics
round(rbind(SummarySims, PredMedian, PredMean),1)
## 11.9 ºC 14.2 ºC 15.2 ºC 16.4 ºC 17.2 ºC 18.1 ºC
## Min. 116.1 101.6 123.6 140.6 164.0 209.5
## 1st Qu. 196.2 237.6 259.1 285.7 307.5 329.7
## Median 218.5 263.7 286.1 318.3 338.8 364.1
## Mean 221.9 267.2 290.2 321.0 342.6 369.3
## 3rd Qu. 243.5 292.2 317.5 350.4 372.7 403.4
## Max. 418.7 542.4 740.9 688.0 736.0 778.5
## PredMedian 218.5 264.1 286.8 316.5 338.1 364.1
## PredMean 220.9 267.0 289.9 320.0 341.8 368.1
## 18.5 ºC 19.4 ºC 22.1 ºC 22.6 ºC 23.4 ºC 25.1 ºC
## Min. 198.9 177.2 241.9 271.6 249.1 337.0
## 1st Qu. 339.6 368.7 456.5 476.9 508.7 577.9
## Median 375.2 405.2 506.2 527.4 562.9 645.2
## Mean 379.3 410.4 512.2 534.3 572.1 653.7
## 3rd Qu. 412.2 446.0 557.5 582.6 625.4 718.2
## Max. 678.1 984.9 944.6 1054.0 1282.0 1358.0
## PredMedian 376.3 405.3 506.2 527.4 563.4 648.0
## PredMean 380.4 409.6 511.6 533.1 569.5 655.0

Ok, that looks pretty reasonable, and also quite similar to my earlier output with glm. Using my plotting function of last week I can also create a nice 3D plot again.

meanPred <- apply(Sims, 2, mean)
LwPred <- apply(Sims, 2, quantile, 0.05)
UpPred <- apply(Sims, 2, quantile, 0.95)
xlim <- c(min(icecream$temp)*0.95, max(temp)*1.05)
ylim <- c(floor(min(units)*0.95),
ceiling(max(units)*1.05))
plotStan <- lapply(
seq(along=temp),
function(i){
stp = 251
d = density(Sims[, i, ], n=stp)
y = seq(ylim[1], ylim[2], length=stp)
z = approx(d$x, d$y, y)$y
z[is.na(z)] <- 0
x = rep(icecream$temp[i], stp)
z0 = rep(0, stp)
return(list(x=x, y=y, z0=z0, z=z))
}
)
# https://gist.github.com/mages/dedfb0d97082f0f0e0ab
glmModelPlot(x=temp, y=units,
xlim=xlim, ylim=ylim,
meanPred = meanPred, LwPred = LwPred,
UpPred = UpPred, plotData = plotStan,
main = "Log-transformed LM prediction")

Posterior predictive distributions

Just as expected, I note a slightly wider 95% interval range in the posterior predictive distributions compared to the theoretical distributions at the top.

Session Info

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils datasets 
[6] methods   base     

other attached packages:
[1] rstan_2.7.0-1 inline_0.3.14 Rcpp_0.12.0  

loaded via a namespace (and not attached):
[1] tools_3.2.2 codetools_0.2-14 stats4_3.2.2

To leave a comment for the author, please follow the link and comment on their blog: mages' blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)