Prediction intervals for Generalized Additive Models (GAMs)
Finding prediction intervals (for future observations) is different from finding confidence intervals (for unknown population parameters). Here, I demonstrate one approach to doing the former.
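As a reminder of the distinction in the familiar Gaussian linear model: at a new covariate vector \(x\), both intervals are centred at \(x^\top \hat{\beta}\), but the prediction interval carries an extra term for the observation noise:

\[
\text{CI:}\ x^\top \hat{\beta} \pm t_{1-\alpha/2}\, \hat{\sigma} \sqrt{x^\top (X^\top X)^{-1} x},
\qquad
\text{PI:}\ x^\top \hat{\beta} \pm t_{1-\alpha/2}\, \hat{\sigma} \sqrt{1 + x^\top (X^\top X)^{-1} x}.
\]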
First we load the library and simulate some data:
```r
library(mgcv)
set.seed(1)
dat <- gamSim(eg = 1, n = 400, dist = "normal", scale = 2)
## Gu & Wahba 4 term additive model
```
The simulated data in `dat` contain the "truth" in the `f` variables:
```r
str(dat)
## 'data.frame': 400 obs. of  10 variables:
##  $ y : num  3.3407 -0.0758 10.6832 8.7291 14.9911 ...
##  $ x0: num  0.266 0.372 0.573 0.908 0.202 ...
##  $ x1: num  0.659 0.185 0.954 0.898 0.944 ...
##  $ x2: num  0.8587 0.0344 0.971 0.7451 0.2733 ...
##  $ x3: num  0.367 0.741 0.934 0.673 0.701 ...
##  $ f : num  5.51 3.58 8.69 8.75 16.19 ...
##  $ f0: num  1.481 1.841 1.948 0.569 1.184 ...
##  $ f1: num  3.74 1.45 6.74 6.02 6.6 ...
##  $ f2: num  2.98e-01 2.88e-01 8.61e-05 2.16 8.40 ...
##  $ f3: num  0 0 0 0 0 0 0 0 0 0 ...

fit_lm <- lm(f ~ f0 + f1 + f2 + f3, data = dat)
plot(dat$f, predict(fit_lm))
abline(0, 1)
```
The data that can be used to demonstrate GAMs are in the `y` and `x` variables:
```r
fit_gam <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
```
The idea is to simulate parameter vectors and, for each simulated parameter vector, draw new observations. Repeating this many times gives a sample of possible future observations, from which a prediction interval can be computed.
We first extract the parameter estimates and their covariance matrix:
```r
beta <- coef(fit_gam)
V <- vcov(fit_gam)
```
We then simulate a number of parameter vectors. This is done by assuming approximate normality and using the Cholesky trick: \[ \hat{\beta} + L \nu \quad \dot\sim \quad \text{MultivariateNormal}(\hat{\beta}, V) , \] where \(L\) is the Cholesky factor of the covariance matrix, \(V = L L^\top\), and \(\nu \sim N(0, I)\) is a vector of independent standard normals.
```r
num_beta_vecs <- 10000
Cv <- chol(V)
set.seed(1)
nus <- rnorm(num_beta_vecs * length(beta))
beta_sims <- beta + t(Cv) %*% matrix(nus, nrow = length(beta), ncol = num_beta_vecs)
dim(beta_sims)
## [1]    37 10000
```
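A quick way to convince ourselves that the simulation behaves as intended (my addition, not part of the original recipe): the empirical covariance of the simulated coefficient vectors should be close to \(V\):

```r
## Sanity check: empirical covariance of the draws vs. V.
max(abs(cov(t(beta_sims)) - V))  # should be small relative to the entries of V
```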
The standard deviations of the simulated coefficients should be close to the standard errors from the model fit:
```r
d_beta <- cbind(summary(fit_gam)$se, apply(beta_sims, 1, sd))
head(d_beta)
##                  [,1]      [,2]
## (Intercept) 0.1036817 0.1039225
## s(x0).1     0.3977579 0.3988932
## s(x0).2     0.7249847 0.7212706
## s(x0).3     0.1967577 0.1982036
## s(x0).4     0.4260063 0.4265382
## s(x0).5     0.1450504 0.1444811

plot(d_beta[, 1], d_beta[, 2], xlab = "Calculated SE", ylab = "Simulated SE")
abline(0, 1)
```
We can now proceed to simulating observations. We first need covariate values to predict at; here we randomly sample rows of the observed data with replacement:
```r
n_obs <- 100
sim_idx <- sample.int(nrow(dat), size = n_obs, replace = TRUE)
sim_dat <- dat[sim_idx, c("x0", "x1", "x2", "x3")]
dim(sim_dat)
## [1] 100   4

# sim_dat <- dat[, c("x0", "x1", "x2", "x3")]
# dim(sim_dat)
```
Instead of sampling, it is also possible to use e.g. a grid of values:
```r
if (FALSE) { # Intentionally not run
  xs_range <- apply(dat[, c("x0", "x1", "x2", "x3")], 2, range)
  xs_seq <- apply(xs_range, 2, function(x) seq(from = x[1], to = x[2], length.out = 10))
  xs_seq_lst <- as.list(as.data.frame(xs_seq))
  sim_dat <- do.call(expand.grid, list(xs_seq_lst, KEEP.OUT.ATTRS = FALSE))
  dim(sim_dat)
}
```
Now we can calculate the linear predictors:
```r
covar_sim <- predict(fit_gam, newdata = sim_dat, type = "lpmatrix")
linpred_sim <- covar_sim %*% beta_sims
```
Now, the inverse link function must be applied to obtain the expected values. If, for example, `family = Gamma(link = log)` had been used in `gam()`, then `exp()` would be needed. We used the Gaussian family with the identity link, so the inverse link is the identity:
```r
invlink <- function(x) x
exp_val_sim <- invlink(linpred_sim)
```
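As an aside, the inverse link does not have to be hard-coded; it can be taken straight from the fitted family object, which makes the code work unchanged for non-identity links (a small sketch):

```r
## Equivalent, but link-agnostic: use the inverse link stored in the fit.
invlink <- family(fit_gam)$linkinv
exp_val_sim <- invlink(linpred_sim)
```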
Now, the response distribution of the family itself must be used. If Gamma had been used, then `rgamma()` (properly parameterised) would be needed. We just used Gaussian, so:
```r
y_sim <- matrix(rnorm(n = prod(dim(exp_val_sim)),
                      mean = exp_val_sim,
                      # summary(fit_gam)$scale estimates the residual
                      # variance, so take sqrt() for rnorm()'s sd
                      sd = sqrt(summary(fit_gam)$scale)),
                nrow = nrow(exp_val_sim),
                ncol = ncol(exp_val_sim))
dim(y_sim)
## [1]   100 10000
```
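For comparison, here is a hedged sketch of what the corresponding draw could look like for a Gamma family (not run in this post). It assumes `summary(fit_gam)$scale` holds the estimated dispersion \(\varphi\) and uses the standard parameterisation with shape \(= 1/\varphi\) and rate \(= \text{shape}/\mu\), which gives mean \(\mu\) and variance \(\varphi \mu^2\):

```r
if (FALSE) { # Hypothetical Gamma-family version; not run
  phi <- summary(fit_gam)$scale  # estimated dispersion
  shape <- 1 / phi
  y_sim <- matrix(rgamma(n = prod(dim(exp_val_sim)),
                         shape = shape,
                         rate = shape / exp_val_sim),
                  nrow = nrow(exp_val_sim),
                  ncol = ncol(exp_val_sim))
}
```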
So now entry \((i, j)\) of `y_sim` contains the \(i\)'th simulated observation's simulated response under parameter vector \(j\) (column \(j\) of `beta_sims`).
We can then find a 95% prediction interval:
```r
pred_int_sim <- apply(y_sim, 1, quantile, prob = c(0.025, 0.975))
dim(pred_int_sim)
## [1]   2 100
```
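As a quick, optional sanity check (my addition): since `sim_dat` was sampled from the observed rows, we can ask how often the observed response at each sampled row falls inside its interval; the empirical coverage should be roughly 95%:

```r
## Rough coverage check: fraction of observed responses inside their
## simulated 95% prediction interval (expect a value near 0.95).
y_obs <- dat$y[sim_idx]
mean(y_obs >= pred_int_sim[1, ] & y_obs <= pred_int_sim[2, ])
```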
And e.g. plot it against `x0`:
```r
plot(y ~ x0, data = dat)
sim_dat_x0ord <- order(sim_dat$x0)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[1L, sim_dat_x0ord], col = 2, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[2L, sim_dat_x0ord], col = 2, lwd = 2)
```
Confidence intervals (for the expected value, not for new observations) can be added for comparison:
```r
plot(y ~ x0, data = dat)
sim_dat_x0ord <- order(sim_dat$x0)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[1L, sim_dat_x0ord], col = 2, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[2L, sim_dat_x0ord], col = 2, lwd = 2)

sim_dat_pred <- predict(fit_gam, newdata = sim_dat, se = TRUE)
lines(sim_dat$x0[sim_dat_x0ord], sim_dat_pred$fit[sim_dat_x0ord], col = 1, lwd = 2)
upr_ci <- invlink(sim_dat_pred$fit + 2 * sim_dat_pred$se.fit)
lwr_ci <- invlink(sim_dat_pred$fit - 2 * sim_dat_pred$se.fit)
lines(sim_dat$x0[sim_dat_x0ord], upr_ci[sim_dat_x0ord], col = 3, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], lwr_ci[sim_dat_x0ord], col = 3, lwd = 2)
```
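To make the recipe reusable, the steps above can be collected into a small helper. A minimal sketch, assuming a Gaussian `gam()` fit; the function name `sim_pred_interval` is mine, not part of mgcv:

```r
## Minimal sketch wrapping the steps above for a Gaussian gam() fit.
sim_pred_interval <- function(fit, newdata, n_sims = 10000, level = 0.95) {
  beta <- coef(fit)
  Cv <- chol(vcov(fit))
  nus <- matrix(rnorm(n_sims * length(beta)), nrow = length(beta))
  beta_sims <- beta + t(Cv) %*% nus            # approximate coefficient draws
  linpred <- predict(fit, newdata = newdata, type = "lpmatrix") %*% beta_sims
  mu <- family(fit)$linkinv(linpred)           # expected values
  y_sim <- matrix(rnorm(length(mu), mean = mu,
                        sd = sqrt(summary(fit)$scale)),  # scale is the variance
                  nrow = nrow(mu))
  alpha <- (1 - level) / 2
  t(apply(y_sim, 1, quantile, probs = c(alpha, 1 - alpha)))
}

## Usage, e.g.:
## pred_int <- sim_pred_interval(fit_gam, sim_dat)
```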