Checking residual distributions for non-normal GLMs

Bluecology blog

4 years ago

[This article was first published on Bluecology blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Checking residual distributions for non-normal GLMs

Quantile-quantile plots

If you are fitting a linear regression with Gaussian (normally distributed) errors, then one of the standard checks is to make sure the residuals are approximately normally distributed.

It is a good idea to do these checks for non-normal GLMs too, to make sure your residuals approximate the model’s assumption.

Here I explain how to create quantile-quantile plots for non-normal data, using an example of fitting a GLM using Student-t distributed errors. Such models can be appropriate when the residuals are overdispersed.

First let’s create some data. We will make a linear predictor (ie the true regression line) eta and then simulate some data by adding residuals. We will simulate two data-sets that have the same linear predictor, but the first will have normally distributed errors and the second will have t distributed errors:

n <- 100
phi <- 0.85
mu <- 0.5
set.seed(23)
x <- rnorm(n)

eta <- mu + phi * x
nu <- 2.5
tau <- 3

y_tdist <- eta + (rt(n, df=nu)/sqrt(tau))
y_normdist <- eta + rnorm(n, sd = 1/sqrt(tau))

plot(x, y_tdist)
points(x, y_normdist, col = "red", pch = 16, cex = 0.8)
legend("topleft", legend = c("t-distributed errors", "normally distributed errors"), pch = c(1,16), col = c("black", "red"))

Notice how the t-distributed data are more spread out. The df parameter, here named nu=2.5 controls how dispersed the data are. Lower values will give data that are more dispersed, large values approach a normal distribution.

Now let’s fit a Gaussian glm (just a linear regression really) to both these data-sets

m1_norm <- glm(y_normdist ~ x)
m1_t <-  glm(y_tdist ~ x)

We should check whether the two models meet the normal assumption, using the standard ‘qq’ (quantile-quantile) plot:

par(mfrow = c(1,2))
plot(m1_norm, 2)
plot(m1_t, 2)

These plots compare the theoretical quantiles to the actual quantiles of the residuals. If the points fall on the straight line then the theoretical and realised are very similar, and the assumption is met. Clearly this is not the case for the second model, which is overdispersed.

We know it is overdispersed because the theoretical quantiles are much smaller than the actual at the tails (notice how the ends down then up).

The p-values (or CIs if you use them) for m1_t are therefore likely biased and too narrow, leading potentially to type I errors (us saying that x affects y, which in fact it does not). In this case we know we haven’t made a type I error, because we made up the data. However, if you were using real data you wouldn’t be so sure.

Doing our own quantile-quantile plot

To better understand the QQ plot it helps to generate it yourself, rather than using R’s automatic checks.

First we calculate the model residuals (in plot(m1_t) R did this internally):

m1_t_resid <- y_tdist - predict(m1_t)

Then we can plot the quantiles for the residuals against theoretical quantiles generated using qnorm. Below we also plot the original QQ plot from above, so you can see that our version is the same as R’s automatic one:

par(mfrow = c(1,2))
qqplot(qnorm(ppoints(n), sd = 1), m1_t_resid)
qqline(m1_t_resid, lty = 2)

plot(m1_t,2)

I added the qqline for comparative purposes. It just puts a line through the 25th and 75th quantiles.

QQ plot for a non-normal GLM

Now we have learned how to write our own custom for a QQ plot, we can use it to check other types of non-normal data.

Here we will fit a GLM to the y_tdist data using student-t distributed errors. I do this using the Bayesian package INLA.

library(INLA)

data <- list(y=y_tdist, x = x)
mod_tdist <- inla(y ~ x, family="T", data=data,
    control.predictor = list(compute = TRUE),
    control.family =
        list(
            hyper =
                list(prec = list(prior="loggamma",param=c(1,0.5)),
                    dof = list(prior="loggamma",param=c(1,0.5))
                    )
            )
        )

The family ="T" command tells INLA to use the t-distribution, rather than the Normal distribution. Note also I have specified the priors using the control.family command. This is best practice. We need a prior for the precision (1/variance) and a prior for the dof (= degrees of freedom, which has to be >2 in INLA).

It is sometimes help to visualise the priors, so we can check too see they look sensible. Here we visualise the prior for the dof, (which in INLA has a min value of 2):

xgamma <- seq(0.01, 10, length.out = 100)
plot(xgamma+2, dgamma(xgamma, 1, 0.5), type = 'l', xlab = "Quantile", ylab = "Density")

We don’t really expect values much greater than 10, so this prior makes sense. If we used an old-school prior that was flat in 2-1000 we might get issues with model fitting.

Now enough about priors. Let’s look at the estimated coefficients:

mod_tdist$summary.fixed

##                  mean         sd 0.025quant  0.5quant 0.975quant      mode
## (Intercept) 0.5324814 0.07927198  0.3773399 0.5321649  0.6891779 0.5315490
## x           0.7229362 0.08301006  0.5565746 0.7239544  0.8835630 0.7259817
##                      kld
## (Intercept) 3.067485e-12
## x           6.557565e-12

Good the CIs contain our true values, and the mean is close to our true value too. What about the hyper-parameters (the precision and DF)? We need to get INLA to run some more calucations to get accurate estimates of these:

h_tdist <- inla.hyperpar(mod_tdist)
h_tdist$summary.hyperpar[,3:5]

##                                          0.025quant  0.5quant 0.975quant
## precision for the student-t observations  0.2663364 0.6293265   1.163440
## degrees of freedom for student-t          2.2404966 2.7396391   4.459057

The estimate for the DF might be a ways off the mark. That is ok, we expect that, you need lots of really good data to get accurate estimates of hyper-parameters.

Now, let’s use our skills in creating QQ plots to make QQ plot using theoretical quantiles from the t distribution.

First step is to extract INLA’s predictions of the data, so we can calculate residuals

preds <- mod_tdist$summary.fitted.values
resids <- y_tdist - preds[,4]

Next step is to extract the marginal estimates of the DF and precision to use when generating our QQ plot (the quantiles will change with the DF):

tau_est <- h_tdist$summary.hyperpar[1,4]
nu_est <- h_tdist$summary.hyperpar[2,4]

Now we can use qt() to generate theoretical quantiles and the residuals for our realised quantiles:

qqplot(qt(ppoints(n), df = nu_est), resids * sqrt(tau_est),
    xlab = "Theoretical quantile", ylab = "residuals")
qqline(resids * sqrt(tau_est), lty = 2)

Note that I multiply the residuals by the sqrt of the precision estimate. This is how INLA fits a t-distributed GLM. I do the same for the qqline.

Our residuals are now falling much closer to the line. The model is doing a much better job of fitting the data. You could also calculate the WAIC for this model and a Gaussian one, to compare the fits. The t-distributed GLM should have a lower WAIC (better fit).

We can now be confident that our CIs are accurate.

To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.