Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
What does under or over-dispersion look like?
One issue that often comes up in analysis with linear models is under- or over-dispersion.
For instance, if you are fitting a linear regression model, you are assuming the residuals (the differences between the data points and the line of best fit) are normally distributed. Residuals are said to be over-dispersed if they are ‘fatter’ in the tails than a normal bell curve. Conversely, if the residuals are too peaked in the middle, they are said to be under-dispersed.
Under- or over-dispersion is an issue because it can bias the calculation of p-values. Over-dispersion is often of particular concern because it may cause p-values that are biased too low. If you are taking 0.05 as your significance threshold, over-dispersion means you would reject a true null hypothesis more than 5% of the time. That is, you are more likely to falsely reject the null hypothesis.
If there is under-dispersion, the opposite is true: your p-values may be too high. P-values that are biased high can be a problem because they give you low power to detect real effects.
So, let’s take a look at under- and over-dispersion. First up, here is a normal bell curve:
x <- seq(-4, 4, length.out = 100)
p <- dnorm(x, mean = 0, sd = 1)
plot(x, p, type = 'l', lwd = 2, col = "red", xlab = "Residual", ylab = "Density")
Now, we can add an over-dispersed curve to that. Here is one, calculated using the Student t distribution:
p_t <- dt(x, df = 1.4, ncp = 0)
plot(x, p, type = 'l', lwd = 2, col = "red", xlab = "Residual", ylab = "Density")
lines(x, p_t, col = "darkblue", lwd = 2)
No matter how large we made the standard deviation of the normal curve, we would never get it to match the Student t.
Finally, let’s draw an under-dispersed distribution, using the Laplace distribution:
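One way to check this numerically is to compare the two densities far out in the tails. The sketch below is an illustration rather than part of the original analysis; the evaluation points and the standard deviations tried are arbitrary choices. It prints the ratio of the t density to the normal density, so values above 1 mean the t is higher at that point.

```r
# Ratio of the Student t density (df = 1.4) to normal densities with
# increasingly large standard deviations, evaluated out in the tails.
# The points in x_tail and the values in sds are arbitrary illustrative choices.
x_tail <- c(5, 15, 30)
sds <- c(1, 2, 5)

p_t_tail <- dt(x_tail, df = 1.4)

# One column per standard deviation, one row per evaluation point
ratio <- sapply(sds, function(s) p_t_tail / dnorm(x_tail, mean = 0, sd = s))
dimnames(ratio) <- list(paste0("x=", x_tail), paste0("sd=", sds))

# Values above 1 mean the t density is higher than the normal at that point
print(signif(ratio, 3))
```

For the widest normal tried here, the ratio dips below 1 at the nearest point, but by x = 30 every ratio is above 1: widening the normal only postpones the point where the t takes over.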
library(rmutil)
p_l <- dlaplace(x, m = 0, s = 0.4)
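If rmutil is not available, the Laplace density is simple enough to write by hand. The sketch below assumes the usual location-scale form, exp(-|x - m| / s) / (2s), which should correspond to the `m` and `s` arguments used above, though that is an assumption worth checking against the package documentation. `dlaplace_manual` is a hypothetical helper name, not a function from any package. The block also checks the variance, 2s², which is why a scale of 0.4 gives less spread than a standard normal.

```r
# Hand-written Laplace density (hypothetical helper, not from a package).
# Assumes the location-scale form exp(-|x - m| / s) / (2 * s).
dlaplace_manual <- function(x, m = 0, s = 1) {
  exp(-abs(x - m) / s) / (2 * s)
}

x <- seq(-4, 4, length.out = 100)
p_l_manual <- dlaplace_manual(x, m = 0, s = 0.4)

# The density should integrate to 1
area <- integrate(dlaplace_manual, -Inf, Inf, m = 0, s = 0.4)$value

# Variance of a Laplace is 2 * s^2, here 0.32, compared with 1 for the
# standard normal
v <- integrate(function(z) z^2 * dlaplace_manual(z, m = 0, s = 0.4),
               -Inf, Inf)$value

print(c(area = area, variance = v))
```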
Now plot them all together:
plot(x, p_l, xlim = c(-4, 4), lwd = 2, xlab = "Residuals", main = "", type = "l")
lines(x, p, col = "red", lwd = 2, lty = 2)
lines(x, p_t, col = "darkblue", lwd = 2)
legend("topright",
       legend = c("normal", "under-dispersed", "over-dispersed"),
       lty = c(2,1,1),
       col = c("red", "black", "darkblue"),
       lwd = 2)
Here’s what they look like if we sample measures from each distribution and plot them on a normal QQ plot, which you may be familiar with. It is one of the standard checks of model residuals:
set.seed(1997)
par(mfrow = c(1,3))
qqnorm(rnorm(1000, mean = 0, sd = 1), main = "normal")
abline(0,1)
qqnorm(rt(1000, df = 1.4, ncp = 0), main = "over-dispersed")
abline(0,1)
qqnorm(rlaplace(1000, m = 0, s = 0.4), main = "under-dispersed")
abline(0,1)
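The same comparison can be made numerically by lining up sample quantiles against the quantiles of a standard normal, which is what each QQ plot draws. This is a rough sketch using base R only: the Laplace draws come from the difference of two exponential variables (a standard construction) rather than from rmutil, and `rlaplace_base` is a hypothetical helper name. The seed and the probabilities chosen are arbitrary.

```r
# Laplace draws as the difference of two exponentials with mean s
# (hypothetical helper, base R only)
rlaplace_base <- function(n, m = 0, s = 1) {
  m + rexp(n, rate = 1 / s) - rexp(n, rate = 1 / s)
}

set.seed(42)
n <- 1000
samples <- list(
  normal = rnorm(n, mean = 0, sd = 1),
  over_dispersed = rt(n, df = 1.4),
  under_dispersed = rlaplace_base(n, m = 0, s = 0.4)
)

probs <- c(0.01, 0.25, 0.75, 0.99)

# One column per sample, one row per probability
q_sample <- sapply(samples, quantile, probs = probs)
q_table <- cbind(theoretical_normal = qnorm(probs), q_sample)

print(round(q_table, 2))
```

In the 1% and 99% rows, the over-dispersed sample should sit well outside the normal quantiles and the under-dispersed sample well inside them, which matches the departures from the 1:1 line in the QQ plots.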
Playing around with the parameters for each distribution should help your understanding. For instance, if you set the scale for the Laplace distribution to a larger number, it will become over-dispersed, because it gets fatter tails than the normal (despite being more peaked at its mode).
Under-dispersion is more common than you might think. It can occur when you have a censoring process. For instance, perhaps your machine can only measure lengths to a certain precision, and any distance that is too small gets rounded down to zero.
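As a starting point for that kind of exploration, the sketch below compares tail probabilities for a few Laplace scales against the standard normal. It uses the closed form P(|X| > k) = exp(-k / s) for a Laplace centred at zero, so rmutil is not needed; the scales and the cut-off k = 3 are arbitrary choices for illustration.

```r
# Probability of landing more than k units from zero:
# standard normal versus Laplace distributions with different scales.
k <- 3
scales <- c(0.4, 0.7, 1, 2)  # arbitrary illustrative values

tail_normal <- 2 * pnorm(-k)       # P(|Z| > k) for a standard normal
tail_laplace <- exp(-k / scales)   # P(|X| > k) for Laplace(0, s)

res <- data.frame(
  scale = scales,
  variance = 2 * scales^2,         # Laplace variance is 2 * s^2
  tail_laplace = tail_laplace,
  heavier_than_normal = tail_laplace > tail_normal
)
print(res)
```

With a scale of 0.4 the Laplace puts less probability beyond 3 than the normal does, while the larger scales put more there.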