Heteroscedasticity in Regression — It Matters!
R’s main linear and nonlinear regression functions, lm() and nls(), report standard errors for parameter estimates under the assumption of homoscedasticity, a fancy word for a situation that rarely occurs in practice. The assumption is that the (conditional) variance of the response variable is the same at any set of values of the predictor variables.
Take for instance my favorite introductory example, predicting human weight from height. (It is unlikely we’d need to make such a prediction, but it serves as a quick and easy illustration.) The homoscedasticity assumption says that tall people have no more variation in weight than short people, certainly not true.
So, confidence intervals and p-values calculated by lm() and nls() are generally invalid. In this blog post, I’ll show how to remedy this problem — including some new code that I’ll provide here — and give an example showing just how far wrong one can be without a remedy.
In the linear case, the remedy is simple, but probably not widely known. Years ago, researchers such as Eicker and White derived large-sample theory for the heteroscedastic case, and it is coded in CRAN's car and sandwich packages in the functions hccm() and vcovHC(), respectively. I'll use the latter here, as its name is similar to that of R's vcov() function.
R's vcov() function takes the result of a call to lm() or nls() and returns the estimated covariance matrix of the estimated parameter vector. The standard errors of the estimated parameters are thus the square roots of the diagonal elements of that matrix. Let's check that, using the example from the online help for lm():
> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
> group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
> weight <- c(ctl, trt)
> lm.D9 <- lm(weight ~ group)
> summary(lm.D9)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.2202  22.850 9.55e-15
groupTrt     -0.3710     0.3114  -1.191    0.249
> sqrt(diag(vcov(lm.D9)))
(Intercept)    groupTrt
  0.2202177   0.3114349
Sure enough, the standard errors match. (Were you doubting it? 🙂 )
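To get the Eicker-White standard errors for this linear model, one would simply substitute vcovHC() for vcov(); a minimal sketch (output not shown, since the robust standard errors will generally differ from those in summary()):

library(sandwich)
sqrt(diag(vcovHC(lm.D9)))   # Eicker-White (heteroscedasticity-consistent) standard errors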
So, the linear case, i.e. lm(), is taken care of. But unfortunately, there doesn’t seem to be anything for the nonlinear case, nls(). The following code solves that problem:
library(minpack.lm)
library(sandwich)

# Eicker-White covariance matrix for the output of nlsLM()
nlsvcovhc <- function(nlslmout) {
   b <- coef(nlslmout)          # estimated parameter vector
   m <- nlslmout$m              # the nls model object
   resid <- m$resid()           # residuals at the solution
   hmat <- m$gradient()         # gradient matrix at the solution
   fakex <- hmat                # gradient plays the role of the predictor matrix
   fakey <- resid + hmat %*% b  # "response" of the local linear approximation
   lmout <- lm(fakey ~ fakex - 1)  # no-intercept fit; coefficients reproduce b
   vcovHC(lmout)                # Eicker-White covariance matrix
}
(Names such as fakex are explained in the last paragraph below.) You can download this code here. The function is applied to the output of nlsLM(), a modified version of nls() from the package minpack.lm that has better convergence behavior.
Here's an example, using the enzyme data set vmkmki from the CRAN package nlstools. (I removed the last 12 observations, which appear to be anomalous for reasons I won't go into here.) The nonlinear model is the one given in the online help for vmkmki.
> regftn <- function(t,u,b1,b2,b3) b1 * t / (t + b2 * (1 + u/b3))
> bstart <- list(b1=1, b2=1, b3=1)
> z <- nls(v ~ regftn(S,I,b1,b2,b3), data=vmkmki, start=bstart)
> z
Nonlinear regression model
  model: v ~ regftn(S, I, b1, b2, b3)
   data: vmkmki
   b1    b2    b3
18.06 15.21 22.28
...
> vcov(z)
          b1        b2         b3
b1 0.4786776  1.374961  0.8930431
b2 1.3749612  7.568837 11.1332821
b3 0.8930431 11.133282 29.1363366
Compare that to using Eicker-White:
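The post doesn't show the nlsLM() refit that produces the new z used below; assuming the same formula, data, and starting values as before, it would presumably be:

library(minpack.lm)
z <- nlsLM(v ~ regftn(S,I,b1,b2,b3), data=vmkmki, start=bstart)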
> nlsvcovhc(z)
          fakex1    fakex2    fakex3
fakex1 0.4708209  1.706591  2.410712
fakex2 1.7065910 10.394496 20.314688
fakex3 2.4107117 20.314688 53.086958
This is rather startling! Except for the estimated variance of the first parameter estimate, the estimated variances and covariances from Eicker-White are much larger than what nls() found under the assumption of homoscedasticity. In other words, nls() was way off the mark. In fact, this simulation code shows that, due to heteroscedasticity, the standard errors reported by nls() can lead, for instance, to confidence intervals having only 60% coverage probability rather than the nominal 90%, while the above code fixes the problem.
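The linked simulation code isn't reproduced in this post. As an illustration of the kind of experiment involved, here is a minimal sketch of my own (the mean function, sample size, and error structure are my assumptions, not the author's):

# Sketch of a coverage simulation: generate data whose standard deviation
# grows with the mean, fit by nls(), and estimate the true coverage of a
# nominal 90% confidence interval for b1.
sim <- function(nreps=1000, n=250, b1=2, b2=1) {
   cover <- 0; done <- 0
   for (i in 1:nreps) {
      x <- runif(n, 1, 10)
      ey <- b1 * x / (x + b2)            # true regression function
      y <- ey + rnorm(n, sd=0.2*ey)      # heteroscedastic: sd proportional to mean
      z <- tryCatch(nls(y ~ b1*x/(x+b2), start=list(b1=1,b2=1)),
                    error=function(e) NULL)
      if (is.null(z)) next               # skip rare convergence failures
      done <- done + 1
      se <- sqrt(diag(vcov(z)))["b1"]
      ci <- coef(z)["b1"] + c(-1,1) * 1.645 * se   # nominal 90% interval
      if (ci[1] < b1 && b1 < ci[2]) cover <- cover + 1
   }
   cover/done   # empirical coverage; substantially below 0.90 signals trouble
}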
How does that code work? (Optional reading.) Most nonlinear least-squares procedures use a local linear approximation, such as in the Gauss-Newton algorithm. Computationally, this puts us in a "fake" lm() setting: the gradient matrix plays the role of the predictor matrix, and resid + gradient %*% b plays the role of the response. By the delta method, the large-sample theory for the linear case carries over, so we can apply vcovHC() to that fake fit.
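To see the linearization concretely, note that at the least-squares solution the gradient columns are orthogonal to the residuals, so the fake regression reproduces the nonlinear estimates exactly. A quick check (a sketch, assuming z is the nlsLM() fit from above):

m <- z$m
b <- coef(z)
g <- m$gradient()                # gradient matrix at the solution
fakey <- m$resid() + g %*% b     # the "fake" response
coef(lm(fakey ~ g - 1))          # reproduces coef(z); vcovHC() of this fit is what nlsvcovhc() returns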