
The R-squared and nonlinear regression: a difficult marriage?


Making sure that a fitted model gives a good description of the observed data is a fundamental step of every nonlinear regression analysis. To this aim we can (and should) use several techniques, either graphical or based on formal hypothesis testing. However, in the end, I must admit that I often feel the need to display a simple index, based on a single and widely understood value, that reassures the readers about the goodness of fit of my models.

In linear regression, we already have such an index, known as the R2 or the coefficient of determination. In words, this index represents the proportion of variance in the dependent variable that is explained by the regression effects. It ranges from 0 to 1 and, within this interval, the higher the value, the better the fit. The expression is:

\[ R^2 = \frac{SS_{reg}}{SS_{tot}}\]

and it represents the ratio between the regression sum of squares (\(SS_{reg}\)) and the total sum of squares (\(SS_{tot}\)). This ratio equals the proportion of explained variance when we work with the population variances, i.e. when both \(SS_{reg}\) and \(SS_{tot}\) are divided by the number of observations (and not by the number of degrees of freedom). In the above expression:

\[SS_{reg} = \sum_{i = 1}^{n}{\left(\hat{y_i} - \bar{y} \right)^2}\]

and:

\[SS_{tot} = \sum_{i = 1}^{n}{\left(y_i - \bar{y} \right)^2}\]

If we also consider the squared residuals from the regression line, we can define the residual sum of squares as:

\[SS_{res} = \sum_{i = 1}^{n}{\left(y_i - \hat{y_i} \right)^2}\]

where: \(y_i\) is the i-th observation, \(\hat{y_i}\) is the i-th fitted value and \(\bar{y}\) is the overall mean.
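To make these formulas concrete, here is a minimal sketch (with made-up data, used only for illustration) that computes the three sums of squares by hand for a simple linear fit:

x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
fit <- lm(y ~ x)
yhat <- fitted(fit)   # fitted values
ybar <- mean(y)       # overall mean

SSreg <- sum((yhat - ybar)^2)   # regression sum of squares
SStot <- sum((y - ybar)^2)      # total sum of squares
SSres <- sum((y - yhat)^2)      # residual sum of squares
SSreg / SStot                   # same value as summary(fit)$r.squared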

In all linear models with an intercept term, the following equality holds:

\[SS_{tot} = SS_{reg} + SS_{res}\]

Therefore, we always have \(SS_{reg} \leq SS_{tot}\), which implies that the R2 value can never be higher than 1 or lower than 0. Furthermore, we can write the alternative (and equivalent) definition:

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\]

Now, the question is:
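Continuing the sketch above, we can verify both the decomposition and the equivalence of the two definitions numerically:

all.equal(SStot, SSreg + SSres)              # TRUE for a linear model with intercept
all.equal(SSreg / SStot, 1 - SSres / SStot)  # TRUE: the two definitions coincide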

Can we use the R-squared in nonlinear regression?

Basically, we have two problems:

  1. nonlinear models do not have an intercept term, at least, not in the usual sense;
  2. the equality \(SS_{tot} = SS_{reg} + SS_{res}\) may not hold, as the sketch just below shows.
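Here is such a sketch (again with made-up data), fitting a self-starting Michaelis-Menten model with nls() and checking the decomposition:

X <- c(0.5, 1, 2, 5, 10, 20)
Y <- c(1.8, 3.2, 4.9, 6.6, 7.6, 8.1)
nlfit <- nls(Y ~ SSmicmen(X, Vm, K))
yhat <- fitted(nlfit); ybar <- mean(Y)
SSreg <- sum((yhat - ybar)^2)
SSres <- sum((Y - yhat)^2)
SStot <- sum((Y - ybar)^2)
isTRUE(all.equal(SStot, SSreg + SSres))  # typically FALSE for a nonlinear fit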

For these reasons, most authors advise against using the R2 in nonlinear regression analysis and recommend alternative measures, such as the Mean Squared Error (MSE; see Ratkowsky, 1990) or the AIC and BIC (see Spiess and Neumeyer, 2010). I would argue that the R2 retains a superior intuitive appeal, insofar as it is bounded above by 1 for a perfectly fitting model; with such a constraint, we can immediately see how good the fit of our model is.

Schabenberger and Pierce (2002) recommend the following statistic, which is similar to the R2 for linear models:

\[ \textrm{Pseudo-}R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\]

Why is it a ‘Pseudo-R2’? In contrast to what happens with linear models, this statistic:

  1. cannot exceed 1, but may be lower than 0;
  2. cannot be interpreted as the proportion of variance explained by the model.

Bearing these two limitations in mind, there is no reason why we should not use such a goodness-of-fit measure in nonlinear regression. Along these lines, the R2nls() function in the ‘aomisc’ package can be used to retrieve the R2 and Pseudo-R2 values from a nonlinear model fitted with the nls() and drm() functions.

library(aomisc)
X <- c(0.1, 5, 7, 22, 28, 39, 46, 200)
Y <- c(1, 13.66, 14.11, 14.43, 14.78, 14.86, 14.78, 14.91)

# drm fit
model <- drm(Y ~ X, fct = MM.2())
R2nls(model)$PseudoR2
## [1] 0.9930399

# nls fit (same Michaelis-Menten model, so the value is the same)
model2 <- nls(Y ~ SSmicmen(X, Vm, K))
R2nls(model2)$PseudoR2
## [1] 0.9930399
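If you prefer not to depend on the ‘aomisc’ package, the same Pseudo-R2 can be computed by hand from the nls() fit above; a minimal sketch:

SSres <- deviance(model2)       # residual sum of squares
SStot <- sum((Y - mean(Y))^2)   # total sum of squares
1 - SSres / SStot               # should match the R2nls() value above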

Undoubtedly, the Pseudo-R2 gives, at first glance, a good feel for the quality of our regressions; but, please, do not abuse it. In particular, remember that a negative value may indicate a big problem with the fitted model. Above all, remember that the Pseudo-R2, similarly to the R2 in multiple linear regression, should never be used as the basis for selecting and comparing alternative regression models; other statistics should be used to this aim.
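For instance, a negative Pseudo-R2 arises whenever the model predicts worse than a flat line through the mean of the observations; a contrived sketch, using hypothetical, deliberately poor fixed predictions on the data above:

bad_pred <- rep(20, length(Y))          # hypothetical, deliberately poor predictions
SSres_bad <- sum((Y - bad_pred)^2)
1 - SSres_bad / sum((Y - mean(Y))^2)    # negative: worse than fitting the mean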

Thanks for reading and happy coding!

Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
andrea.onofri@unipg.it


References

  1. Ratkowsky, D.A., 1990. Handbook of nonlinear regression models. Marcel Dekker Inc., New York.
  2. Schabenberger, O., Pierce, F.J., 2002. Contemporary statistical models for the plant and soil sciences. CRC Press, Boca Raton.
  3. Spiess, A.N., Neumeyer, N., 2010. An evaluation of \(R^2\) as an inadequate measure for nonlinear models in pharmacological and biochemical research: a Monte Carlo approach. BMC Pharmacology 10, 6.
