Positive coefficient regression in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’m currently working on a paper about simulating urban demands for electricity and gas at 5 minute resolution. To do this, I have a simple regression model that tries to explain observed consumption based on local population figures and simulated levels of activity demands (e.g. minutes spent at work, leisure, etc). The data set looks like this, where each of the letters is an activity code:
> head(data) zone pop elec gas A B I J L M O R S W 1 0 7412 46221768 124613714 0 0 0 0 0 0 0 0 0 0 2 4 7428 37345875 100944002 60 3060 120 1020 0 0 4900 390 510 28635 3 7 7464 20914281 64109628 0 1155 510 255 0 0 11000 225 2475 29580 4 10 7412 46221768 124613714 0 680 0 390 0 0 5145 0 0 9300 5 14 7128 69233086 36611811 0 1335 105 210 60 0 5970 0 2520 14910 6 17 7608 40783190 59343776 0 150 0 150 0 0 1500 0 0 555
I then performed a basic regression using lm
, removing the intercept (the “- 1″) as I want the population coefficient to serve a similar purpose for this analysis:
lm.elec <- lm(elec/365 ~ pop + A + B + I + J + L + M + O + R + S + W - 1, data)
Using Andrew Gelman’s helpful arm package, I can get a quick overview of the result as shown below. It doesn’t look too bad at first, but then I noticed all sorts of negative coefficients for the activity levels. This makes no sense: when individuals perform activities such as going to work, going to school, or shopping, we expect that their demand for electricity should go up, not down.
> display(lm.elec) lm(formula = elec/365 ~ pop + A + B + I + J + L + M + O + R + S + W - 1, data = data) coef.est coef.se pop 13.99 1.35 A 177.07 83.87 B -63.90 27.80 I -0.80 21.05 J 91.27 34.19 L -1075.76 391.53 M -5.57 73.50 O 55.47 9.90 R -336.14 178.98 S 28.12 3.84 W -8.19 3.16 --- n = 391, k = 11 residual sd = 159994.17, R-Squared = 0.65
Since a linear regression is essentially an optimization problem, my immediate thought was: can I just constrain the coefficient values so that they are all positive? This would mean that some activities might have no significant effect on consumption, but at least they couldn’t have a negative impact. And it turns out, yes, you can do this using the nnls package and function.
The nnls
function is not quite as user-friendly as lm
so the first thing you have to do is manually define your input variable matrix and output vector. For example:
A <- as.matrix(data[,c("pop","A","B","I","J","L","M","O","R","S","W")]) b.elec <- data$elec/365
The analysis can then be run as:
nnls.elec <- nnls(A,b.elec)
Similarly, we can’t use predict
to generate our results and have to manually perform the matrix multiplication. This can be done as shown below.
coef.elec <- coef(nnls.elec) pred.elec.nnls <- as.vector(coef.elec%*%t(A))
This method also doesn’t give you an r2 value per se, but you can estimate it with the following dummy regression:
> lm.elec.dummy <- lm(b.elec~pred.elec.nnls - 1) > display(lm.elec.dummy) lm(formula = b.elec ~ pred.elec.nnls - 1) coef.est coef.se pred.elec.nnls 1.00 0.04 --- n = 391, k = 1 residual sd = 164861.13, R-Squared = 0.62
As you can see, the r2 value is slightly lower in this case than the standard lm
model but in terms of interpreting the coefficients the result makes much more sense. This can be clearly seen in the following graph, where demands for electricity and gas rise during the day as expected, rather than sinking in the basic case.
So there you go. If you ever need to run a regression and ensure that all the coefficients are greater than or equal to zero, nnls is your friend.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.