Regression in R-Ultimate Guide
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Regression in R, In a recent article, we discussed model fitting and selection. However, we haven’t considered how we’ll choose which variables to include in our model.
Simple Linear Regression in r » Guide »
Let’s go over our linear regression model for the mtcars data.frame again.
Regression in R
mtcars.lm <- lm(mpg ~ disp, data=mtcars) summary(mtcars.lm) Call: lm(formula = mpg ~ disp, data = mtcars) Residuals: Min 1Q Median 3Q Max -4.8922 -2.2022 -0.9631 1.6272 7.2305 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 29.599855 1.229720 24.070 < 2e-16 *** disp -0.041215 0.004712 -8.747 9.38e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.251 on 30 degrees of freedom Multiple R-squared: 0.7183, Adjusted R-squared: 0.709 F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
We can observe that disp is a very significant predictor of mpg based on its associated p-value of 9.38e-10. But what exactly do AIC and BIC tell us?
Model Selection in R (AIC Vs BIC) »
broom::glance(lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb, data=mtcars)) r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> 1 0.869 0.807 2.65 13.9 0.000000379 10 -69.9 164. 181. 147. 21 32
Sure, we can see that AIC = 164 and BIC = 181, but those figures don’t tell us whether disp is a good or bad predictor of mpg.
It’s also unclear which parameter is important. Stepwise regression is one solution to this problem.
1. Stepwise forward regression in R
We can start with a basic model in forward selection.
How to Identify Outliers-Grubbs’ Test in R »
lm<- lm(mpg ~ 1, data=mtcars) summary(lm) Call: lm(formula = mpg ~ 1, data = mtcars) Residuals: Min 1Q Median 3Q Max -9.6906 -4.6656 -0.8906 2.7094 13.8094 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 20.091 1.065 18.86 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 6.027 on 31 degrees of freedom
Now we can calculate AIC (or BIC) for this model.
Log Rank Test in R-Survival Curve Comparison »
AIC(lm) [1] 208.7555 BIC(lm) [1] 211.687
We want to know if adding parameters will improve our model’s explanatory power without introducing undue complexity.
In this dataset consider all variables and choose the model that reduces AIC the most.
We repeat this approach until we run out of variables to add to our model or until no variable with a lower AIC is available to choose from.
We can automate this procedure using the stats package’s step() function.
step(lm, scope=list(upper=lm(mpg ~ ., data=mtcars)), direction="forward") Start: AIC=115.94 mpg ~ 1 Df Sum of Sq RSS AIC + wt 1 847.73 278.32 73.217 + cyl 1 817.71 308.33 76.494 + disp 1 808.89 317.16 77.397 + hp 1 678.37 447.67 88.427 + drat 1 522.48 603.57 97.988 + vs 1 496.53 629.52 99.335 + am 1 405.15 720.90 103.672 + carb 1 341.78 784.27 106.369 + gear 1 259.75 866.30 109.552 + qsec 1 197.39 928.66 111.776 <none> 1126.05 115.943 Step: AIC=73.22 mpg ~ wt Df Sum of Sq RSS AIC + cyl 1 87.150 191.17 63.198 + hp 1 83.274 195.05 63.840 + qsec 1 82.858 195.46 63.908 + vs 1 54.228 224.09 68.283 + carb 1 44.602 233.72 69.628 + disp 1 31.639 246.68 71.356 <none> 278.32 73.217 + drat 1 9.081 269.24 74.156 + gear 1 1.137 277.19 75.086 + am 1 0.002 278.32 75.217 Step: AIC=63.2 mpg ~ wt + cyl Df Sum of Sq RSS AIC + hp 1 14.5514 176.62 62.665 + carb 1 13.7724 177.40 62.805 <none> 191.17 63.198 + qsec 1 10.5674 180.60 63.378 + gear 1 3.0281 188.14 64.687 + disp 1 2.6796 188.49 64.746 + vs 1 0.7059 190.47 65.080 + am 1 0.1249 191.05 65.177 + drat 1 0.0010 191.17 65.198 Step: AIC=62.66 mpg ~ wt + cyl + hp Df Sum of Sq RSS AIC <none> 176.62 62.665 + am 1 6.6228 170.00 63.442 + disp 1 6.1762 170.44 63.526 + carb 1 2.5187 174.10 64.205 + drat 1 2.2453 174.38 64.255 + qsec 1 1.4010 175.22 64.410 + gear 1 0.8558 175.76 64.509 + vs 1 0.0599 176.56 64.654 Call: lm(formula = mpg ~ wt + cyl + hp, data = mtcars) Coefficients: (Intercept) wt cyl hp 38.75179 -3.16697 -0.94162 -0.01804
We can see step() has been calculated for our basic model, as well as AIC for each variable we may add to our model, in the first stage of the process.
Wt has the lowest AIC (73.217), hence it is included in our model. The variables cyl and hp are added in subsequent steps of the procedure.
The model now has mpg wt + cyl + hp and is mpg wt + cyl + hp. We can see that adding more variables will raise the AIC, so we’ll stop here and use this as our final model.
How to Make Boxplot in R-Quick Start Guide »
Please note, the AIC values produced with step() differ from those calculated with glance() or AIC (). This is inconvenient, although the values only differ by a small amount. However, inference won’t change much.
2. Stepwise backward elimination in R
In backward elimination, we can start with a comprehensive model.
lmfull <- lm(mpg ~ ., data=mtcars) summary(lmfull) Call: lm(formula = mpg ~ ., data = mtcars) Residuals: Min 1Q Median 3Q Max -3.4506 -1.6044 -0.1196 1.2193 4.6271 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 12.30337 18.71788 0.657 0.5181 cyl -0.11144 1.04502 -0.107 0.9161 disp 0.01334 0.01786 0.747 0.4635 hp -0.02148 0.02177 -0.987 0.3350 drat 0.78711 1.63537 0.481 0.6353 wt -3.71530 1.89441 -1.961 0.0633 . qsec 0.82104 0.73084 1.123 0.2739 vs 0.31776 2.10451 0.151 0.8814 am 2.52023 2.05665 1.225 0.2340 gear 0.65541 1.49326 0.439 0.6652 carb -0.19942 0.82875 -0.241 0.8122 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.65 on 21 degrees of freedom Multiple R-squared: 0.869, Adjusted R-squared: 0.8066 F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07 AIC(lmfull) [1] 163.7098 BIC(lmfull) [1] 181.2986
Here we want to see if deleting factors will reduce the complexity of our model without reducing its explanatory power.
We repeat this approach until we run out of variables to remove from our model or there are no variables left to remove that reduce AIC.
aggregate Function in R- A powerful tool for data frames »
step(lmfull, scope=list(lower=lm), direction="backward") Start: AIC=70.9 mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb Df Sum of Sq RSS AIC - cyl 1 0.0799 147.57 68.915 - vs 1 0.1601 147.66 68.932 - carb 1 0.4067 147.90 68.986 - gear 1 1.3531 148.85 69.190 - drat 1 1.6270 149.12 69.249 - disp 1 3.9167 151.41 69.736 - hp 1 6.8399 154.33 70.348 - qsec 1 8.8641 156.36 70.765 <none> 147.49 70.898 - am 1 10.5467 158.04 71.108 - wt 1 27.0144 174.51 74.280 Step: AIC=68.92 mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb Df Sum of Sq RSS AIC - vs 1 0.2685 147.84 66.973 - carb 1 0.5201 148.09 67.028 - gear 1 1.8211 149.40 67.308 - drat 1 1.9826 149.56 67.342 - disp 1 3.9009 151.47 67.750 - hp 1 7.3632 154.94 68.473 <none> 147.57 68.915 - qsec 1 10.0933 157.67 69.032 - am 1 11.8359 159.41 69.384 - wt 1 27.0280 174.60 72.297 Step: AIC=66.97 mpg ~ disp + hp + drat + wt + qsec + am + gear + carb Df Sum of Sq RSS AIC - carb 1 0.6855 148.53 65.121 - gear 1 2.1437 149.99 65.434 - drat 1 2.2139 150.06 65.449 - disp 1 3.6467 151.49 65.753 - hp 1 7.1060 154.95 66.475 <none> 147.84 66.973 - am 1 11.5694 159.41 67.384 - qsec 1 15.6830 163.53 68.200 - wt 1 27.3799 175.22 70.410 Step: AIC=65.12 mpg ~ disp + hp + drat + wt + qsec + am + gear Df Sum of Sq RSS AIC - gear 1 1.565 150.09 63.457 - drat 1 1.932 150.46 63.535 <none> 148.53 65.121 - disp 1 10.110 158.64 65.229 - am 1 12.323 160.85 65.672 - hp 1 14.826 163.35 66.166 - qsec 1 26.408 174.94 68.358 - wt 1 69.127 217.66 75.350 Step: AIC=63.46 mpg ~ disp + hp + drat + wt + qsec + am Df Sum of Sq RSS AIC - drat 1 3.345 153.44 62.162 - disp 1 8.545 158.64 63.229 <none> 150.09 63.457 - hp 1 13.285 163.38 64.171 - am 1 20.036 170.13 65.466 - qsec 1 25.574 175.67 66.491 - wt 1 67.572 217.66 73.351 Step: AIC=62.16 mpg ~ disp + hp + wt + qsec + am Df Sum of Sq RSS AIC - disp 1 6.629 160.07 61.515 <none> 153.44 62.162 - hp 1 12.572 166.01 62.682 - qsec 1 26.470 179.91 65.255 - am 1 32.198 185.63 66.258 - wt 1 69.043 222.48 72.051 Step: AIC=61.52 mpg ~ hp + wt + qsec + am Df Sum of Sq RSS AIC - hp 1 9.219 169.29 61.307 <none> 160.07 61.515 - qsec 1 20.225 180.29 63.323 - am 1 25.993 186.06 64.331 - wt 1 78.494 238.56 72.284 Step: AIC=61.31 mpg ~ wt + qsec + am Df Sum of Sq RSS AIC <none> 169.29 61.307 - am 1 26.178 195.46 63.908 - qsec 1 109.034 278.32 75.217 - wt 1 183.347 352.63 82.790 Call: lm(formula = mpg ~ wt + qsec + am, data = mtcars) Coefficients: (Intercept) wt qsec am 9.618 -3.917 1.226 2.936
We begin with a full regression model (AIC = 70.9) and end with mpg wt + qsec + am (AIC = 61.31). Did you notice in the backward selection we got a different model!.
3. Bidirectional Elimination in R
Assume we already have a model.
lm.mtcars <- lm(mpg ~ disp + cyl + qsec, data=mtcars) summary(lm.mtcars)
We wish to reduce the AIC by adding or removing variables from this model. We may create a new algorithm called bidirectional elimination by combining forward selection and backward elimination.
We will either add a new variable to our model or delete an existing variable from our model at each stage of this new method, taking the action that results in the greatest reduction in AIC.
Radar Chart in R with ggradar » Beautiful Radar Chart in R »
Call: lm(formula = mpg ~ disp + cyl + qsec, data = mtcars) Residuals: Min 1Q Median 3Q Max -4.6040 -1.8180 -0.5852 1.4493 7.2779 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 40.17553 9.48449 4.236 0.000223 *** disp -0.01871 0.01082 -1.729 0.094870 . cyl -1.84802 0.83926 -2.202 0.036071 * qsec -0.24276 0.40183 -0.604 0.550625 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.089 on 28 degrees of freedom Multiple R-squared: 0.7627, Adjusted R-squared: 0.7372 F-statistic: 29.99 on 3 and 28 DF, p-value: 6.882e-09
We add a new variable, wt, to our model in the first step of our bidirectional elimination process.
step(lm.mtcars, scope=list(upper=lmfull,lower=lm), direction="both") Start: AIC=75.92 mpg ~ disp + cyl + qsec Df Sum of Sq RSS AIC + wt 1 91.588 175.67 64.492 + carb 1 52.250 215.01 70.958 - qsec 1 3.484 270.74 74.334 + hp 1 24.724 242.53 74.813 + am 1 20.026 247.23 75.427 <none> 267.26 75.919 - disp 1 28.525 295.78 77.164 + drat 1 3.167 264.09 77.538 + gear 1 1.083 266.17 77.789 + vs 1 0.000 267.26 77.919 - cyl 1 46.280 313.54 79.030 Step: AIC=64.49 mpg ~ disp + cyl + qsec + wt Df Sum of Sq RSS AIC - disp 1 4.936 180.60 63.378 + am 1 14.254 161.41 63.784 <none> 175.67 64.492 - qsec 1 12.824 188.49 64.746 + hp 1 6.983 168.69 65.194 + drat 1 3.448 172.22 65.857 - cyl 1 19.794 195.46 65.908 + gear 1 1.369 174.30 66.241 + vs 1 1.259 174.41 66.261 + carb 1 1.123 174.54 66.286 - wt 1 91.588 267.26 75.919 Step: AIC=63.38 mpg ~ cyl + qsec + wt Df Sum of Sq RSS AIC + am 1 12.820 167.78 63.022 - qsec 1 10.567 191.17 63.198 <none> 180.60 63.378 - cyl 1 14.859 195.46 63.908 + hp 1 5.385 175.22 64.410 + disp 1 4.936 175.67 64.492 + carb 1 4.552 176.05 64.562 + drat 1 2.669 177.94 64.902 + vs 1 1.245 179.36 65.157 + gear 1 0.331 180.27 65.320 - wt 1 115.177 295.78 77.164 Step: AIC=63.02 mpg ~ cyl + qsec + wt + am Df Sum of Sq RSS AIC - cyl 1 1.501 169.29 61.307 <none> 167.78 63.022 + carb 1 9.042 158.74 63.250 - am 1 12.820 180.60 63.378 + hp 1 7.967 159.82 63.465 + disp 1 6.371 161.41 63.784 + gear 1 0.808 166.98 64.868 + drat 1 0.603 167.18 64.907 + vs 1 0.268 167.52 64.971 - qsec 1 23.262 191.05 65.177 - wt 1 99.706 267.49 75.947 Step: AIC=61.31 mpg ~ qsec + wt + am Df Sum of Sq RSS AIC <none> 169.29 61.307 + hp 1 9.219 160.07 61.515 + carb 1 8.036 161.25 61.751 + disp 1 3.276 166.01 62.682 + cyl 1 1.501 167.78 63.022 + drat 1 1.400 167.89 63.042 + gear 1 0.123 169.16 63.284 + vs 1 0.000 169.29 63.307 - am 1 26.178 195.46 63.908 - qsec 1 109.034 278.32 75.217 - wt 1 183.347 352.63 82.790 Call: lm(formula = mpg ~ qsec + wt + am, data = mtcars) Coefficients: (Intercept) qsec wt am 9.618 1.226 -3.917 2.936
We eliminate a variable, disp, in the second step. We eventually come to a halt at the model mpg~qsec + wt + am, since there is no variable that can be added or removed to reduce AIC.
Conclusion
In conclusion, you now have three alternative stepwise regression methods to choose from, each of which can produce different results.
How do you decide which to use?
The answer is simple, It’s entirely up to you!. It comes down to personal preference and the problem you’re trying to solve.
Sentiment analysis in R » Complete Tutorial »
The post Regression in R-Ultimate Guide appeared first on finnstats.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.