Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
Stepwise regression is a powerful technique used to build predictive models by iteratively adding or removing variables based on statistical criteria. In R, this can be achieved using functions like step() or manually with forward and backward selection.
Example
Forward Stepwise Regression:
# Initialize an empty (intercept-only) model
forward_model <- lm(mpg ~ 1, data = mtcars)

# Forward stepwise regression; the scope lists every candidate predictor
forward_model <- step(forward_model, direction = "forward", scope = formula(lm(mpg ~ ., data = mtcars)))

Start:  AIC=115.94
mpg ~ 1
In simple terms, we start with a model containing no predictors (mpg ~ 1) and, at each step, add the variable that lowers the AIC the most. The search stops when no single addition lowers the AIC any further. Note that step() ranks candidates by AIC, not by p-values.
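A single forward step can also be inspected by hand with add1() from base R's stats package. The sketch below assumes the same mtcars data; the names null_model, full_formula, candidates, and best_first are illustrative and do not come from the original post.

```r
# Intercept-only starting model
null_model <- lm(mpg ~ 1, data = mtcars)

# Formula listing every candidate predictor (lm expands the ".")
full_formula <- formula(lm(mpg ~ ., data = mtcars))

# Score each possible single-variable addition
candidates <- add1(null_model, scope = full_formula)

# Show the candidates ordered from lowest to highest AIC
print(candidates[order(candidates$AIC), ])

# The row with the lowest AIC is the variable a forward search adds first
best_first <- rownames(candidates)[which.min(candidates$AIC)]
print(best_first)
```

Repeating this with the updated model (for example via update()) reproduces the forward search one step at a time, which can be handy when you want to see why a variable was chosen.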
Backward Stepwise Regression:
# Initialize a model with all predictors
backward_model <- lm(mpg ~ ., data = mtcars)

# Backward stepwise regression
backward_model <- step(backward_model, direction = "backward")
Start:  AIC=70.9
mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

       Df Sum of Sq    RSS    AIC
- cyl   1    0.0799 147.57 68.915
- vs    1    0.1601 147.66 68.932
- carb  1    0.4067 147.90 68.986
- gear  1    1.3531 148.85 69.190
- drat  1    1.6270 149.12 69.249
- disp  1    3.9167 151.41 69.736
- hp    1    6.8399 154.33 70.348
- qsec  1    8.8641 156.36 70.765
<none>              147.49 70.898
- am    1   10.5467 158.04 71.108
- wt    1   27.0144 174.51 74.280

Step:  AIC=68.92
mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb

       Df Sum of Sq    RSS    AIC
- vs    1    0.2685 147.84 66.973
- carb  1    0.5201 148.09 67.028
- gear  1    1.8211 149.40 67.308
- drat  1    1.9826 149.56 67.342
- disp  1    3.9009 151.47 67.750
- hp    1    7.3632 154.94 68.473
<none>              147.57 68.915
- qsec  1   10.0933 157.67 69.032
- am    1   11.8359 159.41 69.384
- wt    1   27.0280 174.60 72.297

Step:  AIC=66.97
mpg ~ disp + hp + drat + wt + qsec + am + gear + carb

       Df Sum of Sq    RSS    AIC
- carb  1    0.6855 148.53 65.121
- gear  1    2.1437 149.99 65.434
- drat  1    2.2139 150.06 65.449
- disp  1    3.6467 151.49 65.753
- hp    1    7.1060 154.95 66.475
<none>              147.84 66.973
- am    1   11.5694 159.41 67.384
- qsec  1   15.6830 163.53 68.200
- wt    1   27.3799 175.22 70.410

Step:  AIC=65.12
mpg ~ disp + hp + drat + wt + qsec + am + gear

       Df Sum of Sq    RSS    AIC
- gear  1     1.565 150.09 63.457
- drat  1     1.932 150.46 63.535
<none>              148.53 65.121
- disp  1    10.110 158.64 65.229
- am    1    12.323 160.85 65.672
- hp    1    14.826 163.35 66.166
- qsec  1    26.408 174.94 68.358
- wt    1    69.127 217.66 75.350

Step:  AIC=63.46
mpg ~ disp + hp + drat + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- drat  1     3.345 153.44 62.162
- disp  1     8.545 158.64 63.229
<none>              150.09 63.457
- hp    1    13.285 163.38 64.171
- am    1    20.036 170.13 65.466
- qsec  1    25.574 175.67 66.491
- wt    1    67.572 217.66 73.351

Step:  AIC=62.16
mpg ~ disp + hp + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- disp  1     6.629 160.07 61.515
<none>              153.44 62.162
- hp    1    12.572 166.01 62.682
- qsec  1    26.470 179.91 65.255
- am    1    32.198 185.63 66.258
- wt    1    69.043 222.48 72.051

Step:  AIC=61.52
mpg ~ hp + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- hp    1     9.219 169.29 61.307
<none>              160.07 61.515
- qsec  1    20.225 180.29 63.323
- am    1    25.993 186.06 64.331
- wt    1    78.494 238.56 72.284

Step:  AIC=61.31
mpg ~ wt + qsec + am

       Df Sum of Sq    RSS    AIC
<none>              169.29 61.307
- am    1    26.178 195.46 63.908
- qsec  1   109.034 278.32 75.217
- wt    1   183.347 352.63 82.790
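Here, we begin with a model including all predictors and, at each step, remove the variable whose removal lowers the AIC the most. The search stops when no single removal lowers the AIC any further; on mtcars it ends at mpg ~ wt + qsec + am with AIC = 61.31.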
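The AIC column in the trace can be reproduced by hand. For lm fits, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * edf and so differs from AIC() by an additive constant. The sketch below refits the final model from the trace; the names final_model, aic_manual, and aic_step are illustrative.

```r
# Refit the model that the backward search ends on
final_model <- lm(mpg ~ wt + qsec + am, data = mtcars)

n   <- nrow(mtcars)                    # number of observations
rss <- sum(residuals(final_model)^2)   # residual sum of squares
edf <- length(coef(final_model))       # estimated coefficients, incl. intercept

# AIC as used in the step() trace for linear models
aic_manual <- n * log(rss / n) + 2 * edf

# extractAIC() returns c(edf, AIC); the second element is the criterion
aic_step <- extractAIC(final_model)[2]

print(c(manual = aic_manual, extractAIC = aic_step))
```

Both values should agree with the last "Step: AIC=61.31" line of the trace, which is a quick sanity check when comparing models by hand.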
Both-Direction Stepwise Regression:
# Initialize a model with all predictors
both_model <- lm(mpg ~ ., data = mtcars)

# Both-direction stepwise regression
both_model <- step(both_model, direction = "both")
Start:  AIC=70.9
mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

       Df Sum of Sq    RSS    AIC
- cyl   1    0.0799 147.57 68.915
- vs    1    0.1601 147.66 68.932
- carb  1    0.4067 147.90 68.986
- gear  1    1.3531 148.85 69.190
- drat  1    1.6270 149.12 69.249
- disp  1    3.9167 151.41 69.736
- hp    1    6.8399 154.33 70.348
- qsec  1    8.8641 156.36 70.765
<none>              147.49 70.898
- am    1   10.5467 158.04 71.108
- wt    1   27.0144 174.51 74.280

Step:  AIC=68.92
mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb

       Df Sum of Sq    RSS    AIC
- vs    1    0.2685 147.84 66.973
- carb  1    0.5201 148.09 67.028
- gear  1    1.8211 149.40 67.308
- drat  1    1.9826 149.56 67.342
- disp  1    3.9009 151.47 67.750
- hp    1    7.3632 154.94 68.473
<none>              147.57 68.915
- qsec  1   10.0933 157.67 69.032
- am    1   11.8359 159.41 69.384
+ cyl   1    0.0799 147.49 70.898
- wt    1   27.0280 174.60 72.297

Step:  AIC=66.97
mpg ~ disp + hp + drat + wt + qsec + am + gear + carb

       Df Sum of Sq    RSS    AIC
- carb  1    0.6855 148.53 65.121
- gear  1    2.1437 149.99 65.434
- drat  1    2.2139 150.06 65.449
- disp  1    3.6467 151.49 65.753
- hp    1    7.1060 154.95 66.475
<none>              147.84 66.973
- am    1   11.5694 159.41 67.384
- qsec  1   15.6830 163.53 68.200
+ vs    1    0.2685 147.57 68.915
+ cyl   1    0.1883 147.66 68.932
- wt    1   27.3799 175.22 70.410

Step:  AIC=65.12
mpg ~ disp + hp + drat + wt + qsec + am + gear

       Df Sum of Sq    RSS    AIC
- gear  1     1.565 150.09 63.457
- drat  1     1.932 150.46 63.535
<none>              148.53 65.121
- disp  1    10.110 158.64 65.229
- am    1    12.323 160.85 65.672
- hp    1    14.826 163.35 66.166
+ carb  1     0.685 147.84 66.973
+ vs    1     0.434 148.09 67.028
+ cyl   1     0.414 148.11 67.032
- qsec  1    26.408 174.94 68.358
- wt    1    69.127 217.66 75.350

Step:  AIC=63.46
mpg ~ disp + hp + drat + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- drat  1     3.345 153.44 62.162
- disp  1     8.545 158.64 63.229
<none>              150.09 63.457
- hp    1    13.285 163.38 64.171
+ gear  1     1.565 148.53 65.121
+ cyl   1     1.003 149.09 65.242
+ vs    1     0.645 149.45 65.319
+ carb  1     0.107 149.99 65.434
- am    1    20.036 170.13 65.466
- qsec  1    25.574 175.67 66.491
- wt    1    67.572 217.66 73.351

Step:  AIC=62.16
mpg ~ disp + hp + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- disp  1     6.629 160.07 61.515
<none>              153.44 62.162
- hp    1    12.572 166.01 62.682
+ drat  1     3.345 150.09 63.457
+ gear  1     2.977 150.46 63.535
+ cyl   1     2.447 150.99 63.648
+ vs    1     1.121 152.32 63.927
+ carb  1     0.011 153.43 64.160
- qsec  1    26.470 179.91 65.255
- am    1    32.198 185.63 66.258
- wt    1    69.043 222.48 72.051

Step:  AIC=61.52
mpg ~ hp + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- hp    1     9.219 169.29 61.307
<none>              160.07 61.515
+ disp  1     6.629 153.44 62.162
+ carb  1     3.227 156.84 62.864
+ drat  1     1.428 158.64 63.229
- qsec  1    20.225 180.29 63.323
+ cyl   1     0.249 159.82 63.465
+ vs    1     0.249 159.82 63.466
+ gear  1     0.171 159.90 63.481
- am    1    25.993 186.06 64.331
- wt    1    78.494 238.56 72.284

Step:  AIC=61.31
mpg ~ wt + qsec + am

       Df Sum of Sq    RSS    AIC
<none>              169.29 61.307
+ hp    1     9.219 160.07 61.515
+ carb  1     8.036 161.25 61.751
+ disp  1     3.276 166.01 62.682
+ cyl   1     1.501 167.78 63.022
+ drat  1     1.400 167.89 63.042
+ gear  1     0.123 169.16 63.284
+ vs    1     0.000 169.29 63.307
- am    1    26.178 195.46 63.908
- qsec  1   109.034 278.32 75.217
- wt    1   183.347 352.63 82.790
In both-direction regression, the algorithm considers every single-variable addition (+) and removal (-) at each step and makes whichever change lowers the AIC the most. Here it arrives at the same model as backward elimination, mpg ~ wt + qsec + am.
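To compare the selected models without reading the whole trace, the searches can be rerun quietly with trace = 0 and their terms compared directly. This is a minimal sketch; full_model, backward_quiet, both_quiet, and same_terms are illustrative names, not part of the original post.

```r
# Full model used as the starting point for both searches
full_model <- lm(mpg ~ ., data = mtcars)

# trace = 0 suppresses the step-by-step printout
backward_quiet <- step(full_model, direction = "backward", trace = 0)
both_quiet     <- step(full_model, direction = "both", trace = 0)

# Formulas of the selected models
print(formula(backward_quiet))
print(formula(both_quiet))

# Compare the selected predictors, ignoring their order
backward_terms <- attr(terms(backward_quiet), "term.labels")
both_terms     <- attr(terms(both_quiet), "term.labels")
same_terms     <- setequal(backward_terms, both_terms)
print(same_terms)
```

When the two searches agree, as they do in the traces above, either object can be passed on to summary() or predict().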
Visualizing Data and Model Fit:
Now, let’s visualize the data and model fit using base R plots.
# Scatter plot of mpg vs. hp
plot(mtcars$hp, mtcars$mpg, main = "Scatter Plot of mpg vs. hp", xlab = "hp", ylab = "mpg", pch = 20)

# Reference line: simple regression of mpg on hp
abline(lm(mpg ~ hp, data = mtcars), col = "black", lwd = 2)

# Fitted values of each stepwise model, plotted at each car's own hp
# (hp is left unsorted so that it stays aligned with the fitted values)
points(mtcars$hp, forward_model$fitted.values, col = "red", pch = 20)
points(mtcars$hp, backward_model$fitted.values, col = "blue", pch = 20)
points(mtcars$hp, both_model$fitted.values, col = "green", pch = 20)

legend("topright", legend = c("Forward", "Backward", "Both-Direction"), col = c("red", "blue", "green"), pch = 20)
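This plot shows mpg against hp, with a black reference line from the simple regression of mpg on hp and colored points marking the fitted values of each stepwise model. The colors correspond to the models created earlier. Because the backward and both-direction searches end at the same model (mpg ~ wt + qsec + am), their points coincide and the green points cover the blue ones.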
Visualizing Residuals:
# Residual plots for each model
par(mfrow = c(2, 2))

# Forward stepwise regression residuals
plot(forward_model$residuals, main = "Forward Residuals", ylab = "Residuals")

# Backward stepwise regression residuals
plot(backward_model$residuals, main = "Backward Residuals", ylab = "Residuals")

# Both-direction stepwise regression residuals
plot(both_model$residuals, main = "Both-Direction Residuals", ylab = "Residuals")
These plots help assess how well the models fit the data by examining the residuals.
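The plots above show residuals in the row order of mtcars. A common complement is to plot residuals against fitted values, where curvature or a funnel shape can hint at a misspecified model. The sketch below refits the model selected by the backward search; final_model is an illustrative name.

```r
# Refit the model selected by the backward search
final_model <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Residuals against fitted values
plot(fitted(final_model), resid(final_model),
     main = "Residuals vs. Fitted",
     xlab = "Fitted mpg", ylab = "Residuals", pch = 20)

# Horizontal reference line at zero
abline(h = 0, lty = 2)
```

Points scattered evenly around the dashed line, with no visible pattern, are what a reasonable linear fit would tend to produce.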
Conclusion
Stepwise regression is a valuable tool, but it’s crucial to interpret results cautiously and be aware of potential pitfalls.