Variable selection using automatic methods

[This article was first published on Software for Exploratory Data Analysis and Statistical Modelling, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When a data set has only a small number of variables, we can easily identify by hand a good set of variables and the form they should take in our statistical model. When there are many potentially important variables, manual variable selection soon becomes time-consuming. In that case we may consider using automatic subset selection tools to remove some of the burden of the task.

It should be noted that there is some disagreement about whether it is desirable to use an automated method but this post will focus on the mechanics of doing it rather than the debate about whether to be doing it at all.

The R package leaps has a function, regsubsets, that can be used for best subsets, forward selection and backward elimination, depending on which approach is considered most appropriate for the application under consideration.

In a previous post we used data on CPU performance (the cpus data frame from the MASS package) to illustrate the variable selection process. We load the required packages:

> require(leaps)
> require(MASS)
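
As a quick orientation, the sketch below looks at the columns used in the rest of this post. It assumes only that the MASS package is installed; the object name model_vars is introduced here for illustration and does not appear in the original analysis.

```r
library(MASS)  # provides the cpus data frame

# Response (perf) and the six candidate predictors used in this post
model_vars = c("perf", "syct", "mmin", "mmax", "cach", "chmin", "chmax")

dim(cpus)                    # number of machines and number of columns
str(cpus[, model_vars])      # column types and first few values
summary(cpus[, model_vars])  # ranges, to see how skewed the predictors are
```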

First up we consider best subsets selection. The nvmax argument caps the size of the subsets that are examined, here four variables for illustrative purposes, and the formula specifies the largest possible model, which in this example has six variables:

> reg1 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, nvmax = 4)

A summary for the output from this function is shown here:

> summary(reg1)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + 
    chmax, data = cpus, nvmax = 4)
6 Variables  (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: exhaustive
         syct mmin mmax cach chmin chmax
1  ( 1 ) " "  " "  "*"  " "  " "   " "  
2  ( 1 ) " "  " "  "*"  "*"  " "   " "  
3  ( 1 ) " "  "*"  "*"  " "  " "   "*"  
4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"

The function regsubsets identifies the variables mmin, mmax, cach and chmax as the best four.
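
Reading the asterisks off the printed table is fine for six variables, but the same information can be pulled out programmatically. The following is a minimal sketch, assuming the leaps and MASS packages are installed; it refits the same model so that it runs on its own, and the names s1 and best4 are introduced here for illustration.

```r
library(leaps)  # regsubsets()
library(MASS)   # cpus data

reg1 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
                  data = cpus, nvmax = 4)
s1 = summary(reg1)

# Logical matrix: one row per subset size, one column per term
s1$which

# Variables in the four-variable model, dropping the intercept
# (the printed summary in this post lists mmin, mmax, cach and chmax)
best4 = setdiff(names(which(s1$which[4, ])), "(Intercept)")
best4

# Fitted coefficients of that four-variable model
coef(reg1, id = 4)

# Fit statistics by subset size, useful when the size is not fixed in advance
data.frame(size = 1:4, adjr2 = s1$adjr2, cp = s1$cp, bic = s1$bic)
```

The which matrix is the logical counterpart of the asterisk table, so it is the more convenient object to work with when there are many candidate variables.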

Alternatively we could perform backward elimination, and the function will report the subset it reaches at each size, from one to six variables in this example:

> reg2 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, method = "backward")
> summary(reg2)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + 
    chmax, data = cpus, method = "backward")
6 Variables  (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: backward
         syct mmin mmax cach chmin chmax
1  ( 1 ) " "  "*"  " "  " "  " "   " "  
2  ( 1 ) " "  "*"  " "  " "  " "   "*"  
3  ( 1 ) " "  "*"  "*"  " "  " "   "*"  
4  ( 1 ) " "  "*"  "*"  "*"  " "   "*"  
5  ( 1 ) "*"  "*"  "*"  "*"  " "   "*"  
6  ( 1 ) "*"  "*"  "*"  "*"  "*"   "*"

The subset of four variables is the same for this example as the best subsets approach. The third approach is forward selection, which is requested with method = "forward":

> reg3 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, method = "forward")
> summary(reg3)

The summary has the same layout as the two above, with one row per subset size from one to six.
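
Forward selection is a greedy procedure: it starts from the intercept-only model and, at each step, adds the single variable that reduces the residual sum of squares the most. The base R sketch below illustrates that logic. It is not how leaps implements the search, and forward_by_rss is a hypothetical helper written for this illustration only.

```r
library(MASS)  # cpus data

# Hypothetical helper: greedy forward selection by residual sum of squares
forward_by_rss = function(response, candidates, data) {
  chosen = character(0)
  while (length(chosen) < length(candidates)) {
    remaining = setdiff(candidates, chosen)
    # RSS of the current model extended by each remaining variable in turn
    rss = sapply(remaining, function(v) {
      f = reformulate(c(chosen, v), response = response)
      deviance(lm(f, data = data))  # deviance() of an lm fit is its RSS
    })
    # Keep the variable whose addition gives the smallest RSS
    chosen = c(chosen, remaining[which.min(rss)])
  }
  chosen  # variables in the order they entered the model
}

vars = c("syct", "mmin", "mmax", "cach", "chmin", "chmax")
entry_order = forward_by_rss("perf", vars, cpus)
entry_order
```

Because the first step compares all one-variable models, the first variable to enter is always the best single predictor, which the best subsets output identifies as mmax. Later steps are constrained by earlier choices, which is why forward selection can differ from best subsets at larger sizes.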

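For this data set, which has only six candidate variables, the best subsets and backward elimination outputs above agree on the three- and four-variable models. They diverge only for the smallest models: best subsets starts from mmax and then adds cach, whereas backward elimination ends with mmin and chmax.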
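
With more variables it becomes tedious to compare the printed tables by eye. One possible way to line up the three methods is sketched below, assuming the leaps and MASS packages are installed; the names algos, picks and table_of_picks are introduced here for illustration.

```r
library(leaps)  # regsubsets()
library(MASS)   # cpus data

full = perf ~ syct + mmin + mmax + cach + chmin + chmax
algos = c("exhaustive", "backward", "forward")

# One summary(...)$which matrix per method (rows = subset size)
picks = lapply(algos, function(m) {
  summary(regsubsets(full, data = cpus, method = m))$which
})
names(picks) = algos

# Collapse each row into a readable string of the selected variables
table_of_picks = sapply(picks, function(w) {
  w = w[, colnames(w) != "(Intercept)", drop = FALSE]
  apply(w, 1, function(r) paste(names(r)[r], collapse = " + "))
})

# Rows are subset sizes 1 to 6, columns are the three methods
table_of_picks
```

Rows where the three columns hold the same string are sizes at which the methods agree; any row that differs shows where the stepwise searches part ways with the exhaustive one.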

To leave a comment for the author, please follow the link and comment on their blog: Software for Exploratory Data Analysis and Statistical Modelling.
