
FeatureTerminatoR – a package to remove unimportant variables from statistical and machine learning models automatically

[This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers].

The motivation for this package is simple: while there are many packages that do similar things, few of them perform automated removal of features from your models. That was the motivation, plus having the methods all in one place so you can find them easily; otherwise you would be looking through the caret, tidymodels and mlr3 documentation all day long.

What does the package do?

The package, as of today, has two feature selection methods baked into it. These are described in detail below.

Recursive Feature Elimination

The trick is to use cross-validation, or repeated cross-validation, to eliminate n features from the model. This is achieved by fitting the model multiple times, removing the weakest features at each step, as determined either by the model coefficients or by the model's feature importance attributes.

Within the package there are a number of different selection types you can utilise: for example, caret's built-in lmFuncs, rfFuncs, nbFuncs and treebagFuncs helper sets.

See the underlying caretFuncs() documentation.

The package implements all of these methods. I will utilise the random forest variable importance selection method, as this is quick to train on our test dataset.
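
To make this concrete, below is a minimal sketch of cross-validated RFE using the underlying caret machinery directly; the iris dataset, the seed and the subset sizes are illustrative choices rather than anything prescribed by the package:

```r
# A minimal sketch of cross-validated RFE with random forest importance,
# using caret directly (the machinery the package builds on).
# The dataset, seed and subset sizes are illustrative choices.
library(caret)

data(iris)
x <- iris[, 1:4]    # candidate predictors
y <- iris$Species   # outcome

# 10-fold cross-validation with random forest variable importance
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

set.seed(123)
rfe_fit <- rfe(x, y, sizes = 1:3, rfeControl = ctrl)

rfe_fit             # accuracy at each candidate subset size
predictors(rfe_fit) # the features that survived elimination
```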

Removing Highly Correlated Features – multicol_terminatoR

The main reason you would want to do this is to avoid multicollinearity. This effect arises when there are high intercorrelations among two or more independent variables in a linear model. It is less of a problem with non-linear models, such as trees, but it can still cause high variance in the fitted models, so scaling of the independent variables is always recommended.
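
Here is a minimal sketch of the underlying idea using caret's findCorrelation(), which flags features whose pairwise correlation exceeds a cutoff; the 0.9 cutoff is an illustrative choice, not necessarily the package's default:

```r
# A minimal sketch of removing highly correlated features with
# caret's findCorrelation(); the 0.9 cutoff is illustrative.
library(caret)

data(iris)
x <- iris[, 1:4]   # numeric features only

cor_mat  <- cor(x)                                 # pairwise correlation matrix
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9) # columns to remove

# Drop the flagged columns (guarding against an empty index vector)
x_reduced <- if (length(drop_idx) > 0) x[, -drop_idx] else x
names(x_reduced)
```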

Why bother about multicollinearity?

In general, multicollinearity leads to wider confidence intervals, which makes estimates of the effects of the independent variables in a model less reliable. That is, the statistical inferences from a model with multicollinearity may not be dependable.


This is why you would want to remove highly correlated features.

Where can you learn how to use the package?

The associated vignette is the best place to learn about all the features and how to use them. Also, see the supporting GitHub repository for help with installation and getting started.

Alternatively, I have made a YouTube video for this version to get you familiar with how the package works:

What’s next for the package?

The package is at its first version on CRAN. However, I have plans for the next set of developments.

The package can be downloaded from CRAN, or installed from within your R environment.
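
For example:

```r
# Install the released version from CRAN, then load it
install.packages("FeatureTerminatoR")
library(FeatureTerminatoR)
```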

Among the planned developments is selection via lasso (L1) regularisation. The lasso cost function to be minimised takes the form

$$\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{m} x_{ij} a_j\right)^2 + \alpha \sum_{j=1}^{m} \lvert a_j \rvert$$

where $a_j$ is the coefficient of the j-th feature. The final term is called the L1 penalty and $\alpha$ is a hyperparameter that tunes the intensity of this penalty term. The higher the (absolute) coefficient of a feature, the higher the value of the cost function.
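
As a sketch of what lasso-based selection could look like, here is a cross-validated lasso fit with glmnet, where alpha = 1 gives the pure L1 penalty above; the dataset, seed and lambda choice are illustrative:

```r
# A sketch of lasso-based feature selection with glmnet; alpha = 1
# is the pure L1 penalty. Dataset, seed and lambda are illustrative.
library(glmnet)

data(mtcars)
x <- as.matrix(mtcars[, -1])   # predictors
y <- mtcars$mpg                # outcome

set.seed(123)
cv_fit <- cv.glmnet(x, y, alpha = 1)   # cross-validated lasso

# Coefficients shrunk exactly to zero are the "terminated" features
coef(cv_fit, s = "lambda.1se")
```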
