New Features
As promised, we kept working on our bounceR package. For one, we changed the interface: users no longer have to choose a number of tuning parameters that – thanks to my somewhat cryptic documentation – sound more complicated than they are. Inspired by H2o.ai's option of letting the user set the time he or she wants to wait instead of a bunch of cryptic tuning parameters, we added a similar feature.
Further, we changed the source code quite a bit. Henrik Bengtsson gave a very inspiring talk on parallelization with the genius future package at this year's eRum conference. A couple of days later, Davis Vaughan released furrr, an incredibly smart – kudos – wrapper on top of the no-less genius purrr package. Davis' package combines purrr's mapping functions with future's parallelization madness. As you can tell, I am a big fan of all three packages. Inspired by these developments, we wanted to make use of them in our own package, so the entire parallelization setup of bounceR now leverages furrr. This way, the parallelization is much smarter, faster, and works seamlessly across operating systems.
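In case you have never seen the furrr/future combo in action, here is a minimal sketch of the pattern – just an illustration of the idea, not bounceR's actual internals: you pick a backend with plan(), and future_map() then behaves like purrr::map(), only spread across the workers.
# minimal sketch of the furrr/future pattern – an illustration, not bounceR's internals
library(future)
library(furrr)

# choose a parallel backend; plan(sequential) switches back to serial execution
plan(multisession, workers = 4)

# toy example: fit many bootstrap models in parallel, just like purrr::map() would serially
fits <- future_map(1:100, function(i) {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  lm(mpg ~ ., data = boot)
})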
Practical Example
So, let's see how you can use it now. Let's start by installing the package.
# you need devtools, because we are just about to get the package to CRAN, but we are not there yet
library(devtools)

# now you are good to go
devtools::install_github("STATWORX/bounceR")

# now you can load it like every normal package
library(bounceR)
To show how the feature selection works, we need some data, so let's simulate some with our sim_data() function.
# simulate some data
data <- sim_data(n = 100, modelvars = 10, noisevars = 300)
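A quick sanity check never hurts. Note that the exact column layout is my assumption here: one target column named "y" (matching the call below) plus the simulated signal and noise features.
# quick look at the simulated data; assuming one target column plus 10 + 300 feature columns
dim(data)
head(colnames(data))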
As you can all imagine, with 310 features on 100 observations, building models is a little challenging. To be able to model the target nonetheless, you need to reduce your feature space. There are numerous ways to do so; in my last blog post I described our solution. Let's see how to use our algorithm.
# run our algorithm
selected_features <- featureSelection(data = data,
                                      target = "y",
                                      max_time = "30 mins",
                                      bootstrap = "regular",
                                      early_stopping = TRUE,
                                      parallel = TRUE)
What can you expect to get out of it? The function returns a list containing, of course, the optimal formula calculated by our algorithm. You also get a stability matrix, which gives you a ranking of the features by importance. Additionally, we built in some convenient S4 methods, so you can easily access all the information you need.
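Just to illustrate what you might do with such a formula downstream – the formula below is a hand-written stand-in with made-up feature names, not the one returned by the package; in practice you would pull the real one out of the result object via the S4 methods – you can plug it into any modelling function you like.
# hand-written stand-in for the formula returned by featureSelection();
# the feature names here are made up for illustration only
best_formula <- y ~ feature_1 + feature_2 + feature_3

# refit an ordinary linear model on the selected features
fit <- lm(best_formula, data = data)
summary(fit)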
Outlook
I hope I could tease you a little into checking out the package and helping us improve it further. Currently, we are developing two new algorithms for feature selection, which we will implement in the next iteration as well. I am looking forward to your comments, issues and thoughts on the package.
Cheers Guys!
About the author
Lukas Strömsdörfer
The post bounceR 0.1.2: Automated Feature Selection first appeared on STATWORX.