
How to do feature engineering in R with the recipes package


I was excited to start using the new set of ‘tidymodels’ packages from Max Kuhn (creator of caret) – rsample, recipes, yardstick, parsnip and dials. These are still under development but seem promising. The one I have found most useful so far is recipes. Here I’ll give a quick overview of how you use it to do some simple data preparation for machine learning.

R’s approach to machine learning has always been a bit haphazard and fragmented. There has never been an equivalent to Python’s scikit-learn. I have never really got along with caret (the main contender) or mlr. I found their APIs difficult to learn, and I’ve never liked the amount of control you give up by using them. I like the fact that this new set of packages is modular, so it can be used without fully giving up on other approaches.

Recipes

Basically, recipes provides a bunch of tools for preparing data and creating design matrices – a form of feature engineering. These matrices can then be used as training data for ML models. This is done in four steps:

  1. Create a recipe made up of steps (e.g. missing data imputation and skew correction – many are provided in the package)
  2. Prep that recipe using the training data (e.g. use the training data to learn imputation values)
  3. Create a model matrix by applying the prepped recipe to the training data
  4. (Optional) Create another model matrix using the same steps applied to a new dataset (say, a test or production dataset).

Here is a quick example that does median imputation, then centres and scales the airquality dataset, to give an idea of how it works.

library(recipes)
aq_train = airquality[1:100, ]
aq_test = airquality[-(1:100), ]

#make recipe
recipe_1 = recipe(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day,
                  data = aq_train) %>%
  #add steps
  step_medianimpute(all_numeric()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  #prep recipe
  prep(training = aq_train, retain = TRUE, verbose = TRUE)

#make model matrices
mm_train = bake(recipe_1, new_data = aq_train, composition = 'matrix')
mm_test = bake(recipe_1, new_data = aq_test, composition = 'matrix')
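
As an aside on the retain = TRUE argument above: the processed training data is kept inside the prepped recipe, so you can also pull it back out with juice() rather than re-baking it. A small sketch – here this is equivalent to the mm_train line above:

#retain = TRUE keeps the processed training set inside the prepped recipe,
#so juice() returns it without re-applying the steps to aq_train
mm_train_juiced = juice(recipe_1, composition = 'matrix')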

After doing this you can go off and do what you want with the model matrix. Changing the composition argument lets you get a “tibble”, “matrix”, “data.frame”, or “dgCMatrix”.
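
For example, a minimal sketch of feeding the baked matrices into a model – lm here is just a stand-in for whatever model you actually want to fit:

#convert the baked matrices back to data frames and fit a plain linear model
df_train = as.data.frame(mm_train)
df_test = as.data.frame(mm_test)
fit = lm(Ozone ~ ., data = df_train)
preds = predict(fit, newdata = df_test)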

This approach is flexible and allows a prepped recipe to be applied to a new dataset, avoiding data leakage problems. A list of available step functions is in the package documentation, and user defined steps can also be written.
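
For instance, a sketch of applying the prepped recipe to a hypothetical batch of new observations (new_batch is made up for illustration) – the medians, means and standard deviations learned from aq_train are simply reused, and nothing is re-estimated from the new data:

#a made-up 'production' row with a missing value; the training medians are reused
new_batch = data.frame(Ozone = NA_real_, Solar.R = 190, Wind = 7.4,
                       Temp = 67, Month = 9, Day = 1)
bake(recipe_1, new_data = new_batch, composition = 'matrix')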

The recipes package is really useful and I’ve been using it a lot lately – it has streamlined a part of my workflow that I’d been struggling with. It still has a few rough edges but is well worth trying out.
