Build and improve a Machine Learning Classification model with TidyModels and R

Gary Hutson

10 months ago

[This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers].


This set of tutorials arose from my desire to use as many machine learning packages as possible. My favourites remain tensorflow, caret, scikit-learn and now TidyModels.

Why TidyModels?

Instead of replacing the modelling package, tidymodels replaces the interface. Better said, tidymodels provides a single set of functions and arguments to define a model. It then fits the model against the requested modelling package.
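As a minimal sketch of that single interface, the snippet below declares two different models (one fitted by stats::glm, one by rpart) and fits both with the same parsnip calls. The toy data frame and its column names are made up for illustration and are not part of the tutorials.

```r
library(parsnip)

# Hypothetical toy data: two numeric predictors and a two-class outcome
set.seed(123)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$outcome <- factor(ifelse(toy$x1 + toy$x2 + rnorm(200) > 0, "yes", "no"))

# Two models from two different packages, declared in the same way
glm_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

# The fitting call is identical, whatever engine sits underneath
glm_fit  <- fit(glm_spec,  outcome ~ x1 + x2, data = toy)
tree_fit <- fit(tree_spec, outcome ~ x1 + x2, data = toy)

# predict() returns a tibble with a .pred_class column in both cases
glm_pred  <- predict(glm_fit,  new_data = toy)
tree_pred <- predict(tree_fit, new_data = toy)
```

Swapping the engine, rather than rewriting the modelling code, is the point: only the set_engine() line changes.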

TidyModels takes a packaged approach to the machine learning pipeline. The main steps in every TidyModels journey are as below:

The preprocessing is carried out by packages such as rsample and recipes. The modelling workhorse is parsnip (the Tidy equivalent to caret), and for validation yardstick has some great features. I would add that caret still has the best implementation of confusion matrix tables, and I have authored a package adapted from it, entitled ConfusionTableR.
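To make those steps concrete, here is a rough end-to-end sketch: rsample for the split, recipes for preprocessing, parsnip for the model and yardstick for the evaluation. The patients data is simulated purely as a stand-in, and its columns (age, previous_care, frailty, stranded) are hypothetical rather than the real NHS dataset.

```r
library(tidymodels)

# Hypothetical stand-in data, simulated for illustration only
set.seed(123)
patients <- tibble(
  age           = round(runif(500, 18, 95)),
  previous_care = rpois(500, 2),
  frailty       = factor(sample(c("low", "medium", "high"), 500, replace = TRUE))
) %>%
  mutate(
    stranded = factor(
      if_else(age / 100 + 0.2 * previous_care + rnorm(500, sd = 0.5) > 1,
              "Stranded", "Not Stranded"),
      levels = c("Stranded", "Not Stranded")
    )
  )

# 1. rsample: stratified train/test split
split      <- initial_split(patients, prop = 0.75, strata = stranded)
train_data <- training(split)
test_data  <- testing(split)

# 2. recipes: scale numeric predictors, dummy-encode the factor
rec <- recipe(stranded ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# 3. parsnip (inside a workflow): a simple logistic regression
wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(data = train_data)

# 4. yardstick: score the held-out data
test_preds <- predict(wf_fit, new_data = test_data) %>%
  bind_cols(test_data %>% select(stranded))

acc <- accuracy(test_preds, truth = stranded, estimate = .pred_class)
```

Bundling the recipe and model in a workflow means the preprocessing learned on the training set is applied unchanged to the test set at prediction time.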

How to learn this approach?

I have created two tutorials that aim to introduce you to the ideas of developing a machine learning classification model.

The aim is to predict whether or not a patient will become a stranded patient. The data come from the stranded_data dataset in the NHSRdatasets package, available on CRAN.

Building our classification model – Tutorial One

The first tutorial looks at doing the preprocessing steps in caret and using parsnip to fit a simple model. Then, I show you how to evaluate your model with yardstick:
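For a feel of the evaluation step on its own, the sketch below runs a few yardstick functions over a small, hand-written table of predictions. The eight rows are invented for illustration; in the tutorial the predictions come from the fitted model.

```r
library(yardstick)

lvls <- c("Stranded", "Not Stranded")

# Hypothetical predictions: true class, predicted class and the
# predicted probability of the "Stranded" class
preds <- data.frame(
  actual = factor(c("Stranded", "Stranded", "Stranded", "Not Stranded",
                    "Not Stranded", "Not Stranded", "Not Stranded", "Stranded"),
                  levels = lvls),
  .pred_class = factor(c("Stranded", "Stranded", "Not Stranded", "Not Stranded",
                         "Not Stranded", "Stranded", "Not Stranded", "Stranded"),
                       levels = lvls),
  .pred_Stranded = c(0.9, 0.8, 0.4, 0.2, 0.3, 0.6, 0.1, 0.7)
)

# Confusion matrix of predicted vs actual classes
cm <- conf_mat(preds, truth = actual, estimate = .pred_class)

# Several class metrics at once
class_metrics <- metric_set(accuracy, sensitivity, specificity)
scores <- class_metrics(preds, truth = actual, estimate = .pred_class)

# Area under the ROC curve, using the probability of the first factor level
auc <- roc_auc(preds, truth = actual, .pred_Stranded)

# Points for plotting the ROC curve
roc_points <- roc_curve(preds, truth = actual, .pred_Stranded)
```

yardstick treats the first factor level as the event of interest by default, which is why "Stranded" is listed first in the levels.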

The source code for this can be found on the supporting GitHub.

Improving our classification model – Tutorial Two

The next tutorial looks specifically at:

Getting better resamples with K-Fold cross-validation
Improving the model by selecting a better algorithm, i.e. a random forest ensemble
Tuning hyperparameters using the dials package and tuning a decision tree. The formula for cost complexity is detailed here:

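The aim of cost complexity pruning is to trade off how well a decision tree fits the training data against how large the tree is. In its standard form it minimises R(T) + α|T|, where R(T) is the tree's misclassification error, |T| is its number of terminal nodes and α is the cost complexity parameter. A small α permits a large, complex tree that takes longer to train and can overfit, so accuracy on unseen data is weaker. A large α prunes harder, giving a simpler tree that trains faster but may underfit.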
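The resampling and tuning steps can be sketched as follows: 5-fold cross-validation from rsample, a grid of cost complexity values from dials, and tune_grid() to score each candidate. As before, the patients data is simulated and its column names are hypothetical; the grid size and number of folds are arbitrary choices for illustration.

```r
library(tidymodels)

# Hypothetical stand-in data, simulated for illustration only
set.seed(123)
patients <- tibble(
  age           = round(runif(500, 18, 95)),
  previous_care = rpois(500, 2),
  frailty       = factor(sample(c("low", "medium", "high"), 500, replace = TRUE))
) %>%
  mutate(
    stranded = factor(
      if_else(age / 100 + 0.2 * previous_care + rnorm(500, sd = 0.5) > 1,
              "Stranded", "Not Stranded"),
      levels = c("Stranded", "Not Stranded")
    )
  )

# K-fold cross-validation (K = 5), stratified on the outcome
folds <- vfold_cv(patients, v = 5, strata = stranded)

# Decision tree with cost complexity marked for tuning
tree_spec <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# dials supplies a sensible default range for the parameter
tree_grid <- grid_regular(cost_complexity(), levels = 5)

tree_wf <- workflow() %>%
  add_formula(stranded ~ .) %>%
  add_model(tree_spec)

# Fit every candidate on every fold and collect the metrics
tuned <- tune_grid(
  tree_wf,
  resamples = folds,
  grid      = tree_grid,
  metrics   = metric_set(roc_auc, accuracy)
)

# Keep the best candidate and refit it on all of the data
best      <- select_best(tuned, metric = "roc_auc")
final_fit <- finalize_workflow(tree_wf, best) %>%
  fit(data = patients)
```

To try a random forest instead, the model specification could be replaced with rand_forest() and a suitable engine such as ranger, assuming that package is installed; the rest of the workflow stays the same.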

The video below shows you how to go about improving the accuracy of your models:

“What’s up doc?” How to create ML models in CARET

This tutorial focusses on TidyModels, but CARET is a really powerful machine learning library as well. To learn how to use regression and classification models in CARET, then look no further:

The GitHub repository for CARET is here.

Conclusion

TidyModels is great, as it brings the functionality of other modelling packages into one neat Tidy framework. I like the preprocessing options available in recipes, and for evaluation yardstick is excellent for generating nice ROC curves.

Ultimately, you will end up creating these pipelines, doing all the complex ML and then you have the results. You will always get those managers and staff who say:

