How to train and tune machine learning algorithms in a unified way with the mlr R package
I am currently keen on automated machine learning, especially hyperparameter optimization, so recently I have been focusing mainly on frameworks for training models. In this post, I will show how to train ML algorithms and tune their hyperparameters with a random search. I will cover only the basics; the `mlr` package has more sophisticated features, and I strongly encourage you to visit the mlr webpage and explore all the tutorials.
Data set
We will use the `BreastCancer` data set from the `mlbench` package and perform binary classification. The aim of the model is to predict whether a cancer is benign or malignant (variable `Class`). We remove the first column, which contains the id of a patient, as it is redundant for modeling. To read more about the data set, see its documentation (`?BreastCancer`).
```r
library("mlbench")
data("BreastCancer")
bc <- na.omit(BreastCancer[, -1])
head(bc)
##   Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 1            5         1          1             1            2           1
## 2            5         4          4             5            7          10
## 3            3         1          1             1            2           2
## 4            6         8          8             1            3           4
## 5            4         1          1             3            2           1
## 6            8        10         10             8            7          10
##   Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           3               1       1    benign
## 2           3               2       1    benign
## 3           3               1       1    benign
## 4           3               7       1    benign
## 5           3               1       1    benign
## 6           9               7       1 malignant
```
Installation
First of all, make sure that you have installed `mlr`. It is on CRAN, so you can simply use the `install.packages()` function.

```r
install.packages("mlr")
```
After installation, load `mlr` and set a seed to make the results reproducible.

```r
library(mlr)
set.seed(1)
```
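If you are not sure which algorithm names `mlr` uses, you can list the available learners with `listLearners()`. A quick sketch (the exact set returned depends on which suggested packages you have installed):

```r
# List classification learners; suppress warnings about
# learners whose backing packages are not installed
lrns <- listLearners("classif", warn.missing.packages = FALSE)
head(lrns[, c("class", "package")])
```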
Modeling
Fitting a model
First, you need to define a task. A task is a definition of a machine learning problem. We will use the `makeClassifTask()` function because our problem is classification; for regression it would be `makeRegrTask()`, and for clustering `makeClusterTask()`.
In `makeClassifTask()`, the parameter `id` defines the name of the task, `data` is the data the model will be trained on, and `target` indicates the target variable.
```r
classif_task = makeClassifTask(id = "bc", data = bc, target = "Class")
classif_task
## Supervised task: bc
## Type: classif
## Target: Class
## Observations: 683
## Features:
##    numerics     factors     ordered functionals 
##           0           4           5           0 
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
##    benign malignant 
##       444       239 
## Positive class: benign
```
The second step is defining a model. Please note that we are not training a model yet; we only create an object that describes our algorithm.
```r
classif_lrn = makeLearner("classif.randomForest", par.vals = list(ntree = 200))
```
In the example above we have created an object that defines a classification random forest with 200 trees. To see the hyperparameters, we can simply use the `getParamSet()` function. It returns the names of the hyperparameters, their ranges, and their default values.
```r
getParamSet(classif_lrn)
##                      Type  len   Def   Constr Req Tunable Trafo
## ntree             integer    -   500 1 to Inf   -    TRUE     -
## mtry              integer    -     - 1 to Inf   -    TRUE     -
## replace           logical    -  TRUE        -   -    TRUE     -
## classwt     numericvector <NA>     - 0 to Inf   -    TRUE     -
## cutoff      numericvector <NA>     -   0 to 1   -    TRUE     -
## strata            untyped    -     -        -   -   FALSE     -
## sampsize    integervector <NA>     - 1 to Inf   -    TRUE     -
## nodesize          integer    -     1 1 to Inf   -    TRUE     -
## maxnodes          integer    -     - 1 to Inf   -    TRUE     -
## importance        logical    - FALSE        -   -    TRUE     -
## localImp          logical    - FALSE        -   -    TRUE     -
## proximity         logical    - FALSE        -   -   FALSE     -
## oob.prox          logical    -     -        -   Y   FALSE     -
## norm.votes        logical    -  TRUE        -   -   FALSE     -
## do.trace          logical    - FALSE        -   -   FALSE     -
## keep.forest       logical    -  TRUE        -   -   FALSE     -
## keep.inbag        logical    - FALSE        -   -   FALSE     -
```
Now we are ready to fit a model, simply by using the `train()` function.

```r
model = train(classif_lrn, classif_task)
```
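The fitted model can be used right away for prediction. As a quick sanity check (a sketch only, and not an honest performance estimate, since it predicts on the training data), we can use mlr's `predict()` and `performance()` functions:

```r
# Predict on the training task (for illustration only; use a
# held-out set or resampling for an honest estimate)
pred <- predict(model, task = classif_task)
head(as.data.frame(pred))

# Accuracy of the predictions
performance(pred, measures = acc)
```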
Tuning a model
To tune hyperparameters, we need to specify a search space. To define a range for an integer parameter we use the `makeIntegerParam()` function. All of this is pinned together with the `makeParamSet()` function.
```r
params = makeParamSet(
  makeIntegerParam("mtry", lower = 1, upper = 100),
  makeIntegerParam("ntree", lower = 1L, upper = 500L)
)
```
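Integer parameters are only one option: `mlr` (via the ParamHelpers package) also provides `makeNumericParam()`, `makeDiscreteParam()`, and `makeLogicalParam()` for other hyperparameter types. A hypothetical mixed set might look like this (the parameter names below are illustrative and not tied to `randomForest`):

```r
# A sketch of a mixed parameter set with numeric, discrete,
# and logical hyperparameters
mixed_params <- makeParamSet(
  makeNumericParam("cp", lower = 0.001, upper = 0.1),
  makeDiscreteParam("split", values = c("gini", "information")),
  makeLogicalParam("scale")
)
```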
Now, we use the `makeTuneControlRandom()` function to create an object that defines a random search; its parameter `maxit` sets the number of iterations. The `makeResampleDesc()` function creates an object for a resampling strategy, in this case 3-fold cross-validation. Finally, we can combine all of the previous pieces with the `tuneParams()` function and tune the random forest.
```r
ctrl = makeTuneControlRandom(maxit = 10L)
rdesc = makeResampleDesc("CV", iters = 3L)
res = tuneParams(classif_lrn, task = classif_task, resampling = rdesc,
                 par.set = params, control = ctrl,
                 measures = list(acc), show.info = FALSE)
res
## Tune result:
## Op. pars: mtry=49; ntree=464
## acc.test.mean=0.9707409
```
As a result of tuning, we have obtained the hyperparameters `mtry=49` and `ntree=464`.
More
If you would like to learn more about mlr, visit the mlr webpage. What is more, a new version of this package is coming up: I highly recommend following the news about mlr3.