How to train and tune machine learning algorithms in a unified way with the mlr R package
I am currently keen on automated machine learning, especially hyperparameter optimization, so recently I have been focusing mainly on frameworks for training models. In this post, I will show how to train ML algorithms and tune their hyperparameters with a random search. I will cover only the basics; the `mlr` package has more sophisticated features, and I strongly encourage you to visit the mlr webpage and explore all the tutorials.
Data set
We will use the `BreastCancer` data set from the `mlbench` package and perform binary classification. The aim of the model is to predict whether a cancer is benign or malignant (variable `Class`). We remove the first column, which contains the id of a patient, as it is redundant for modeling. To read more about the data set, see its documentation (`?BreastCancer`).
```r
library("mlbench")
data("BreastCancer")
bc <- na.omit(BreastCancer[, -1])
head(bc)
##   Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 1            5         1          1             1            2           1
## 2            5         4          4             5            7          10
## 3            3         1          1             1            2           2
## 4            6         8          8             1            3           4
## 5            4         1          1             3            2           1
## 6            8        10         10             8            7          10
##   Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           3               1       1    benign
## 2           3               2       1    benign
## 3           3               1       1    benign
## 4           3               7       1    benign
## 5           3               1       1    benign
## 6           9               7       1 malignant
```
Installation
First of all, make sure that you have installed `mlr`. It is on CRAN, so you can simply use the `install.packages()` function.

```r
install.packages("mlr")
```
After installation, load `mlr` and set a seed to make the results reproducible.

```r
library(mlr)
set.seed(1)
```
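If you are not sure which algorithm names `mlr` uses, you can list the available learners with `listLearners()`. A quick sketch (the exact set returned depends on which suggested packages you have installed):

```r
# List classification learners; suppress warnings about
# learners whose backing packages are not installed
lrns <- listLearners("classif", warn.missing.packages = FALSE)
head(lrns[, c("class", "package")])
```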
Modeling
Fitting a model
First, you need to define a task. A task is a definition of a machine learning problem. We will use the `makeClassifTask()` function because our problem is classification; for regression it would be `makeRegrTask()`, and for clustering `makeClusterTask()`.
In `makeClassifTask()`, the parameter `id` defines the name of the task, `data` is the data the model will be trained on, and `target` indicates the target variable.
```r
classif_task = makeClassifTask(id = "bc", data = bc, target = "Class")
classif_task
## Supervised task: bc
## Type: classif
## Target: Class
## Observations: 683
## Features:
##    numerics     factors     ordered functionals 
##           0           4           5           0 
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
##    benign malignant 
##       444       239 
## Positive class: benign
```
The second step is defining a model. Please note that we are not training a model yet; we only create an object that describes our algorithm.
```r
classif_lrn = makeLearner("classif.randomForest", par.vals = list(ntree = 200))
```
In the example above we have created an object that defines a classification random forest with 200 trees. To see the hyperparameters, we can simply use the `getParamSet()` function. It returns the names of the hyperparameters, their ranges, and their default values.
```r
getParamSet(classif_lrn)
##                      Type  len   Def   Constr Req Tunable Trafo
## ntree             integer    -   500 1 to Inf   -    TRUE     -
## mtry              integer    -     - 1 to Inf   -    TRUE     -
## replace           logical    -  TRUE        -   -    TRUE     -
## classwt     numericvector <NA>     - 0 to Inf   -    TRUE     -
## cutoff      numericvector <NA>     -   0 to 1   -    TRUE     -
## strata            untyped    -     -        -   -   FALSE     -
## sampsize    integervector <NA>     - 1 to Inf   -    TRUE     -
## nodesize          integer    -     1 1 to Inf   -    TRUE     -
## maxnodes          integer    -     - 1 to Inf   -    TRUE     -
## importance        logical    - FALSE        -   -    TRUE     -
## localImp          logical    - FALSE        -   -    TRUE     -
## proximity         logical    - FALSE        -   -   FALSE     -
## oob.prox          logical    -     -        -   Y   FALSE     -
## norm.votes        logical    -  TRUE        -   -   FALSE     -
## do.trace          logical    - FALSE        -   -   FALSE     -
## keep.forest       logical    -  TRUE        -   -   FALSE     -
## keep.inbag        logical    - FALSE        -   -   FALSE     -
```
Now we are ready to fit a model, simply by using the `train()` function.

```r
model = train(classif_lrn, classif_task)
```
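The fitted model can be used right away for prediction. As a quick sanity check (a sketch only, and not an honest performance estimate, since it predicts on the training data), we can use mlr's `predict()` and `performance()` functions:

```r
# Predict on the training task (for illustration only; use a
# held-out set or resampling for an honest estimate)
pred <- predict(model, task = classif_task)
head(as.data.frame(pred))

# Accuracy of the predictions
performance(pred, measures = acc)
```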
Tuning a model
To tune hyperparameters, we need to specify a search space. To define a range for an integer parameter we use the `makeIntegerParam()` function. All of this is pinned together with the `makeParamSet()` function.
```r
params = makeParamSet(
  makeIntegerParam("mtry", lower = 1, upper = 100),
  makeIntegerParam("ntree", lower = 1L, upper = 500L)
)
```
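Integer parameters are only one option: `mlr` (via the ParamHelpers package) also provides `makeNumericParam()`, `makeDiscreteParam()`, and `makeLogicalParam()` for other hyperparameter types. A hypothetical mixed set might look like this (the parameter names below are illustrative and not tied to `randomForest`):

```r
# A sketch of a mixed parameter set with numeric, discrete,
# and logical hyperparameters
mixed_params <- makeParamSet(
  makeNumericParam("cp", lower = 0.001, upper = 0.1),
  makeDiscreteParam("split", values = c("gini", "information")),
  makeLogicalParam("scale")
)
```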
Now, we use the `makeTuneControlRandom()` function to create an object that defines a random search; its parameter `maxit` sets the number of iterations. The `makeResampleDesc()` function creates an object for a resampling strategy, in this case 3-fold cross-validation. Finally, we can combine all of the previous pieces with the `tuneParams()` function and tune the random forest.
```r
ctrl = makeTuneControlRandom(maxit = 10L)
rdesc = makeResampleDesc("CV", iters = 3L)
res = tuneParams(classif_lrn, task = classif_task, resampling = rdesc,
                 par.set = params, control = ctrl,
                 measures = list(acc), show.info = FALSE)
res
## Tune result:
## Op. pars: mtry=49; ntree=464
## acc.test.mean=0.9707409
```
As a result of tuning, we have obtained the hyperparameters `mtry=49` and `ntree=464`.
More
If you would like to learn more about mlr, visit the mlr webpage. What is more, a new version of this package is coming up: I highly recommend following the news about mlr3.