mlr: Machine Learning in R – basics
How to train and tune machine learning algorithms in a unified way? With the mlr R package.
I am currently keen on automated machine learning, especially hyperparameter optimization, so lately I have mainly been exploring frameworks for unified model training. In this post, I will show how to train ML algorithms and tune them using random search. I cover only the basics here, but the mlr package has more sophisticated features, and I strongly encourage you to visit the mlr webpage and explore all the tutorials.
Data set
We will use the BreastCancer data set from the mlbench package and perform binary classification. The aim of the model is to predict whether a cancer is benign or malignant (variable Class). It is worth removing the first column, which contains the patient id, as it is redundant for modeling. The code below also drops rows with missing values using na.omit(). To read more about the data set, see the documentation (?BreastCancer).
library("mlbench")
data("BreastCancer")
bc <- na.omit(BreastCancer[ ,-1])
head(bc)
##   Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
## 1            5         1          1             1            2           1
## 2            5         4          4             5            7          10
## 3            3         1          1             1            2           2
## 4            6         8          8             1            3           4
## 5            4         1          1             3            2           1
## 6            8        10         10             8            7          10
##   Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           3               1       1    benign
## 2           3               2       1    benign
## 3           3               1       1    benign
## 4           3               7       1    benign
## 5           3               1       1    benign
## 6           9               7       1 malignant
Installation
First of all, make sure that you have installed mlr. It is on CRAN, so you can simply use the install.packages() function.
install.packages("mlr")
After installation, load mlr and set a seed to make the results reproducible.
library(mlr)
set.seed(1)
Modeling
Fitting a model
First, you need to define a task. The task is the definition of a machine learning problem.
Our problem is classification, therefore we use the makeClassifTask() function. For regression, it would be makeRegrTask(), and for clustering makeClusterTask().
The parameter id defines the name of the task, data is the data the model will be trained on, and target indicates the target variable.
classif_task = makeClassifTask(id = "bc", data = bc, target = "Class")
classif_task
## Supervised task: bc
## Type: classif
## Target: Class
## Observations: 683
## Features:
##    numerics     factors     ordered functionals
##           0           4           5           0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
##    benign malignant
##       444       239
## Positive class: benign
The second step is defining a model. Please note that we do not train the model yet. We only create an object that describes our algorithm.
classif_lrn = makeLearner("classif.randomForest", par.vals = list(ntree = 200))
In the example above, we have created an object that defines a classification random forest with 200 trees. To get the names of the hyperparameters, their ranges, and default values, use the function getParamSet().
getParamSet(classif_lrn)
##                      Type  len   Def   Constr Req Tunable Trafo
## ntree             integer    -   500 1 to Inf   -    TRUE     -
## mtry              integer    -     - 1 to Inf   -    TRUE     -
## replace           logical    -  TRUE        -   -    TRUE     -
## classwt     numericvector <NA>     - 0 to Inf   -    TRUE     -
## cutoff      numericvector <NA>     -   0 to 1   -    TRUE     -
## strata            untyped    -     -        -   -   FALSE     -
## sampsize    integervector <NA>     - 1 to Inf   -    TRUE     -
## nodesize          integer    -     1 1 to Inf   -    TRUE     -
## maxnodes          integer    -     - 1 to Inf   -    TRUE     -
## importance        logical    - FALSE        -   -    TRUE     -
## localImp          logical    - FALSE        -   -    TRUE     -
## proximity         logical    - FALSE        -   -   FALSE     -
## oob.prox          logical    -     -        -   Y   FALSE     -
## norm.votes        logical    -  TRUE        -   -   FALSE     -
## do.trace          logical    - FALSE        -   -   FALSE     -
## keep.forest       logical    -  TRUE        -   -   FALSE     -
## keep.inbag        logical    - FALSE        -   -   FALSE     -
Now we are ready to fit a model. We simply call the function train() with the specified learner and task.
model = train(classif_lrn, classif_task)
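Once a model is fitted, a natural next step is to check how well it predicts. The post does not cover this, so the following is only a minimal sketch: it assumes a random 70/30 train/test split (the ratio and the names train_idx, test_idx and perf are illustrative, not from the original) and uses mlr's predict() and performance() functions.

```r
library(mlbench)
library(mlr)

set.seed(1)
data("BreastCancer")
bc <- na.omit(BreastCancer[, -1])

classif_task <- makeClassifTask(id = "bc", data = bc, target = "Class")
classif_lrn <- makeLearner("classif.randomForest", par.vals = list(ntree = 200))

# Hold out 30% of the rows for testing (the split ratio is an assumption).
n <- getTaskSize(classif_task)
train_idx <- sample(n, size = round(0.7 * n))
test_idx <- setdiff(seq_len(n), train_idx)

# Fit on the training rows only, then predict the held-out rows.
model <- train(classif_lrn, classif_task, subset = train_idx)
pred <- predict(model, task = classif_task, subset = test_idx)

# Accuracy (acc) and mean misclassification error (mmce) on the held-out rows.
perf <- performance(pred, measures = list(acc, mmce))
perf
```

By definition, acc and mmce sum to one, so either measure alone is enough for a quick check.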
Tuning a model
To tune hyperparameters, we need to specify a search space. To define the range of an integer parameter, we use the function makeIntegerParam(). All of these are combined with the function makeParamSet().
params = makeParamSet(
  makeIntegerParam("mtry", lower = 1, upper = 100),
  makeIntegerParam("ntree", lower = 1L, upper = 500L)
)
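One detail worth noting: mtry is the number of features sampled at each split, and this task has only 9 features, so values above 9 are not meaningful (as far as I know, randomForest resets an out-of-range mtry with a warning). A possible variant, shown here only as a sketch, bounds mtry by the number of features using getTaskNFeats(). The name params_bounded is illustrative and not part of the original post.

```r
library(mlbench)
library(mlr)

data("BreastCancer")
bc <- na.omit(BreastCancer[, -1])
classif_task <- makeClassifTask(id = "bc", data = bc, target = "Class")

# Number of features in the task (4 factors + 5 ordered factors = 9).
n_feats <- getTaskNFeats(classif_task)

# Same search space as in the post, but with mtry capped at the feature count.
params_bounded <- makeParamSet(
  makeIntegerParam("mtry", lower = 1L, upper = n_feats),
  makeIntegerParam("ntree", lower = 1L, upper = 500L)
)
params_bounded
```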
Now we use the function makeTuneControlRandom() to create an object that defines a random search. The parameter maxit defines the number of iterations. The function makeResampleDesc() creates an object describing a resampling strategy, in this case 3-fold cross-validation. Finally, we combine all of the previous pieces with the function tuneParams() and tune the random forest.
ctrl = makeTuneControlRandom(maxit = 10L)
rdesc = makeResampleDesc("CV", iters = 3L)
res = tuneParams(classif_lrn, task = classif_task, resampling = rdesc,
                 par.set = params, control = ctrl, measures = list(acc),
                 show.info = FALSE)
res
## Tune result:
## Op. pars: mtry=49; ntree=464
## acc.test.mean=0.9707409
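Random search is only one of the tuning strategies mlr offers. As a hedged sketch of an alternative that the post itself does not run, the control object can be swapped for makeTuneControlGrid(), which evaluates a regular grid instead of random points. The resolution of 3 and the mtry upper bound of 9 (the number of features in this task) are assumptions chosen to keep the example small, and the names ctrl_grid and res_grid are illustrative.

```r
library(mlbench)
library(mlr)

set.seed(1)
data("BreastCancer")
bc <- na.omit(BreastCancer[, -1])

classif_task <- makeClassifTask(id = "bc", data = bc, target = "Class")
classif_lrn <- makeLearner("classif.randomForest")

# Assumed search space: mtry capped at the 9 available features.
params <- makeParamSet(
  makeIntegerParam("mtry", lower = 1L, upper = 9L),
  makeIntegerParam("ntree", lower = 1L, upper = 500L)
)

# Grid search: 3 values per parameter, i.e. up to 9 combinations.
ctrl_grid <- makeTuneControlGrid(resolution = 3L)
rdesc <- makeResampleDesc("CV", iters = 3L)

res_grid <- tuneParams(classif_lrn, task = classif_task, resampling = rdesc,
                       par.set = params, control = ctrl_grid,
                       measures = list(acc), show.info = FALSE)
res_grid
```

Grid search becomes expensive quickly as parameters are added, which is one reason random search is a common default.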
As a result of tuning, we have obtained the hyperparameters mtry=49 and ntree=464.
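The tuning result only reports the best hyperparameters; it does not return a fitted model. A minimal sketch of the usual follow-up, assuming a smaller search (maxit = 5 and mtry capped at 9, the number of features) to keep it quick: pass the tuned values, stored in res$x, to setHyperPars() and train again on the full task. The names tuned_lrn and tuned_model are illustrative.

```r
library(mlbench)
library(mlr)

set.seed(1)
data("BreastCancer")
bc <- na.omit(BreastCancer[, -1])

classif_task <- makeClassifTask(id = "bc", data = bc, target = "Class")
classif_lrn <- makeLearner("classif.randomForest", par.vals = list(ntree = 200))

# Assumed search space and budget, smaller than in the post.
params <- makeParamSet(
  makeIntegerParam("mtry", lower = 1L, upper = 9L),
  makeIntegerParam("ntree", lower = 1L, upper = 500L)
)
ctrl <- makeTuneControlRandom(maxit = 5L)
rdesc <- makeResampleDesc("CV", iters = 3L)

res <- tuneParams(classif_lrn, task = classif_task, resampling = rdesc,
                  par.set = params, control = ctrl, measures = list(acc),
                  show.info = FALSE)

# res$x is a named list with the best hyperparameter values found.
tuned_lrn <- setHyperPars(classif_lrn, par.vals = res$x)

# Refit on the whole task with the tuned settings.
tuned_model <- train(tuned_lrn, classif_task)
getHyperPars(tuned_lrn)
```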
More
If you would like to learn more about mlr, you can visit the mlr webpage.
What is more, a new version of this package is coming up. I highly recommend reading about mlr3.