Parallel benchmarking with OpenML and mlr

[This article was first published on mlr-org, and kindly contributed to R-bloggers.]

With this post I want to show you how to benchmark several learners (or learners with different parameter settings) on several data sets in a structured and parallelized fashion. For this we will use batchtools.

The data that we will use here is stored on the open machine learning platform openml.org, and we can download it, together with information on what to do with it, in the form of a task.


If you have a small project and don’t need to parallelize, you might want to just look at the previous blog post called mlr loves OpenML.

The following packages are needed for this:

library("OpenML")
library("mlr")
library("batchtools")
library("ggplot2")

Now we download five tasks from OpenML:

set.seed(2017)

## get useful tasks
task_infos = listOMLTasks(tag = "study_14")

## take a sample of 5 tasks from these
task_ids = sample(task_infos$task.id, size = 5)
tasks = lapply(task_ids, getOMLTask)
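
If you want to check which data sets the sample contains, you can look the sampled IDs up in the task listing (a quick sketch; task.id and name are among the columns returned by listOMLTasks()):

## which data sets did we draw?
task_infos[task_infos$task.id %in% task_ids, c("task.id", "name")]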

In the next step we need to create the so-called registry. This basically creates a folder with a certain subfolder structure.

## create the experiment registry
reg = makeExperimentRegistry(
  file.dir = "parallel_benchmarking_blogpost",
  packages = c("mlr", "OpenML", "party"),
  seed = 123)
names(reg)
reg$cluster.functions

## allow for parallel computing, for other options see ?makeClusterFunctions
reg$cluster.functions = makeClusterFunctionsMulticore()
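
If multicore parallelization is not available on your system (e.g. on Windows), socket-based cluster functions are one alternative (a hedged sketch; adjust ncpus to the number of workers you want):

## alternative backend, e.g. for Windows where multicore is not available
# reg$cluster.functions = makeClusterFunctionsSocket(ncpus = 4)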

Now you should have a new folder in your working directory with the name parallel_benchmarking_blogpost and the following subfolders / files:

parallel_benchmarking_blogpost/
├── algorithms
├── exports
├── external
├── jobs
├── logs
├── problems
├── registry.rds
├── results
└── updates
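
If something goes wrong and you want to start from scratch, you can delete the registry again (a small sketch; removeRegistry() is the batchtools helper for this, and wait = 0 skips the safety delay):

## remove the registry folder to start over (use with care)
removeRegistry(wait = 0)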

In the next step we get to the interesting point. We need to define the problems (the tasks we just downloaded) and the algorithm (the learner we want to run on them).

## add the problem, in our case the tasks from OpenML
for(task in tasks) {
  addProblem(name = paste("omltask", task$task.id, sep = "_"), data = task)
}

##' Function that takes the task (data) and the learner, runs the learner on
##' the task, uploads the run and returns the run ID.
##'
##' @param job required argument for addAlgorithm
##' @param instance required argument for addAlgorithm
##' @param data the task
##' @param learner the string that defines the learner, see listLearners()
runTask_uploadRun = function(job, instance, data, learner, ...) {

  learner = makeLearner(learner, par.vals = list(...))
  run = runTaskMlr(data, learner)

  run_id = uploadOMLRun(run, tag = "test", confirm.upload = FALSE)
  return(run_id)

}

## add the algorithm
addAlgorithm(name = "mlr", fun = runTask_uploadRun)

## which parameter settings of the algorithm do we want to compute
algo.design = list(mlr = expand.grid(
  learner = "classif.ctree",
  testtype = c("Bonferroni", "Univariate"),
  stump = c(FALSE, TRUE),
  stringsAsFactors = FALSE))
algo.design$mlr

addExperiments(algo.designs = algo.design, repls = 1)

## get an overview of what we will submit
summarizeExperiments()
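
Before submitting all jobs, it can be worth running a single one interactively to catch errors early (a sketch; testJob() comes with batchtools and runs the given job in the current R session):

## test a single job interactively before submitting everything
testJob(id = 1)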

Now we can simply run our experiment:

submitJobs()
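
While the jobs are running you can monitor them (a small sketch; getStatus() prints a summary of the job states and waitForJobs() blocks until everything has finished):

## check progress and wait until all jobs are done
getStatus()
waitForJobs()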

As soon as getStatus() tells us that all runs are done, we can collect the results of our experiment from OpenML. To do this we need the run IDs of the runs we uploaded during the experiment, and we also want to add the information on the parameter settings used (getJobPars()).

results0 = reduceResultsDataTable()
job.pars = getJobPars()
results = cbind(run.id = results0$V1, job.pars)
results
##      run.id job.id       problem algorithm       learner   testtype stump
##  1: 1852889      1 omltask_34536       mlr classif.ctree Bonferroni FALSE
##  2: 1852882      2 omltask_34536       mlr classif.ctree Univariate FALSE
##  3: 1852888      3 omltask_34536       mlr classif.ctree Bonferroni  TRUE
##  4: 1852885      4 omltask_34536       mlr classif.ctree Univariate  TRUE
##  5: 1852883      5  omltask_3918       mlr classif.ctree Bonferroni FALSE
##  6: 1852884      6  omltask_3918       mlr classif.ctree Univariate FALSE
##  7: 1852886      7  omltask_3918       mlr classif.ctree Bonferroni  TRUE
##  8: 1852887      8  omltask_3918       mlr classif.ctree Univariate  TRUE
##  9: 1852895      9  omltask_3891       mlr classif.ctree Bonferroni FALSE
## 10: 1852897     10  omltask_3891       mlr classif.ctree Univariate FALSE
## 11: 1852890     11  omltask_3891       mlr classif.ctree Bonferroni  TRUE
## 12: 1852891     12  omltask_3891       mlr classif.ctree Univariate  TRUE
## 13: 1852892     13  omltask_2074       mlr classif.ctree Bonferroni FALSE
## 14: 1852896     14  omltask_2074       mlr classif.ctree Univariate FALSE
## 15: 1852893     15  omltask_2074       mlr classif.ctree Bonferroni  TRUE
## 16: 1852894     16  omltask_2074       mlr classif.ctree Univariate  TRUE
## 17: 1852900     17  omltask_9976       mlr classif.ctree Bonferroni FALSE
## 18: 1852901     18  omltask_9976       mlr classif.ctree Univariate FALSE
## 19: 1852898     19  omltask_9976       mlr classif.ctree Bonferroni  TRUE
## 20: 1852899     20  omltask_9976       mlr classif.ctree Univariate  TRUE

With the run IDs we can now grab the evaluations from OpenML and plot, for example, the parameter settings against the predictive accuracy.

run.evals0 = listOMLRunEvaluations(run.id = results$run.id)
## Downloading from 'http://www.openml.org/api/v1/json/evaluation/list/run/1852889,1852882,1852888,1852885,1852883,1852884,1852886,1852887,1852895,1852897,1852890,1852891,1852892,1852896,1852893,1852894,1852900,1852901,1852898,1852899' to '<mem>'.
run.evals = merge(results, run.evals0, by = "run.id")

ggplot(run.evals, aes(
  x = interaction(testtype, stump),
  y = predictive.accuracy,
  group = data.name,
  color = interaction(task.id, data.name))) +
  geom_point() + geom_line()

We see that the only data set where a stump is good enough is the pc1 data set. For the madelon data set Bonferroni correction helps. For the others it does not seem to matter. You can check out the results online by going to the task websites (e.g. for task 9976 for the madelon data set go to openml.org/t/9976) or the run websites (e.g. openml.org/r/1852889).
