Scope
Increasingly large data sets and search spaces make hyperparameter optimization a time-consuming task. Hyperband (Li et al. 2018) addresses this by approximating the performance of a configuration on a simplified version of the problem, such as a small subset of the training data, just a few training epochs in a neural network, or only a small number of iterations in a gradient-boosting model. After starting randomly sampled configurations, Hyperband iteratively allocates more resources to promising configurations and terminates low-performing ones. This type of optimization is called multi-fidelity optimization. The fidelity parameter is part of the search space and controls the tradeoff between the runtime and accuracy of the performance approximation.
In this post, we will optimize XGBoost and use the number of boosting iterations as the fidelity parameter. This means Hyperband will allocate more boosting iterations to well-performing configurations. Increasing the number of boosting iterations increases the training time and improves performance until the model starts to overfit the training data. It is therefore a suitable fidelity parameter.
We assume that you are already familiar with tuning in the mlr3 ecosystem. If not, you should start with the book chapter on optimization or the Hyperparameter Optimization on the Palmer Penguins Data Set post. This is the first part of the Hyperband series. The second part is Hyperband Series – Data Set Subsampling.
Hyperband
Hyperband is an advancement of the Successive Halving algorithm by Jamieson and Talwalkar (2016). Successive Halving is initialized with the number of starting configurations
Hyperband solves this problem by running Successive Halving with different numbers of starting configurations. The algorithm is initialized with the same parameters as Successive Halving but without the number of starting configurations.
The Hyperband implementation in mlr3hyperband evaluates configurations with the same budget in parallel. This results in all brackets finishing at approximately the same time. The colors in Figure 1 indicate batches that are evaluated in parallel.
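Successive Halving, the building block that Hyperband runs repeatedly, can be illustrated with a minimal base-R sketch. This is not the mlr3hyperband implementation: the helper `successive_halving_sketch()` and the objective `toy_error()` are hypothetical names introduced here, and the objective is a made-up function whose error shrinks as the budget grows. Each stage evaluates all remaining configurations, keeps the best `1 / eta` fraction, and multiplies the budget by `eta`.

```r
set.seed(1)

# Hypothetical toy objective: the error is a fixed per-configuration quality
# plus a term that shrinks as more budget is spent.
toy_error = function(quality, budget) quality + 1 / sqrt(budget)

successive_halving_sketch = function(n, eta, r_min, r_max) {
  # Sample n random configurations (represented only by a quality score here)
  configs = data.frame(id = seq_len(n), quality = runif(n))
  budget = r_min
  history = list()
  repeat {
    # Evaluate every remaining configuration with the current budget
    configs$error = toy_error(configs$quality, budget)
    history[[length(history) + 1]] = data.frame(budget = budget, n = nrow(configs))
    if (budget >= r_max || nrow(configs) == 1) break
    # Keep the best 1 / eta fraction and increase the budget by a factor of eta
    n_keep = max(1, floor(nrow(configs) / eta))
    configs = configs[order(configs$error)[seq_len(n_keep)], ]
    budget = budget * eta
  }
  list(
    best = configs[which.min(configs$error), ],
    history = do.call(rbind, history)
  )
}

res = successive_halving_sketch(n = 8, eta = 2, r_min = 16, r_max = 128)
print(res$history)
print(res$best)
```

With 8 starting configurations, `eta = 2`, and budgets from 16 to 128, the sketch evaluates 8, 4, 2, and 1 configurations at budgets 16, 32, 64, and 128.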
Hyperparameter Optimization
In this practical example, we will optimize the hyperparameters of XGBoost on the Spam data set. We begin by loading the XGBoost learner.
library("mlr3verse")

learner = lrn("classif.xgboost")
Next, we define the search space. The nrounds parameter controls the number of boosting iterations. We set a range from 16 to 128 boosting iterations and tag the parameter with "budget" to identify it as the fidelity parameter. For the other hyperparameters, we take the search space for XGBoost from the Bischl et al. (2021) article. This search space works for a wide range of data sets.
learner$param_set$set_values(
  nrounds           = to_tune(p_int(16, 128, tags = "budget")),
  eta               = to_tune(1e-4, 1, logscale = TRUE),
  max_depth         = to_tune(1, 20),
  colsample_bytree  = to_tune(1e-1, 1),
  colsample_bylevel = to_tune(1e-1, 1),
  lambda            = to_tune(1e-3, 1e3, logscale = TRUE),
  alpha             = to_tune(1e-3, 1e3, logscale = TRUE),
  subsample         = to_tune(1e-1, 1)
)
We construct the tuning instance. We use the "none" terminator because Hyperband terminates itself when all brackets are evaluated.
instance = ti(
  task = tsk("spam"),
  learner = learner,
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  terminator = trm("none")
)

instance
<TuningInstanceSingleCrit>
* State: Not optimized
* Objective: <ObjectiveTuning:classif.xgboost_on_spam>
* Search Space:
                  id    class     lower      upper nlevels
1:           nrounds ParamInt 16.000000 128.000000     113
2:               eta ParamDbl -9.210340   0.000000     Inf
3:         max_depth ParamInt  1.000000  20.000000      20
4:  colsample_bytree ParamDbl  0.100000   1.000000     Inf
5: colsample_bylevel ParamDbl  0.100000   1.000000     Inf
6:            lambda ParamDbl -6.907755   6.907755     Inf
7:             alpha ParamDbl -6.907755   6.907755     Inf
8:         subsample ParamDbl  0.100000   1.000000     Inf
* Terminator: <TerminatorNone>
We load the Hyperband tuner and set eta = 2. Hyperband can start from the beginning when the last bracket is evaluated. We control the number of Hyperband runs with the repetitions argument. The setting repetitions = Inf is useful when a terminator should stop the optimization.
library("mlr3hyperband")

tuner = tnr("hyperband", eta = 2, repetitions = 1)
Because configurations with the same budget are evaluated in parallel, you can think of the evaluation order as going diagonally through Figure 1. Using eta = 2 and a range from 16 to 128 boosting iterations results in the following schedule.
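The schedule for these settings can be reconstructed with a short base-R sketch of the bracket arithmetic from the Hyperband paper. The helper name `hyperband_schedule_sketch()` is hypothetical and the code is an illustration under that assumption, not the package's own implementation. It lists, for every bracket and stage, the budget (boosting iterations) and the number of configurations evaluated.

```r
hyperband_schedule_sketch = function(r_min, r_max, eta) {
  # Number of the largest bracket; the small tolerance guards against
  # floating-point error in the logarithm
  s_max = floor(log(r_max / r_min, base = eta) + 1e-9)
  rows = list()
  for (s in s_max:0) {
    # Number of configurations that start in bracket s
    n_start = ceiling((s_max + 1) * eta^s / (s + 1))
    for (i in 0:s) {
      rows[[length(rows) + 1]] = data.frame(
        bracket = s,
        stage = i,
        budget = r_max * eta^(i - s),  # budget grows by a factor of eta per stage
        n = floor(n_start / eta^i)     # only the best 1 / eta survive each stage
      )
    }
  }
  do.call(rbind, rows)
}

schedule = hyperband_schedule_sketch(r_min = 16, r_max = 128, eta = 2)
print(schedule)

# Total number of evaluations in one Hyperband run
sum(schedule$n)
```

For these settings the sketch yields four brackets with ten bracket and stage combinations and 35 evaluations in total, which agrees with the 35 rows in the archive of the run.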
Now we are ready to start the tuning.
tuner$optimize(instance)
The result of a run is the configuration with the best performance. This does not necessarily have to be a configuration evaluated with the highest budget since we can overfit the data with too many boosting iterations.
instance$result[, .(nrounds, eta, max_depth, colsample_bytree, colsample_bylevel, lambda, alpha, subsample)]
   nrounds        eta max_depth colsample_bytree colsample_bylevel    lambda     alpha subsample
1:     128 -0.4334209        20        0.1574264         0.2886485 -1.333902 -3.394965  0.764349
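The values of eta, lambda, and alpha in this result are reported on the log scale because these parameters were defined with logscale = TRUE. This matches the search space printed earlier, where the lower bound of eta is log(1e-4), about -9.21. A minimal sketch of the back-transformation with base R, using the numbers from the result above:

```r
# Bounds of the log-scaled parameters, as printed for the search space
log(c(eta_lower = 1e-4, eta_upper = 1))
log(c(lambda_lower = 1e-3, lambda_upper = 1e3))

# Back-transform the reported result to the original scale
result_log = c(eta = -0.4334209, lambda = -1.333902, alpha = -3.394965)
result_original = exp(result_log)
print(round(result_original, 4))
```

The learning rate of the best configuration is therefore roughly 0.65 rather than a negative number. In mlr3tuning, the values on the original scale should also be available through instance$result_learner_param_vals.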
The archive of a Hyperband run has the additional columns "bracket" and "stage".
as.data.table(instance$archive)[, .(bracket, stage, classif.ce, eta, max_depth, colsample_bytree)]
    bracket stage classif.ce        eta max_depth colsample_bytree
 1:       3     0 0.06518905 -7.0150617         9        0.2885488
 2:       3     0 0.23859192 -2.5492834        17        0.2052036
 3:       3     0 0.33898305 -9.1773946         6        0.3447989
 4:       3     0 0.07692308 -2.1745616        12        0.1800334
 5:       3     0 0.28617992 -6.5822516         1        0.5811652
---
31:       0     0 0.51760104 -5.2073558         4        0.3352148
32:       3     3 0.05019557 -0.8928307         6        0.4666395
33:       2     2 0.05541069 -8.0585126        15        0.6845195
34:       1     1 0.08018253 -8.3130313         7        0.7767661
35:       1     1 0.08279009 -5.5293970         7        0.5714851
Conclusion
The handling of Hyperband in mlr3tuning is very similar to that of other tuners. We only have to select an additional fidelity parameter and tag it with "budget". We have tried to keep the runtime of the example low. For your optimization, you should use cross-validation and increase the maximum number of boosting rounds. The Bischl et al. (2021) search space suggests 5000 boosting rounds. Check out our next post on Hyperband which uses the size of the training data set as the fidelity parameter.
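A setup along the lines recommended here might look like the following sketch. The choice of three folds is an assumption made for illustration, and only nrounds and eta are shown; the remaining hyperparameters would be added with the same ranges as earlier in the post.

```r
library("mlr3verse")
library("mlr3hyperband")

learner = lrn("classif.xgboost")

# Larger budget range: up to 5000 boosting rounds instead of 128
learner$param_set$set_values(
  nrounds = to_tune(p_int(16, 5000, tags = "budget")),
  eta     = to_tune(1e-4, 1, logscale = TRUE)
)

# Cross-validation instead of a single holdout split
# (3 folds is an illustrative choice, not a recommendation from the post)
instance = ti(
  task = tsk("spam"),
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  terminator = trm("none")
)

tuner = tnr("hyperband", eta = 2, repetitions = 1)
```

Calling tuner$optimize(instance) would then start the run. Expect a much longer runtime than in the example above, since the wider budget range produces more brackets and every configuration is trained once per fold.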
References