Model-based optimization (MBO) is a smart approach to tuning the hyperparameters of machine learning algorithms with less CPU time and manual effort than standard grid search approaches. The core idea behind MBO is to directly evaluate fewer points within a hyperparameter space, and to instead use a “surrogate model” which estimates what the result of your objective function would be in new locations by interpolating (not linearly) between the observed results in a small sample of initial evaluations. Many methods can be used to construct the surrogate model. This post will focus on implementing the bayesian method of Gaussian Process (GP) smoothing (aka “kriging”) which is borrowed from – and particularly well-suited to – spatial applications.
Background
I remember when I started using machine learning methods how time consuming and – even worse – manual it could be to perform a hyperparameter search. The whole benefit of machine learning is that the algorithm should optimize the model-learning task for us, right? The problem of course becomes one of compute resources. Suppose we only have simple brute-force grid search at our disposal. With just one hyperparameter to tune, this approach is practical – we may only need to test 5 candidate values. But as the number of hyperparameters (the "dimension") grows, the number of candidate hyperparameterizations grows exponentially. Suppose instead we have 5 hyperparameters to tune – again using just five points for each dimension would now result in \(5^5 = 3,125\) model evaluations to test all the possible combinations. Sometimes 5 points per dimension is realistic – for example, with some discrete parameters like the maximum tree depth in a random forest. But for something continuous it usually is not, so I am really understating how quickly a grid will blow up, making brute-force approaches impractical.
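To make the combinatorics concrete, here is a quick sketch in R – the parameter names and candidate values are placeholders of my own choosing, not recommendations:

## five hypothetical tuning parameters, five candidate values each
grid <- expand.grid(
  eta              = c(0.01, 0.05, 0.10, 0.20, 0.30),
  max_depth        = c(2, 4, 6, 8, 10),
  min_child_weight = c(1, 5, 10, 50, 100),
  subsample        = c(0.5, 0.6, 0.7, 0.8, 1.0),
  colsample_bytree = c(0.5, 0.6, 0.7, 0.8, 1.0)
)
nrow(grid)  # 5^5 = 3125 candidate combinations for an exhaustive search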
Pretty quickly one goes from the brute-force approach to more involved strategies for grid search. That could mean starting with coarse grids and zooming into specific promising areas with higher-resolution grids in subsequent searches; it could mean iterating between two or three different subsets of the hyperparameters which tend to "move together" – like the learning rate and number of rounds in a GBM. These strategies become highly manual, and frankly it becomes a real effort to keep track of the different runs and results. We don't want to have to think this much and risk making a mistake when tuning algorithms!
Model-Based Optimization
MBO differs from grid search in a couple of ways. First, we search the entire continuous range of a hyperparameter, not a discretized set of points within that range. Second, and more importantly, it is a probabilistic method which uses information from early evaluations to improve the selection of subsequent tests. In this regard it is similar to the low-res/high-res search strategy, but automated. As good Bayesians, we like methods that incorporate prior information to improve later decisions – an intuitive and appealing principle.
As mentioned above, the method for selecting later test points based on the information from the early tests is Gaussian process smoothing, or kriging. One popular application of Gaussian processes is geo-spatial smoothing and regression. We are doing essentially the same thing here, except instead of geographic (lat-long) space, our space is defined by the ranges of a set of hyperparameters. We refer to this as the hyperparameter space, and MBO is going to help us search it for the point which gives the optimal result of a machine learning algorithm.
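Before getting to the tuning problem, here is a minimal one-dimensional sketch of the smoothing idea using DiceKriging. The toy function, sample size, and settings are my own illustration and not part of the tuning workflow below:

library(DiceKriging)

## a handful of "expensive" evaluations of some unknown function
set.seed(1)
x_obs <- seq(0, 1, length.out = 7)
y_obs <- sin(8 * x_obs) + rnorm(7, sd = 0.05)

## fit a GP with a Matern 3/2 covariance and interpolate between the observations
gp <- km(design = data.frame(x = x_obs), response = y_obs,
         covtype = "matern3_2", nugget = 1e-6, control = list(trace = FALSE))
pred <- predict(gp, newdata = data.frame(x = seq(0, 1, length.out = 100)), type = "UK")

## pred$mean is the surrogate estimate; pred$sd quantifies the uncertainty between observed points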
So let's take a look at how Bayes helps us tune machine learning algorithms with some code.
Demonstration
Environment
The main package we need is mlrMBO, which provides the mbo() method for optimizing an arbitrary function sequentially. We also need several helpers – smoof to define the objective function which will be optimized; ParamHelpers to define the parameter space in which we will perform the Bayesian search for a global optimum; and DiceKriging to provide the Gaussian process interpolation ("kriging", as it is known in geostatistics) capability.
We will use the xgboost flavor of GBM as our machine learning methodology to be tuned, but you could adapt what I'm demonstrating here to any algorithm with multiple hyperparameters (or even a single one, if the run-time of a single evaluation is high enough to warrant it). mlrMBO is completely agnostic to your choice of methodology, but the flip side is that a bit of coding setup is required on the data scientist's part (good thing we like coding, and don't like manual work).
library(CASdatasets)
library(dplyr)
library(tibble)
library(magrittr)
library(ggplot2)
library(scatterplot3d)
library(kableExtra)
library(tidyr)
library(mlrMBO)
library(ParamHelpers)
library(DiceKriging)
library(smoof)
library(xgboost)
Data
I'll use my go-to insurance ratemaking dataset for demonstration purposes – the French motor dataset from CASdatasets.
data("freMPL1") data("freMPL2") data("freMPL3") fre_df < - rbind(freMPL1, freMPL2, freMPL3 %>% select(-DeducType)) rm(freMPL1, freMPL2, freMPL3)
Let's take a look at our target variable, ClaimAmount.
gridExtra::grid.arrange(
  fre_df %>%
    filter(ClaimAmount > 0) %>%
    ggplot(aes(x = ClaimAmount)) +
    geom_density() +
    ggtitle("Observed Loss Distribution"),
  fre_df %>%
    filter(ClaimAmount > 0, ClaimAmount < 1.5e4) %>%
    ggplot(aes(x = ClaimAmount)) +
    geom_density() +
    ggtitle("Observed Severity Distribution"),
  nrow = 1
)
We have something like a compound distribution – a probability mass at 0, and some long-tailed distribution of loss dollars for observations with incurred claims. But let's also look beyond the smoothed graphical view.
min(fre_df$ClaimAmount)
## [1] -3407.7
sum(fre_df$ClaimAmount < 0)
## [1] 690
We also appear to have some claims < 0 - perhaps recoveries (vehicle salvage) exceeded payments. For the sake of focusing on the MBO, we will adjust these records by flooring values at 0. I'll also convert some factor columns to numeric types which make more sense for modeling.
fre_df <- fre_df %>%
  mutate(ClaimAmount = case_when(ClaimAmount < 0 ~ 0, TRUE ~ ClaimAmount)) %>%
  mutate(VehMaxSpeed_num = sub(".*-", "", VehMaxSpeed) %>% substr(., 1, 3) %>% as.numeric,
         VehAge_num = sub("*.-", "", VehAge) %>% sub('\\+', '', .) %>% as.numeric,
         VehPrice_num = as.integer(VehPrice)) %>% # the factor levels appear to be ordered so I will use this
  group_by(SocioCateg) %>%                        # high cardinality, will encode as a proportion of total
  mutate(SocioCateg_prop = (sum(n()) / 4) / nrow(.) * 1e5) %>%
  ungroup()

## matrices, no intercept needed and don't forget to exclude post-dictors
fre_mat <- model.matrix(ClaimAmount ~ . -1 -ClaimInd -Exposure -RecordBeg -RecordEnd
                        -VehMaxSpeed -VehPrice -VehAge -SocioCateg,
                        data = fre_df)

## xgb.DMatrix, faster sparse matrix; base_margin == offset
## we use log earned exposure because the xgboost Tweedie implementation uses a log link for the mean
fre_dm <- xgb.DMatrix(data = fre_mat,
                      label = fre_df$ClaimAmount,
                      base_margin = log(fre_df$Exposure))
Objective function for optimizing
To avoid confusion, note that there are two objective functions we could refer to. Statistically, our objective function (aka the loss function) is the negative log-likelihood for an assumed Tweedie-distributed response. The xgboost algorithm will minimize this objective (equivalent to maximizing the likelihood) for a given set of hyperparameters on each run. Our other objective function is the R function defined below – it calls xgb.cv(), runs the learning procedure with cross-validation, stops when the out-of-fold likelihood no longer improves, and returns the best evaluation of the metric (Tweedie negative log-likelihood) on the out-of-fold samples.
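For reference, the metric xgboost reports as tweedie-nloglik@\(\rho\) is (up to additive terms that depend only on \(y\)) the negative Tweedie quasi-log-likelihood for a prediction \(\mu\), label \(y\) and variance power \(\rho\): \( -\ell(y,\mu;\rho) = -\left( y\,\frac{\mu^{1-\rho}}{1-\rho} - \frac{\mu^{2-\rho}}{2-\rho} \right) \). Minimizing this quantity is what "improvement" means in the early-stopping rule just described.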
Note that the function below also defines the hyperparameter space – the set of tuning parameters and the range of values each may take. There are seven standard xgboost tuning parameters in the search space, and I've also added the Tweedie variance "power" parameter as an eighth. This parameter takes a value in (1, 2) for a Poisson-gamma compound distribution, but I first narrowed it down to a smaller range based on a quick profile of the loss distribution (using tweedie::tweedie.profile(), omitted here).
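That omitted profiling step might look roughly like the sketch below – the sample size, formula, and candidate grid are my own assumptions, so treat it as illustrative only:

library(tweedie)

## profile the Tweedie power p on a modest sample to bracket a sensible search range
set.seed(42)
prof_df <- fre_df[sample(nrow(fre_df), 2e4), ]
p_prof <- tweedie.profile(ClaimAmount ~ 1,            # intercept-only is enough to bracket p
                          p.vec = seq(1.5, 1.95, by = 0.05),
                          data = prof_df,
                          link.power = 0,              # log link
                          do.smooth = TRUE, do.plot = FALSE)
p_prof$p.max   # maximum-likelihood estimate of the power; centre the tuning range around it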
# Adapted for Tweedie likelihood from this very good post at https://www.simoncoulombe.com/2019/01/bayesian/
# objective function: we want to minimize the neg log-likelihood by tuning hyperparameters
obj.fun <- makeSingleObjectiveFunction(
  name = "xgb_cv_bayes",
  fn = function(x) {
    set.seed(42)
    cv <- xgb.cv(params = list(
                   booster = "gbtree",
                   eta = x["eta"],
                   max_depth = x["max_depth"],
                   min_child_weight = x["min_child_weight"],
                   gamma = x["gamma"],
                   subsample = x["subsample"],
                   colsample_bytree = x["colsample_bytree"],
                   max_delta_step = x["max_delta_step"],
                   tweedie_variance_power = x["tweedie_variance_power"],
                   objective = 'reg:tweedie',
                   eval_metric = paste0("tweedie-nloglik@", x["tweedie_variance_power"])),
                 data = dm,                  ## must set in global.Env()
                 nround = 7000,              ## Set this large and use early stopping
                 nthread = 26,               ## Adjust based on your machine
                 nfold = 5,
                 prediction = FALSE,
                 showsd = TRUE,
                 early_stopping_rounds = 25, ## If evaluation metric does not improve on out-of-fold sample for 25 rounds, stop
                 verbose = 1,
                 print_every_n = 500)
    cv$evaluation_log %>% pull(4) %>% min    ## column 4 is the eval metric here, tweedie negative log-likelihood
  },
  par.set = makeParamSet(
    makeNumericParam("eta",                    lower = 0.005, upper = 0.01),
    makeNumericParam("gamma",                  lower = 1,     upper = 5),
    makeIntegerParam("max_depth",              lower = 2,     upper = 10),
    makeIntegerParam("min_child_weight",       lower = 300,   upper = 2000),
    makeNumericParam("subsample",              lower = 0.20,  upper = 0.8),
    makeNumericParam("colsample_bytree",       lower = 0.20,  upper = 0.8),
    makeNumericParam("max_delta_step",         lower = 0,     upper = 5),
    makeNumericParam("tweedie_variance_power", lower = 1.75,  upper = 1.85)
  ),
  minimize = TRUE ## negative log likelihood
)
A function which runs the optimization
The core piece here is the call to mbo(). This accepts an initial design – i.e. a set of locations chosen to be "space-filling" within our hyperparameter space (we do not want random generation, which could leave whole areas of the space with no points nearby) – created using ParamHelpers::generateDesign(). The makeMBOControl() method creates an object which simply tells mbo() how many optimization steps to run after the initial design is tested – these are the runs whose locations are proposed probabilistically through Gaussian process smoothing, aka kriging. Finally, I create a plot of the optimization path and return the objects in a list for later use.
The covariance structure used in the Gaussian process is what encodes the Bayesian prior – it defines how much information nearby observed values carry about a new location and how smooth we expect the surface to be. We use a Matern 3/2 kernel – a moderately smooth covariance often used in geospatial applications and well-suited to our own spatial task; it is equivalent to the product of an exponential and a polynomial of degree 1. This is the mbo default for a numerical hyperparameter space. If your hyperparameters include some which are non-numeric (for example, a "method" hyperparameter with a set of methods to choose from), then instead of kriging a random forest is used to estimate the value of the objective function between observed points, and the optimizing proposals are chosen from that surrogate. This would no longer be a strictly "Bayesian" approach, though I think it would still be Bayesian in spirit.
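As a hypothetical illustration of that non-numeric case (the names and ranges are made up, and mlrMBO would normally configure a sensible surrogate for you automatically), a mixed parameter set and a random forest surrogate could be declared like this:

## a mixed numeric + discrete space, for which kriging is not directly applicable
mixed.ps <- makeParamSet(
  makeNumericParam("eta", lower = 0.005, upper = 0.05),
  makeDiscreteParam("booster", values = c("gbtree", "dart"))
)

## a random forest surrogate that can handle the discrete dimension
rf.surrogate <- makeLearner("regr.randomForest", predict.type = "se")
## ...then pass learner = rf.surrogate to mbo(), as in the do_bayes() function below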
The Gaussian process models the result of our objective function as a function of the hyperparameter values, fitted to the initial design samples. For this reason it is referred to (especially in the deep learning community) as a surrogate model – it serves as a cheap surrogate for running another evaluation of our objective function at a new point. For any point not evaluated directly, the estimated/interpolated surface provides an expectation. This benefits us because points that are likely to perform poorly (based on the surrogate model's estimate) are discarded, and we only move on to directly evaluating points in promising regions of the hyperparameter space. Concretely, mlrMBO proposes each new point by optimizing an infill criterion over the surrogate – here the confidence bound, which is why the optimization log below labels the proposals as infill_cb.
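If you want to be explicit about that choice rather than rely on the default, the control object accepts an infill criterion. A sketch, assuming the confidence-bound criterion with its default lambda (expected improvement via makeMBOInfillCritEI() is a common alternative):

## make the infill (acquisition) criterion explicit on the control object
ctrl <- makeMBOControl() %>%
  setMBOControlTermination(iters = 10) %>%
  setMBOControlInfill(crit = makeMBOInfillCritCB())   # or crit = makeMBOInfillCritEI()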
Creating a wrapper function is optional - but to perform multiple runs in an analysis, most of the code here would need to be repeated. To be concise, I write it once so it can be called for subsequent runs (perhaps on other datasets, or if you get back a boundary solution you did not anticipate).
do_bayes <- function(n_design = NULL, opt_steps = NULL, of = obj.fun, seed = 42) {
  set.seed(seed)

  des <- generateDesign(n = n_design,
                        par.set = getParamSet(of),
                        fun = lhs::randomLHS)

  control <- makeMBOControl() %>%
    setMBOControlTermination(., iters = opt_steps)

  ## kriging with a matern(3,2) covariance function is the default surrogate model for numerical domains
  ## but if you wanted to override this you could modify the makeLearner() call below to define your own
  ## GP surrogate model with more or less smoothness, or use an entirely different method
  run <- mbo(fun = of,
             design = des,
             learner = makeLearner("regr.km",
                                   predict.type = "se",
                                   covtype = "matern3_2",
                                   control = list(trace = FALSE)),
             control = control,
             show.info = TRUE)

  ## optimization path plot: best out-of-fold result per round,
  ## distinguishing the initial design from the MBO proposal rounds
  opt_plot <- run$opt.path$env$path %>%
    mutate(Round = row_number()) %>%
    mutate(type = case_when(Round <= n_design ~ "initial design",
                            TRUE ~ "mbo optimization")) %>%
    ggplot(aes(x = Round, y = y, color = type)) +
    geom_point() +
    scale_color_manual(values = c("red", "blue")) +
    labs(title = "mbo run results",
         y = "tweedie negative log-likelihood (out-of-fold)")

  print(run$x)

  return(list(run = run, plot = opt_plot))
}
Number of evaluations
Normally for this problem I would perform more evaluations, in both the initial and optimizing phases. A reasonable rule of thumb is around 5-7 times the number of parameters being tuned for the initial design, and about half of that for the number of optimization steps. You need some points in the space to have something to interpolate between!
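Translated into code, a hedged version of that rule of thumb (not the budget I actually use below, where I keep things small to save time) might be:

## size the design from the parameter set itself
n_params  <- length(getParamSet(obj.fun)$pars)  # 8 parameters here
n_design  <- 6 * n_params                       # initial space-filling evaluations
opt_steps <- ceiling(n_design / 2)              # sequential MBO proposals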
Here's my initial design of 15 points.
des <- generateDesign(n = 15, par.set = getParamSet(obj.fun), fun = lhs::randomLHS)

kable(des, format = "html", digits = 4) %>%
  kable_styling(font_size = 10) %>%
  kable_material_dark()
eta | gamma | max_depth | min_child_weight | subsample | colsample_bytree | max_delta_step | tweedie_variance_power |
---|---|---|---|---|---|---|---|
0.0064 | 1.7337 | 5 | 1259 | 0.3795 | 0.3166 | 4.0727 | 1.8245 |
0.0074 | 3.2956 | 6 | 1073 | 0.4897 | 0.2119 | 1.0432 | 1.7833 |
0.0062 | 1.4628 | 9 | 1451 | 0.7799 | 0.5837 | 3.0967 | 1.8219 |
0.0059 | 4.5578 | 3 | 703 | 0.2135 | 0.3206 | 1.6347 | 1.8346 |
0.0081 | 4.0202 | 8 | 445 | 0.4407 | 0.5495 | 2.2957 | 1.7879 |
0.0067 | 3.5629 | 3 | 330 | 0.4207 | 0.4391 | 0.9399 | 1.7998 |
0.0092 | 4.4374 | 7 | 1937 | 0.6411 | 0.7922 | 3.9315 | 1.7704 |
0.0099 | 1.8265 | 7 | 1595 | 0.2960 | 0.3961 | 2.5070 | 1.8399 |
0.0055 | 2.4464 | 9 | 805 | 0.5314 | 0.2457 | 3.6427 | 1.7624 |
0.0096 | 2.2831 | 10 | 1662 | 0.6873 | 0.6075 | 0.2518 | 1.8457 |
0.0079 | 2.8737 | 4 | 894 | 0.2790 | 0.4954 | 0.4517 | 1.7965 |
0.0087 | 2.8600 | 6 | 599 | 0.5717 | 0.6537 | 4.9145 | 1.7557 |
0.0053 | 1.0082 | 2 | 1863 | 0.6021 | 0.7436 | 2.7048 | 1.7686 |
0.0071 | 4.9817 | 2 | 1380 | 0.3599 | 0.4507 | 4.3572 | 1.8156 |
0.0086 | 3.6847 | 9 | 1124 | 0.7373 | 0.6935 | 1.9759 | 1.8043 |
And here is a view of how well those points fill out three of the eight dimensions.
scatterplot3d(des$eta, des$gamma, des$min_child_weight, type = "h", color = "blue", pch = 16)
We can see large areas with no nearby points – if the global optimum lies there, proposals during the optimization phase may still lead us to find it, but it sure would be helpful to gather some information in those areas up front. Here's a denser design of 42 points – roughly five or six per hyperparameter.
des <- generateDesign(n = 42, par.set = getParamSet(obj.fun), fun = lhs::randomLHS)
scatterplot3d(des$eta, des$gamma, des$min_child_weight, type = "h", color = "blue", pch = 16)
This would take longer to run, but we will rely less heavily on interpolation over long distances during the optimizing phase because we have more information observed through experiments. Choosing your design is about the trade-off between desired accuracy and computational expense. So use as many points in the initial design as you can afford time for (aiming for at least 5-7 per parameter), and maybe half as many for the number of subsequent optimization steps.
Do Bayes!
Now that we are all set up, let's run the procedure using our do_bayes() function above and evaluate the result. As discussed, I recommend sizing your random design and optimization steps according to the size of your hyperparameter space, using 5-7 points per hyperparameter as a rule of thumb. You can also estimate roughly how long a single evaluation takes (which will depend on the hyperparameter values, so treat this as an estimate of the mean time), compare that to the time you can budget, and then choose values that work for you. Here I use 25 total runs – 15 initial evaluations and 10 optimization steps.
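One way to get that mean-time estimate is to time a single cross-validated run before launching the full procedure. The sketch below uses arbitrary mid-range hyperparameter values of my own choosing:

## rough timing of one evaluation at a mid-range point
timing <- system.time(
  xgb.cv(params = list(booster = "gbtree", eta = 0.0075, gamma = 3, max_depth = 6,
                       min_child_weight = 1000, subsample = 0.5, colsample_bytree = 0.5,
                       max_delta_step = 2.5, tweedie_variance_power = 1.8,
                       objective = "reg:tweedie", eval_metric = "tweedie-nloglik@1.8"),
         data = fre_dm, nrounds = 7000, nfold = 5, nthread = 26,
         early_stopping_rounds = 25, verbose = 0)
)
timing["elapsed"] * 25 / 3600  # crude hours estimate for 15 design points + 10 MBO steps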
(Note: The verbose output for each evaluation is shown below for your interest)
dm <- fre_dm runs <- do_bayes(n_design = 15, of = obj.fun, opt_steps = 10, seed = 42) ## Computing y column(s) for design. Not provided. ## [1] train-tweedie-nloglik@1.82975:342.947351+9.068011 test-tweedie-nloglik@1.82975:342.950897+36.281204 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.82975 for early stopping. ## Will train until test_tweedie_nloglik@1.82975 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.82975:54.102369+1.254521 test-tweedie-nloglik@1.82975:54.103011+5.019251 ## [1001] train-tweedie-nloglik@1.82975:17.471831+0.173606 test-tweedie-nloglik@1.82975:17.486700+0.699454 ## [1501] train-tweedie-nloglik@1.82975:15.143893+0.056641 test-tweedie-nloglik@1.82975:15.371196+0.251584 ## [2001] train-tweedie-nloglik@1.82975:14.794773+0.047399 test-tweedie-nloglik@1.82975:15.161293+0.229903 ## [2501] train-tweedie-nloglik@1.82975:14.583392+0.051122 test-tweedie-nloglik@1.82975:15.069248+0.245231 ## [3001] train-tweedie-nloglik@1.82975:14.419826+0.051263 test-tweedie-nloglik@1.82975:14.996046+0.258229 ## [3501] train-tweedie-nloglik@1.82975:14.281899+0.049542 test-tweedie-nloglik@1.82975:14.944954+0.278042 ## [4001] train-tweedie-nloglik@1.82975:14.162422+0.046876 test-tweedie-nloglik@1.82975:14.902480+0.299742 ## [4501] train-tweedie-nloglik@1.82975:14.056329+0.045482 test-tweedie-nloglik@1.82975:14.861813+0.318562 ## Stopping. Best iteration: ## [4597] train-tweedie-nloglik@1.82975:14.037272+0.045602 test-tweedie-nloglik@1.82975:14.852832+0.320272 ## ## [1] train-tweedie-nloglik@1.78053:341.751507+9.223267 test-tweedie-nloglik@1.78053:341.757257+36.900142 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.78053 for early stopping. ## Will train until test_tweedie_nloglik@1.78053 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.78053:20.067214+0.271193 test-tweedie-nloglik@1.78053:20.130995+1.160850 ## [1001] train-tweedie-nloglik@1.78053:15.076158+0.069745 test-tweedie-nloglik@1.78053:15.599540+0.313749 ## [1501] train-tweedie-nloglik@1.78053:14.517137+0.066595 test-tweedie-nloglik@1.78053:15.309150+0.339692 ## [2001] train-tweedie-nloglik@1.78053:14.163563+0.061668 test-tweedie-nloglik@1.78053:15.147976+0.382249 ## [2501] train-tweedie-nloglik@1.78053:13.890958+0.068184 test-tweedie-nloglik@1.78053:15.034968+0.404933 ## [3001] train-tweedie-nloglik@1.78053:13.663806+0.063876 test-tweedie-nloglik@1.78053:14.950204+0.428143 ## [3501] train-tweedie-nloglik@1.78053:13.467250+0.063427 test-tweedie-nloglik@1.78053:14.885284+0.460394 ## [4001] train-tweedie-nloglik@1.78053:13.293906+0.060834 test-tweedie-nloglik@1.78053:14.837956+0.493384 ## Stopping. Best iteration: ## [4190] train-tweedie-nloglik@1.78053:13.231073+0.059948 test-tweedie-nloglik@1.78053:14.818203+0.503775 ## ## [1] train-tweedie-nloglik@1.8352:342.471167+9.032795 test-tweedie-nloglik@1.8352:342.480517+36.144871 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.8352 for early stopping. ## Will train until test_tweedie_nloglik@1.8352 hasn't improved in 25 rounds. 
## ## [501] train-tweedie-nloglik@1.8352:28.357122+0.477149 test-tweedie-nloglik@1.8352:28.895080+2.147141 ## [1001] train-tweedie-nloglik@1.8352:14.938225+0.058305 test-tweedie-nloglik@1.8352:15.482690+0.376248 ## [1501] train-tweedie-nloglik@1.8352:14.236916+0.048579 test-tweedie-nloglik@1.8352:14.920337+0.307540 ## [2001] train-tweedie-nloglik@1.8352:13.941143+0.047951 test-tweedie-nloglik@1.8352:14.796813+0.353247 ## [2501] train-tweedie-nloglik@1.8352:13.719723+0.047280 test-tweedie-nloglik@1.8352:14.722402+0.396149 ## [3001] train-tweedie-nloglik@1.8352:13.536447+0.045626 test-tweedie-nloglik@1.8352:14.666037+0.429260 ## Stopping. Best iteration: ## [3171] train-tweedie-nloglik@1.8352:13.480152+0.046018 test-tweedie-nloglik@1.8352:14.652766+0.442854 ## ## [1] train-tweedie-nloglik@1.80549:340.990906+9.109456 test-tweedie-nloglik@1.80549:340.997974+36.447143 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80549 for early stopping. ## Will train until test_tweedie_nloglik@1.80549 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.80549:17.171271+0.149287 test-tweedie-nloglik@1.80549:17.612757+0.749392 ## [1001] train-tweedie-nloglik@1.80549:14.128140+0.059918 test-tweedie-nloglik@1.80549:14.910039+0.330671 ## [1501] train-tweedie-nloglik@1.80549:13.571417+0.056644 test-tweedie-nloglik@1.80549:14.694386+0.417752 ## Stopping. Best iteration: ## [1847] train-tweedie-nloglik@1.80549:13.292618+0.053193 test-tweedie-nloglik@1.80549:14.617792+0.481956 ## ## [1] train-tweedie-nloglik@1.77453:340.924103+9.220978 test-tweedie-nloglik@1.77453:340.936652+36.904872 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.77453 for early stopping. ## Will train until test_tweedie_nloglik@1.77453 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.77453:15.810031+0.105204 test-tweedie-nloglik@1.77453:16.556760+0.587635 ## [1001] train-tweedie-nloglik@1.77453:13.707083+0.061060 test-tweedie-nloglik@1.77453:14.973095+0.458299 ## [1501] train-tweedie-nloglik@1.77453:13.042243+0.062065 test-tweedie-nloglik@1.77453:14.767002+0.587105 ## Stopping. Best iteration: ## [1888] train-tweedie-nloglik@1.77453:12.664030+0.058594 test-tweedie-nloglik@1.77453:14.693107+0.666969 ## ## [1] train-tweedie-nloglik@1.81211:341.573523+9.099887 test-tweedie-nloglik@1.81211:341.579096+36.409106 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.81211 for early stopping. ## Will train until test_tweedie_nloglik@1.81211 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.81211:21.698555+0.287551 test-tweedie-nloglik@1.81211:21.802563+1.318389 ## [1001] train-tweedie-nloglik@1.81211:15.316608+0.057473 test-tweedie-nloglik@1.81211:15.507609+0.267001 ## [1501] train-tweedie-nloglik@1.81211:15.002261+0.055740 test-tweedie-nloglik@1.81211:15.310712+0.234783 ## [2001] train-tweedie-nloglik@1.81211:14.813123+0.056066 test-tweedie-nloglik@1.81211:15.225331+0.251187 ## [2501] train-tweedie-nloglik@1.81211:14.666221+0.055383 test-tweedie-nloglik@1.81211:15.168456+0.269639 ## [3001] train-tweedie-nloglik@1.81211:14.541463+0.053780 test-tweedie-nloglik@1.81211:15.124595+0.291749 ## [3501] train-tweedie-nloglik@1.81211:14.435767+0.053397 test-tweedie-nloglik@1.81211:15.090134+0.312649 ## [4001] train-tweedie-nloglik@1.81211:14.342162+0.053199 test-tweedie-nloglik@1.81211:15.060593+0.328036 ## [4501] train-tweedie-nloglik@1.81211:14.255980+0.053683 test-tweedie-nloglik@1.81211:15.030732+0.343309 ## Stopping. 
Best iteration: ## [4585] train-tweedie-nloglik@1.81211:14.242503+0.053874 test-tweedie-nloglik@1.81211:15.024390+0.345340 ## ## [1] train-tweedie-nloglik@1.81758:341.854852+9.086251 test-tweedie-nloglik@1.81758:341.863275+36.356180 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.81758 for early stopping. ## Will train until test_tweedie_nloglik@1.81758 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.81758:25.013295+0.397345 test-tweedie-nloglik@1.81758:25.676643+1.808151 ## [1001] train-tweedie-nloglik@1.81758:14.504071+0.055626 test-tweedie-nloglik@1.81758:15.187462+0.380966 ## [1501] train-tweedie-nloglik@1.81758:13.875887+0.049737 test-tweedie-nloglik@1.81758:14.765052+0.371576 ## [2001] train-tweedie-nloglik@1.81758:13.542809+0.049598 test-tweedie-nloglik@1.81758:14.638334+0.423159 ## Stopping. Best iteration: ## [2270] train-tweedie-nloglik@1.81758:13.398557+0.049952 test-tweedie-nloglik@1.81758:14.601097+0.448479 ## ## [1] train-tweedie-nloglik@1.78506:341.348248+9.195115 test-tweedie-nloglik@1.78506:341.357587+36.789846 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.78506 for early stopping. ## Will train until test_tweedie_nloglik@1.78506 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.78506:18.174807+0.195105 test-tweedie-nloglik@1.78506:19.060744+1.022248 ## [1001] train-tweedie-nloglik@1.78506:13.397231+0.063749 test-tweedie-nloglik@1.78506:14.732965+0.451939 ## Stopping. Best iteration: ## [1460] train-tweedie-nloglik@1.78506:12.673491+0.062628 test-tweedie-nloglik@1.78506:14.487809+0.572331 ## ## [1] train-tweedie-nloglik@1.84189:341.976398+8.988454 test-tweedie-nloglik@1.84189:341.984406+35.983697 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.84189 for early stopping. ## Will train until test_tweedie_nloglik@1.84189 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.84189:18.400557+0.165181 test-tweedie-nloglik@1.84189:18.735286+0.850735 ## [1001] train-tweedie-nloglik@1.84189:14.588454+0.047146 test-tweedie-nloglik@1.84189:15.120865+0.271691 ## [1501] train-tweedie-nloglik@1.84189:14.106823+0.046601 test-tweedie-nloglik@1.84189:14.891420+0.316660 ## Stopping. Best iteration: ## [1936] train-tweedie-nloglik@1.84189:13.822906+0.048588 test-tweedie-nloglik@1.84189:14.787369+0.361366 ## ## [1] train-tweedie-nloglik@1.76821:343.904297+9.327030 test-tweedie-nloglik@1.76821:343.909997+37.314570 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.76821 for early stopping. ## Will train until test_tweedie_nloglik@1.76821 hasn't improved in 25 rounds. 
## ## [501] train-tweedie-nloglik@1.76821:153.005731+4.071955 test-tweedie-nloglik@1.76821:153.008331+16.291001 ## [1001] train-tweedie-nloglik@1.76821:70.474121+1.777711 test-tweedie-nloglik@1.76821:70.475147+7.112157 ## [1501] train-tweedie-nloglik@1.76821:35.485048+0.776015 test-tweedie-nloglik@1.76821:35.485494+3.104626 ## [2001] train-tweedie-nloglik@1.76821:21.549418+0.338670 test-tweedie-nloglik@1.76821:21.549704+1.354912 ## [2501] train-tweedie-nloglik@1.76821:17.189798+0.147854 test-tweedie-nloglik@1.76821:17.210399+0.612728 ## [3001] train-tweedie-nloglik@1.76821:16.418288+0.094189 test-tweedie-nloglik@1.76821:16.555264+0.433128 ## [3501] train-tweedie-nloglik@1.76821:16.109250+0.081212 test-tweedie-nloglik@1.76821:16.344176+0.387444 ## [4001] train-tweedie-nloglik@1.76821:15.925080+0.078790 test-tweedie-nloglik@1.76821:16.233377+0.369088 ## [4501] train-tweedie-nloglik@1.76821:15.793141+0.077723 test-tweedie-nloglik@1.76821:16.158620+0.370770 ## [5001] train-tweedie-nloglik@1.76821:15.686279+0.075960 test-tweedie-nloglik@1.76821:16.110775+0.376137 ## [5501] train-tweedie-nloglik@1.76821:15.597672+0.076042 test-tweedie-nloglik@1.76821:16.076505+0.388222 ## Stopping. Best iteration: ## [5707] train-tweedie-nloglik@1.76821:15.563175+0.074569 test-tweedie-nloglik@1.76821:16.059571+0.391720 ## ## [1] train-tweedie-nloglik@1.7599:342.328162+9.312946 test-tweedie-nloglik@1.7599:342.334070+37.259847 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.7599 for early stopping. ## Will train until test_tweedie_nloglik@1.7599 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.7599:21.034394+0.261867 test-tweedie-nloglik@1.7599:21.110715+1.265011 ## [1001] train-tweedie-nloglik@1.7599:16.515162+0.077981 test-tweedie-nloglik@1.7599:16.665819+0.360601 ## [1501] train-tweedie-nloglik@1.7599:16.270615+0.075134 test-tweedie-nloglik@1.7599:16.505939+0.319116 ## [2001] train-tweedie-nloglik@1.7599:16.125761+0.075163 test-tweedie-nloglik@1.7599:16.432678+0.320781 ## [2501] train-tweedie-nloglik@1.7599:16.014767+0.073620 test-tweedie-nloglik@1.7599:16.385424+0.331610 ## [3001] train-tweedie-nloglik@1.7599:15.920065+0.071297 test-tweedie-nloglik@1.7599:16.346298+0.346168 ## Stopping. Best iteration: ## [3228] train-tweedie-nloglik@1.7599:15.881553+0.070426 test-tweedie-nloglik@1.7599:16.329813+0.351423 ## ## [1] train-tweedie-nloglik@1.80226:341.820270+9.143934 test-tweedie-nloglik@1.80226:341.825208+36.586124 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80226 for early stopping. ## Will train until test_tweedie_nloglik@1.80226 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.80226:26.608221+0.443646 test-tweedie-nloglik@1.80226:26.664685+1.977213 ## [1001] train-tweedie-nloglik@1.80226:15.824172+0.068891 test-tweedie-nloglik@1.80226:15.918101+0.371526 ## [1501] train-tweedie-nloglik@1.80226:15.499650+0.061893 test-tweedie-nloglik@1.80226:15.647616+0.269262 ## [2001] train-tweedie-nloglik@1.80226:15.374731+0.061671 test-tweedie-nloglik@1.80226:15.573182+0.258539 ## [2501] train-tweedie-nloglik@1.80226:15.289750+0.061845 test-tweedie-nloglik@1.80226:15.534944+0.261417 ## Stopping. Best iteration: ## [2579] train-tweedie-nloglik@1.80226:15.278433+0.061600 test-tweedie-nloglik@1.80226:15.529791+0.262305 ## ## [1] train-tweedie-nloglik@1.84974:343.394391+8.997226 test-tweedie-nloglik@1.84974:343.399872+36.001197 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.84974 for early stopping. 
## Will train until test_tweedie_nloglik@1.84974 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.84974:35.638326+0.676807 test-tweedie-nloglik@1.84974:35.855040+2.870666 ## [1001] train-tweedie-nloglik@1.84974:15.927161+0.074464 test-tweedie-nloglik@1.84974:16.216283+0.425658 ## [1501] train-tweedie-nloglik@1.84974:14.851506+0.047610 test-tweedie-nloglik@1.84974:15.276635+0.237773 ## [2001] train-tweedie-nloglik@1.84974:14.524470+0.041982 test-tweedie-nloglik@1.84974:15.112846+0.258227 ## [2501] train-tweedie-nloglik@1.84974:14.292380+0.045034 test-tweedie-nloglik@1.84974:15.027395+0.291822 ## [3001] train-tweedie-nloglik@1.84974:14.103252+0.044065 test-tweedie-nloglik@1.84974:14.970938+0.334465 ## Stopping. Best iteration: ## [3384] train-tweedie-nloglik@1.84974:13.977265+0.046385 test-tweedie-nloglik@1.84974:14.943196+0.365468 ## ## [1] train-tweedie-nloglik@1.79106:341.213971+9.169863 test-tweedie-nloglik@1.79106:341.220727+36.687101 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79106 for early stopping. ## Will train until test_tweedie_nloglik@1.79106 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.79106:17.835510+0.165372 test-tweedie-nloglik@1.79106:18.048213+0.847644 ## [1001] train-tweedie-nloglik@1.79106:14.868611+0.068062 test-tweedie-nloglik@1.79106:15.391133+0.315593 ## [1501] train-tweedie-nloglik@1.79106:14.301947+0.058922 test-tweedie-nloglik@1.79106:15.099987+0.329528 ## [2001] train-tweedie-nloglik@1.79106:13.923899+0.058205 test-tweedie-nloglik@1.79106:14.931884+0.371135 ## [2501] train-tweedie-nloglik@1.79106:13.617700+0.062158 test-tweedie-nloglik@1.79106:14.828744+0.421567 ## [3001] train-tweedie-nloglik@1.79106:13.364167+0.063944 test-tweedie-nloglik@1.79106:14.745604+0.455195 ## Stopping. Best iteration: ## [3159] train-tweedie-nloglik@1.79106:13.292696+0.064719 test-tweedie-nloglik@1.79106:14.719766+0.463486 ## ## [1] train-tweedie-nloglik@1.75281:343.292597+9.363249 test-tweedie-nloglik@1.75281:343.301770+37.470216 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.75281 for early stopping. ## Will train until test_tweedie_nloglik@1.75281 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.75281:33.808357+0.652361 test-tweedie-nloglik@1.75281:34.678754+3.054606 ## [1001] train-tweedie-nloglik@1.75281:16.164603+0.097345 test-tweedie-nloglik@1.75281:17.212687+0.750414 ## [1501] train-tweedie-nloglik@1.75281:14.603756+0.074138 test-tweedie-nloglik@1.75281:15.953423+0.574503 ## [2001] train-tweedie-nloglik@1.75281:13.968725+0.071299 test-tweedie-nloglik@1.75281:15.654203+0.618075 ## [2501] train-tweedie-nloglik@1.75281:13.520048+0.068583 test-tweedie-nloglik@1.75281:15.482204+0.654811 ## Stopping. 
Best iteration: ## [2566] train-tweedie-nloglik@1.75281:13.469965+0.068246 test-tweedie-nloglik@1.75281:15.476034+0.661855 ## [mbo] 0: eta=0.00745; gamma=3.51; max_depth=4; min_child_weight=1387; subsample=0.629; colsample_bytree=0.376; max_delta_step=0.64; tweedie_variance_power=1.83 : y = 14.9 : 365.7 secs : initdesign ## [mbo] 0: eta=0.00941; gamma=1.73; max_depth=6; min_child_weight=1729; subsample=0.665; colsample_bytree=0.277; max_delta_step=0.95; tweedie_variance_power=1.78 : y = 14.8 : 401.0 secs : initdesign ## [mbo] 0: eta=0.00594; gamma=2.63; max_depth=9; min_child_weight=1844; subsample=0.406; colsample_bytree=0.723; max_delta_step=4.15; tweedie_variance_power=1.84 : y = 14.7 : 504.6 secs : initdesign ## [mbo] 0: eta=0.00858; gamma=4.15; max_depth=6; min_child_weight=974; subsample=0.772; colsample_bytree=0.686; max_delta_step=3.91; tweedie_variance_power=1.81 : y = 14.6 : 199.2 secs : initdesign ## [mbo] 0: eta=0.00993; gamma=4.57; max_depth=8; min_child_weight=1068; subsample=0.688; colsample_bytree=0.441; max_delta_step=2.05; tweedie_variance_power=1.77 : y = 14.7 : 239.7 secs : initdesign ## [mbo] 0: eta=0.00698; gamma=1.95; max_depth=3; min_child_weight=1116; subsample=0.581; colsample_bytree=0.647; max_delta_step=4.48; tweedie_variance_power=1.81 : y = 15 : 366.1 secs : initdesign ## [mbo] 0: eta=0.00637; gamma=3.72; max_depth=10; min_child_weight=1934; subsample=0.457; colsample_bytree=0.779; max_delta_step=2.48; tweedie_variance_power=1.82 : y = 14.6 : 415.2 secs : initdesign ## [mbo] 0: eta=0.00794; gamma=2.22; max_depth=9; min_child_weight=840; subsample=0.757; colsample_bytree=0.515; max_delta_step=3.11; tweedie_variance_power=1.79 : y = 14.5 : 206.0 secs : initdesign ## [mbo] 0: eta=0.00825; gamma=4.93; max_depth=7; min_child_weight=616; subsample=0.325; colsample_bytree=0.292; max_delta_step=3.4; tweedie_variance_power=1.84 : y = 14.8 : 213.4 secs : initdesign ## [mbo] 0: eta=0.00871; gamma=2.37; max_depth=3; min_child_weight=1553; subsample=0.221; colsample_bytree=0.324; max_delta_step=0.248; tweedie_variance_power=1.77 : y = 16.1 : 359.8 secs : initdesign ## [mbo] 0: eta=0.0073; gamma=4.34; max_depth=2; min_child_weight=1484; subsample=0.526; colsample_bytree=0.618; max_delta_step=1.99; tweedie_variance_power=1.76 : y = 16.3 : 233.8 secs : initdesign ## [mbo] 0: eta=0.00608; gamma=1.07; max_depth=2; min_child_weight=1280; subsample=0.293; colsample_bytree=0.402; max_delta_step=2.82; tweedie_variance_power=1.8 : y = 15.5 : 196.2 secs : initdesign ## [mbo] 0: eta=0.00519; gamma=3.23; max_depth=5; min_child_weight=514; subsample=0.5; colsample_bytree=0.527; max_delta_step=1.51; tweedie_variance_power=1.85 : y = 14.9 : 380.0 secs : initdesign ## [mbo] 0: eta=0.0091; gamma=1.49; max_depth=6; min_child_weight=321; subsample=0.362; colsample_bytree=0.21; max_delta_step=1.12; tweedie_variance_power=1.79 : y = 14.7 : 300.6 secs : initdesign ## [mbo] 0: eta=0.00543; gamma=2.87; max_depth=10; min_child_weight=731; subsample=0.243; colsample_bytree=0.595; max_delta_step=4.85; tweedie_variance_power=1.75 : y = 15.5 : 350.0 secs : initdesign ## [1] train-tweedie-nloglik@1.81205:340.715326+9.076589 test-tweedie-nloglik@1.81205:340.723718+36.318898 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.81205 for early stopping. ## Will train until test_tweedie_nloglik@1.81205 hasn't improved in 25 rounds. 
## ## [501] train-tweedie-nloglik@1.81205:15.918866+0.102574 test-tweedie-nloglik@1.81205:16.461088+0.571469 ## [1001] train-tweedie-nloglik@1.81205:13.785034+0.049437 test-tweedie-nloglik@1.81205:14.672956+0.330847 ## [1501] train-tweedie-nloglik@1.81205:13.252387+0.053019 test-tweedie-nloglik@1.81205:14.476279+0.413990 ## [2001] train-tweedie-nloglik@1.81205:12.879603+0.050442 test-tweedie-nloglik@1.81205:14.391401+0.501251 ## Stopping. Best iteration: ## [2023] train-tweedie-nloglik@1.81205:12.864722+0.050294 test-tweedie-nloglik@1.81205:14.387596+0.505142 ## [mbo] 1: eta=0.00954; gamma=1.9; max_depth=8; min_child_weight=1225; subsample=0.731; colsample_bytree=0.349; max_delta_step=2.2; tweedie_variance_power=1.81 : y = 14.4 : 249.3 secs : infill_cb ## [1] train-tweedie-nloglik@1.80087:343.092389+9.184359 test-tweedie-nloglik@1.80087:343.097986+36.745005 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80087 for early stopping. ## Will train until test_tweedie_nloglik@1.80087 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.80087:113.222479+2.921018 test-tweedie-nloglik@1.80087:113.224371+11.686362 ## [1001] train-tweedie-nloglik@1.80087:41.464624+0.928879 test-tweedie-nloglik@1.80087:41.465188+3.716334 ## [1501] train-tweedie-nloglik@1.80087:20.439788+0.295234 test-tweedie-nloglik@1.80087:20.440313+1.181873 ## [2001] train-tweedie-nloglik@1.80087:15.711758+0.113962 test-tweedie-nloglik@1.80087:15.898547+0.431068 ## [2501] train-tweedie-nloglik@1.80087:14.325745+0.070814 test-tweedie-nloglik@1.80087:15.013139+0.345294 ## [3001] train-tweedie-nloglik@1.80087:13.676998+0.056270 test-tweedie-nloglik@1.80087:14.686207+0.365084 ## [3501] train-tweedie-nloglik@1.80087:13.235437+0.057295 test-tweedie-nloglik@1.80087:14.506423+0.408045 ## [4001] train-tweedie-nloglik@1.80087:12.878366+0.056469 test-tweedie-nloglik@1.80087:14.409738+0.476704 ## Stopping. Best iteration: ## [4439] train-tweedie-nloglik@1.80087:12.602206+0.057227 test-tweedie-nloglik@1.80087:14.353508+0.526717 ## [mbo] 2: eta=0.00911; gamma=1.77; max_depth=8; min_child_weight=868; subsample=0.791; colsample_bytree=0.549; max_delta_step=0.314; tweedie_variance_power=1.8 : y = 14.4 : 422.2 secs : infill_cb ## [1] train-tweedie-nloglik@1.79891:341.616339+9.151591 test-tweedie-nloglik@1.79891:341.621362+36.614048 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79891 for early stopping. ## Will train until test_tweedie_nloglik@1.79891 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.79891:21.260748+0.323534 test-tweedie-nloglik@1.79891:21.276269+1.295378 ## [1001] train-tweedie-nloglik@1.79891:14.082018+0.051754 test-tweedie-nloglik@1.79891:14.875722+0.333678 ## [1501] train-tweedie-nloglik@1.79891:13.233551+0.051804 test-tweedie-nloglik@1.79891:14.484654+0.410128 ## [2001] train-tweedie-nloglik@1.79891:12.712292+0.049849 test-tweedie-nloglik@1.79891:14.331855+0.503634 ## Stopping. Best iteration: ## [2290] train-tweedie-nloglik@1.79891:12.472694+0.053111 test-tweedie-nloglik@1.79891:14.298177+0.575406 ## [mbo] 3: eta=0.00942; gamma=3.15; max_depth=9; min_child_weight=843; subsample=0.783; colsample_bytree=0.271; max_delta_step=0.888; tweedie_variance_power=1.8 : y = 14.3 : 252.1 secs : infill_cb ## [1] train-tweedie-nloglik@1.8023:340.514740+9.107514 test-tweedie-nloglik@1.8023:340.523969+36.446491 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.8023 for early stopping. 
## Will train until test_tweedie_nloglik@1.8023 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.8023:14.909694+0.100825 test-tweedie-nloglik@1.8023:16.029717+0.614204 ## [1001] train-tweedie-nloglik@1.8023:12.276371+0.058075 test-tweedie-nloglik@1.8023:14.299033+0.636373 ## Stopping. Best iteration: ## [1016] train-tweedie-nloglik@1.8023:12.243908+0.059515 test-tweedie-nloglik@1.8023:14.297354+0.646123 ## [mbo] 4: eta=0.00999; gamma=2.8; max_depth=9; min_child_weight=342; subsample=0.683; colsample_bytree=0.49; max_delta_step=1.58; tweedie_variance_power=1.8 : y = 14.3 : 155.0 secs : infill_cb ## [1] train-tweedie-nloglik@1.79965:340.693542+9.123636 test-tweedie-nloglik@1.79965:340.698126+36.502615 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79965 for early stopping. ## Will train until test_tweedie_nloglik@1.79965 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.79965:15.513652+0.107246 test-tweedie-nloglik@1.79965:16.291323+0.599968 ## [1001] train-tweedie-nloglik@1.79965:13.127623+0.057521 test-tweedie-nloglik@1.79965:14.478948+0.468251 ## [1501] train-tweedie-nloglik@1.79965:12.450628+0.057462 test-tweedie-nloglik@1.79965:14.320771+0.610130 ## Stopping. Best iteration: ## [1521] train-tweedie-nloglik@1.79965:12.427706+0.057687 test-tweedie-nloglik@1.79965:14.317007+0.616377 ## [mbo] 5: eta=0.01; gamma=2.63; max_depth=10; min_child_weight=1081; subsample=0.698; colsample_bytree=0.493; max_delta_step=1.18; tweedie_variance_power=1.8 : y = 14.3 : 205.2 secs : infill_cb ## [1] train-tweedie-nloglik@1.79543:340.537445+9.134006 test-tweedie-nloglik@1.79543:340.552142+36.558473 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79543 for early stopping. ## Will train until test_tweedie_nloglik@1.79543 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.79543:15.423422+0.099615 test-tweedie-nloglik@1.79543:16.142575+0.549508 ## [1001] train-tweedie-nloglik@1.79543:13.201343+0.056114 test-tweedie-nloglik@1.79543:14.503582+0.419961 ## [1501] train-tweedie-nloglik@1.79543:12.487963+0.057690 test-tweedie-nloglik@1.79543:14.305727+0.556182 ## Stopping. Best iteration: ## [1697] train-tweedie-nloglik@1.79543:12.274374+0.058514 test-tweedie-nloglik@1.79543:14.272447+0.606904 ## [mbo] 6: eta=0.00999; gamma=1.42; max_depth=9; min_child_weight=584; subsample=0.769; colsample_bytree=0.272; max_delta_step=2.29; tweedie_variance_power=1.8 : y = 14.3 : 217.2 secs : infill_cb ## [1] train-tweedie-nloglik@1.79417:341.503680+9.166254 test-tweedie-nloglik@1.79417:341.508606+36.672458 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79417 for early stopping. ## Will train until test_tweedie_nloglik@1.79417 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.79417:19.509591+0.261787 test-tweedie-nloglik@1.79417:19.639546+1.096469 ## [1001] train-tweedie-nloglik@1.79417:13.144418+0.052379 test-tweedie-nloglik@1.79417:14.496626+0.423258 ## [1501] train-tweedie-nloglik@1.79417:12.073966+0.057633 test-tweedie-nloglik@1.79417:14.226620+0.590800 ## Stopping. 
Best iteration: ## [1514] train-tweedie-nloglik@1.79417:12.054614+0.057727 test-tweedie-nloglik@1.79417:14.224718+0.595632 ## [mbo] 7: eta=0.00991; gamma=1.58; max_depth=10; min_child_weight=453; subsample=0.77; colsample_bytree=0.352; max_delta_step=0.902; tweedie_variance_power=1.79 : y = 14.2 : 182.2 secs : infill_cb ## [1] train-tweedie-nloglik@1.80093:342.887091+9.178540 test-tweedie-nloglik@1.80093:342.892096+36.721835 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80093 for early stopping. ## Will train until test_tweedie_nloglik@1.80093 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.80093:85.321825+2.153553 test-tweedie-nloglik@1.80093:85.322955+8.616076 ## [1001] train-tweedie-nloglik@1.80093:26.996887+0.505121 test-tweedie-nloglik@1.80093:26.997190+2.021150 ## [1501] train-tweedie-nloglik@1.80093:16.226591+0.120683 test-tweedie-nloglik@1.80093:16.298045+0.500625 ## [2001] train-tweedie-nloglik@1.80093:14.522975+0.055452 test-tweedie-nloglik@1.80093:15.077755+0.326529 ## [2501] train-tweedie-nloglik@1.80093:13.705078+0.062120 test-tweedie-nloglik@1.80093:14.643362+0.337972 ## [3001] train-tweedie-nloglik@1.80093:13.174421+0.062708 test-tweedie-nloglik@1.80093:14.423376+0.381127 ## [3501] train-tweedie-nloglik@1.80093:12.755801+0.063702 test-tweedie-nloglik@1.80093:14.289940+0.439437 ## Stopping. Best iteration: ## [3829] train-tweedie-nloglik@1.80093:12.516629+0.066593 test-tweedie-nloglik@1.80093:14.231650+0.485118 ## [mbo] 8: eta=0.00908; gamma=1.13; max_depth=10; min_child_weight=301; subsample=0.709; colsample_bytree=0.2; max_delta_step=0.399; tweedie_variance_power=1.8 : y = 14.2 : 391.0 secs : infill_cb ## [1] train-tweedie-nloglik@1.80353:340.954138+9.115118 test-tweedie-nloglik@1.80353:340.965655+36.477642 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80353 for early stopping. ## Will train until test_tweedie_nloglik@1.80353 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.80353:16.280642+0.145262 test-tweedie-nloglik@1.80353:17.426506+0.818395 ## [1001] train-tweedie-nloglik@1.80353:12.257854+0.055505 test-tweedie-nloglik@1.80353:14.196385+0.579480 ## Stopping. Best iteration: ## [1075] train-tweedie-nloglik@1.80353:12.083582+0.060371 test-tweedie-nloglik@1.80353:14.174686+0.630563 ## [mbo] 9: eta=0.00868; gamma=1.42; max_depth=10; min_child_weight=302; subsample=0.799; colsample_bytree=0.376; max_delta_step=1.39; tweedie_variance_power=1.8 : y = 14.2 : 165.5 secs : infill_cb ## [1] train-tweedie-nloglik@1.8168:341.102509+9.068973 test-tweedie-nloglik@1.8168:341.113751+36.288728 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.8168 for early stopping. ## Will train until test_tweedie_nloglik@1.8168 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.8168:16.397836+0.150262 test-tweedie-nloglik@1.8168:17.661101+0.874430 ## [1001] train-tweedie-nloglik@1.8168:12.316689+0.061103 test-tweedie-nloglik@1.8168:14.303786+0.625352 ## Stopping. 
Best iteration: ## [1019] train-tweedie-nloglik@1.8168:12.277460+0.060180 test-tweedie-nloglik@1.8168:14.299007+0.638777 ## [mbo] 10: eta=0.00858; gamma=1.26; max_depth=10; min_child_weight=467; subsample=0.798; colsample_bytree=0.641; max_delta_step=1.38; tweedie_variance_power=1.82 : y = 14.3 : 170.7 secs : infill_cb ## $eta ## [1] 0.008678109 ## ## $gamma ## [1] 1.423488 ## ## $max_depth ## [1] 10 ## ## $min_child_weight ## [1] 302 ## ## $subsample ## [1] 0.7986374 ## ## $colsample_bytree ## [1] 0.3759013 ## ## $max_delta_step ## [1] 1.391834 ## ## $tweedie_variance_power ## [1] 1.803532
Diagnostics and evaluating result
Results in hand, we want to check some diagnostics starting with the objective function evaluation for all of the runs.
runs$plot
The plot above (see our do_bayes() function for how we extracted this info) shows the best test evaluation result for each run – the initial design is colored red and the optimization runs are in blue. The hyperparameter values behind the blue points are the proposals chosen through kriging. We can see that none of the random evaluations gave a top result, but together they provided solid information on where in the hyperparameter space to focus the search in order to optimize the algorithm – every subsequent proposal was better than all of the random ones.
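If you prefer to inspect the path numerically rather than through the stored plot, the optimization path converts to a data frame – a small sketch, where dob records the round at which each point was proposed (0 marks the initial design):

path_df <- as.data.frame(runs$run$opt.path)
path_df %>%
  mutate(phase = ifelse(dob == 0, "initial design", "mbo proposal")) %>%
  group_by(phase) %>%
  summarise(n = n(), best_y = min(y))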
The "default" viz below comes from the plot() S3 method for MBOSingleObjResult and shows a few useful things, although the formatting could be improved. Most importantly, the top-left panel shows the "scaled" values of each hyperparameter for every run. Use this to confirm that the recommended solution does not include any hyperparameter at the boundary of the range tested – if it does, expand the range of that parameter in your objective function and re-run. In my example below, you can see that the optimal solution (the green line) includes a max_depth at the maximum of 10, and a min_child_weight at or near the minimum (300) of the range I allowed. Unless I were intentionally using these bounds to limit model complexity and improve generalization, I should expand the ranges of these hyperparameters and run again.
class(runs$run) %>% print
## [1] "MBOSingleObjResult" "MBOResult"

plot(runs$run)
If you print the result object you can confirm the recommended solution included these boundary values:
print(runs$run) ## Recommended parameters: ## eta=0.00868; gamma=1.42; max_depth=10; min_child_weight=302; subsample=0.799; colsample_bytree=0.376; max_delta_step=1.39; tweedie_variance_power=1.8 ## Objective: y = 14.175 ## ## Optimization path ## 15 + 10 entries in total, displaying last 10 (or less): ## eta gamma max_depth min_child_weight subsample colsample_bytree max_delta_step tweedie_variance_power ## 16 0.009543908 1.904103 8 1225 0.7305367 0.3488211 2.1958369 1.812051 ## 17 0.009114593 1.773073 8 868 0.7911231 0.5494522 0.3138641 1.800874 ## 18 0.009421640 3.150876 9 843 0.7830561 0.2705329 0.8878380 1.798907 ## 19 0.009990520 2.801814 9 342 0.6832292 0.4898784 1.5775174 1.802298 ## 20 0.009996788 2.626826 10 1081 0.6979587 0.4934878 1.1786900 1.799649 ## 21 0.009986297 1.419945 9 584 0.7690373 0.2716777 2.2940200 1.795434 ## 22 0.009913078 1.579795 10 453 0.7699498 0.3517838 0.9017585 1.794166 ## 23 0.009082225 1.130765 10 301 0.7085142 0.2004467 0.3985858 1.800927 ## 24 0.008678109 1.423488 10 302 0.7986374 0.3759013 1.3918335 1.803532 ## 25 0.008579770 1.258053 10 467 0.7982686 0.6407762 1.3812183 1.816803 ## y dob eol error.message exec.time cb error.model train.time prop.type propose.time se mean ## 16 14.38760 1 NA 249.294 14.08443 0.417 infill_cb 1.041 0.26275330 14.34718 ## 17 14.35351 2 NA 422.219 14.20226 0.150 infill_cb 1.771 0.21492201 14.41718 ## 18 14.29818 3 NA 252.086 14.20628 0.123 infill_cb 1.537 0.19145391 14.39773 ## 19 14.29735 4 NA 155.010 14.19001 0.174 infill_cb 1.474 0.15554575 14.34556 ## 20 14.31701 5 NA 205.205 14.19583 0.333 infill_cb 1.299 0.15011713 14.34595 ## 21 14.27245 6 NA 217.224 14.20543 0.122 infill_cb 1.356 0.12818859 14.33362 ## 22 14.22472 7 NA 182.158 14.19185 0.101 infill_cb 1.510 0.10787970 14.29973 ## 23 14.23165 8 NA 390.982 14.16640 0.180 infill_cb 1.438 0.12182032 14.28822 ## 24 14.17469 9 NA 165.500 14.16323 0.113 infill_cb 1.142 0.08824657 14.25147 ## 25 14.29901 10 NA 170.738 14.12875 0.097 infill_cb 1.179 0.14723428 14.27598 ## lambda ## 16 1 ## 17 1 ## 18 1 ## 19 1 ## 20 1 ## 21 1 ## 22 1 ## 23 1 ## 24 1 ## 25 1
Using the result
Assuming we are happy with the result, we should then have what we need to proceed with model training. However, since xgb.cv() uses early stopping and nrounds is therefore not a tuning parameter, we did not capture the optimal number of rounds in our MBO result. So we need to run one more evaluation the old-fashioned way, calling xgb.cv() directly with the best hyperparameters we found.
best.params <- runs$run$x
print(best.params)
## $eta
## [1] 0.008678109
##
## $gamma
## [1] 1.423488
##
## $max_depth
## [1] 10
##
## $min_child_weight
## [1] 302
##
## $subsample
## [1] 0.7986374
##
## $colsample_bytree
## [1] 0.3759013
##
## $max_delta_step
## [1] 1.391834
##
## $tweedie_variance_power
## [1] 1.803532
We add the model parameters which were fixed during optimization to this list:
best.params$booster <- "gbtree" best.params$objective <- "reg:tweedie"
Now we cross-validate the number of rounds to use, fixing our best hyperparameters:
optimal.cv <- xgb.cv(params = best.params,
                     data = fre_dm,
                     nrounds = 6000,
                     nthread = 26,
                     nfold = 5,
                     prediction = FALSE,
                     showsd = TRUE,
                     early_stopping_rounds = 25,
                     verbose = 1,
                     print_every_n = 500)
## [1] train-tweedie-nloglik@1.80353:340.950989+7.692406 test-tweedie-nloglik@1.80353:340.929736+30.743659
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80353 for early stopping.
## Will train until test_tweedie_nloglik@1.80353 hasn't improved in 25 rounds.
##
## [501] train-tweedie-nloglik@1.80353:16.281491+0.131605 test-tweedie-nloglik@1.80353:17.436350+0.607008
## [1001] train-tweedie-nloglik@1.80353:12.254035+0.041286 test-tweedie-nloglik@1.80353:14.250821+0.548260
## Stopping. Best iteration:
## [1060] train-tweedie-nloglik@1.80353:12.114213+0.035136 test-tweedie-nloglik@1.80353:14.242658+0.593849
Obtain the best number of rounds…
best.params$nrounds <- optimal.cv$best_ntreelimit
best.params[[11]] %>% print
## [1] 1060
…and finally, train the final learner:
final.model <- xgboost(params = best.params[-11], ## do not include nrounds here
                       data = fre_dm,
                       nrounds = best.params$nrounds,
                       verbose = 1,
                       print_every_n = 500)
## [1] train-tweedie-nloglik@1.80353:340.952393
## [501] train-tweedie-nloglik@1.80353:16.252325
## [1001] train-tweedie-nloglik@1.80353:12.189503
## [1060] train-tweedie-nloglik@1.80353:12.041411

xgb.importance(model = final.model) %>%
  xgb.plot.importance()
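As one last optional sanity check – my own addition, not part of the original workflow – you could score the training matrix and confirm that total predicted loss is in the right ballpark relative to observed loss. Because the log(Exposure) base margin is already baked into fre_dm, the predictions come out on the claim-amount scale:

## aggregate actual-vs-expected check on the training data
preds <- predict(final.model, newdata = fre_dm)
c(predicted_total = sum(preds),
  observed_total  = sum(fre_df$ClaimAmount))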
Conclusion
Bayesian optimization is a smart approach for tuning more complex learning algorithms with many hyperparameters when compute resources are slowing down the analysis. It is commonly used in deep learning, but it can also be useful when working with machine learning algorithms like GBMs (shown here), random forests, and support vector machines – really anything that would take too long to tune with a naive grid search. Even if you are working with a relatively simple algorithm – say a lasso regression, which involves a single hyperparameter \(\lambda\) to control the shrinkage/penalty – you may have only a small amount of compute available. If so, it could still make sense to use MBO to cut down the number of evaluations needed to find the optimum.