Model-based optimization (MBO) is a smart approach to tuning the hyperparameters of machine learning algorithms with less CPU time and manual effort than standard grid search approaches. The core idea behind MBO is to directly evaluate fewer points within a hyperparameter space, and to instead use a “surrogate model” which estimates what the result of your objective function would be in new locations by interpolating (not linearly) between the observed results in a small sample of initial evaluations. Many methods can be used to construct the surrogate model. This post will focus on implementing the bayesian method of Gaussian Process (GP) smoothing (aka “kriging”) which is borrowed from – and particularly well-suited to – spatial applications.
Background
I remember when I started using machine learning methods how time consuming and – even worse – manual it could be to perform a hyperparameter search. The whole benefit of machine learning is that the algorithm should optimize the model-learning task for us, right? The problem of course becomes one of compute resources. Suppose we only have simple brute-force grid search at our disposal. With just one hyperparameter to tune, this approach is practical – we may only need to test 5 candidate values. But as the number of hyperparameters (the "dimension") grows, the number of candidate hyperparameterizations grows exponentially. Suppose instead we have 5 hyperparameters to tune – again using just five points for each dimension would now result in \(5^5 = 3,125\) model evaluations to test all the possible combinations. Sometimes 5 points per dimension is realistic – for example, with some discrete parameters like the maximum tree depth in a random forest. But for something continuous it usually is not, so I am really understating how quickly a grid will blow up, making brute-force approaches impractical.
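To make the combinatorics concrete, here is a quick sketch in R – the parameter names and candidate values are placeholders of my own choosing, not recommendations:

## five hypothetical tuning parameters, five candidate values each
grid <- expand.grid(
  eta              = c(0.01, 0.05, 0.10, 0.20, 0.30),
  max_depth        = c(2, 4, 6, 8, 10),
  min_child_weight = c(1, 5, 10, 50, 100),
  subsample        = c(0.5, 0.6, 0.7, 0.8, 1.0),
  colsample_bytree = c(0.5, 0.6, 0.7, 0.8, 1.0)
)
nrow(grid)  # 5^5 = 3125 candidate combinations for an exhaustive search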
Pretty quickly one goes from the brute-force approach to more involved strategies for grid search. That could mean starting with coarse grids and zooming into specific promising areas with higher-resolution grids in subsequent searches; it could mean iterating between two or three different subsets of the hyperparameters which tend to "move together" – like the learning rate and number of rounds in a GBM. These strategies become highly manual, and frankly it becomes a real effort to keep track of the different runs and results. We don't want to have to think this much and risk making a mistake when tuning algorithms!
Model-Based Optimization
MBO differs from grid search in a couple of ways. First, we search the entire continuous range of a hyperparameter, not a discretized set of points within that range. Second, and more importantly, it is a probabilistic method which uses information from early evaluations to improve the selection of subsequent tests. In this regard it is similar to the low-res/high-res search strategy, but automated. As good Bayesians, we like methods that incorporate prior information to improve later decisions – an intuitive and appealing principle.
As mentioned above, the method for selecting later test points based on the information from the early tests is Gaussian process smoothing, or kriging. One popular application of Gaussian processes is geo-spatial smoothing and regression. We are doing essentially the same thing here, except instead of geographic (lat-long) space, our space is defined by the ranges of a set of hyperparameters. We refer to this as the hyperparameter space, and MBO is going to help us search it for the point which gives the optimal result of a machine learning algorithm.
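Before getting to the tuning problem, here is a minimal one-dimensional sketch of the smoothing idea using DiceKriging. The toy function, sample size, and settings are my own illustration and not part of the tuning workflow below:

library(DiceKriging)

## a handful of "expensive" evaluations of some unknown function
set.seed(1)
x_obs <- seq(0, 1, length.out = 7)
y_obs <- sin(8 * x_obs) + rnorm(7, sd = 0.05)

## fit a GP with a Matern 3/2 covariance and interpolate between the observations
gp <- km(design = data.frame(x = x_obs), response = y_obs,
         covtype = "matern3_2", nugget = 1e-6, control = list(trace = FALSE))
pred <- predict(gp, newdata = data.frame(x = seq(0, 1, length.out = 100)), type = "UK")

## pred$mean is the surrogate estimate; pred$sd quantifies the uncertainty between observed points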
So let's take a look at how Bayes helps us tune machine learning algorithms with some code.
Demonstration
Environment
The main package we need is mlrMBO, which provides the mbo() method for optimizing an arbitrary function sequentially. We also need several helpers – smoof to define the objective function which will be optimized; ParamHelpers to define the parameter space in which we will perform the Bayesian search for a global optimum; and DiceKriging to provide the Gaussian process interpolation ("kriging", as it is known in geostatistics) capability.
We will use the xgboost flavor of GBM as our machine learning methodology to be tuned, but you could adapt what I'm demonstrating here to any algorithm with multiple hyperparameters (or even a single one, if the run-time of a single evaluation is high enough to warrant it). mlrMBO is completely agnostic to your choice of methodology, but the flip side is that a bit of coding setup is required on the data scientist's part (good thing we like coding, and don't like manual work).
library(CASdatasets)
library(dplyr)
library(tibble)
library(magrittr)
library(ggplot2)
library(scatterplot3d)
library(kableExtra)
library(tidyr)
library(mlrMBO)
library(ParamHelpers)
library(DiceKriging)
library(smoof)
library(xgboost)
Data
I'll use my go-to insurance ratemaking dataset for demonstration purposes – the French motor dataset from CASdatasets.
data("freMPL1") data("freMPL2") data("freMPL3") fre_df < - rbind(freMPL1, freMPL2, freMPL3 %>% select(-DeducType)) rm(freMPL1, freMPL2, freMPL3)
Let's take a look at our target variable, ClaimAmount.
gridExtra::grid.arrange(
  fre_df %>%
    filter(ClaimAmount > 0) %>%
    ggplot(aes(x = ClaimAmount)) +
    geom_density() +
    ggtitle("Observed Loss Distribution"),
  fre_df %>%
    filter(ClaimAmount > 0, ClaimAmount < 1.5e4) %>%
    ggplot(aes(x = ClaimAmount)) +
    geom_density() +
    ggtitle("Observed Severity Distribution"),
  nrow = 1
)
We have something like a compound distribution – a probability mass at 0, and some long-tailed distribution of loss dollars for observations with incurred claims. But let's also look beyond the smoothed graphical view.
min(fre_df$ClaimAmount)
## [1] -3407.7
sum(fre_df$ClaimAmount < 0)
## [1] 690
We also appear to have some claims < 0 - perhaps recoveries (vehicle salvage) exceeded payments. For the sake of focusing on the MBO, we will adjust these records by flooring values at 0. I'll also convert some factor columns to numeric types which make more sense for modeling.
fre_df <- fre_df %>%
  mutate(ClaimAmount = case_when(ClaimAmount < 0 ~ 0, TRUE ~ ClaimAmount)) %>%
  mutate(VehMaxSpeed_num = sub(".*-", "", VehMaxSpeed) %>% substr(., 1, 3) %>% as.numeric,
         VehAge_num = sub("*.-", "", VehAge) %>% sub('\\+', '', .) %>% as.numeric,
         VehPrice_num = as.integer(VehPrice)) %>% # the factor levels appear to be ordered so I will use this
  group_by(SocioCateg) %>%                        # high cardinality, will encode as a proportion of total
  mutate(SocioCateg_prop = (sum(n()) / 4) / nrow(.) * 1e5) %>%
  ungroup()

## matrices, no intercept needed and don't forget to exclude post-dictors
fre_mat <- model.matrix(ClaimAmount ~ . -1 -ClaimInd -Exposure -RecordBeg -RecordEnd
                        -VehMaxSpeed -VehPrice -VehAge -SocioCateg,
                        data = fre_df)

## xgb.DMatrix, faster sparse matrix; base_margin == offset
## we use log earned exposure because the xgboost Tweedie implementation uses a log link for the mean
fre_dm <- xgb.DMatrix(data = fre_mat,
                      label = fre_df$ClaimAmount,
                      base_margin = log(fre_df$Exposure))
Objective function for optimizing
To avoid confusion, note that there are two objective functions we could refer to. Statistically, our objective function (aka the loss function) is the negative log-likelihood for an assumed Tweedie-distributed response. The xgboost algorithm will minimize this objective (equivalent to maximizing the likelihood) for a given set of hyperparameters on each run. Our other objective function is the R function defined below – it calls xgb.cv(), runs the learning procedure with cross-validation, stops when the out-of-fold likelihood no longer improves, and returns the best evaluation of the metric (Tweedie negative log-likelihood) on the out-of-fold samples.
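For reference, the metric xgboost reports as tweedie-nloglik@\(\rho\) is (up to additive terms that depend only on \(y\)) the negative Tweedie quasi-log-likelihood for a prediction \(\mu\), label \(y\) and variance power \(\rho\): \( -\ell(y,\mu;\rho) = -\left( y\,\frac{\mu^{1-\rho}}{1-\rho} - \frac{\mu^{2-\rho}}{2-\rho} \right) \). Minimizing this quantity is what "improvement" means in the early-stopping rule just described.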
Note that the function below also defines the hyperparameter space – the set of tuning parameters and the range of values each may take. There are seven standard xgboost tuning parameters in the search space, and I've also added the Tweedie variance "power" parameter as an eighth. This parameter takes a value in (1, 2) for a Poisson-gamma compound distribution, but I first narrowed it down to a smaller range based on a quick profile of the loss distribution (using tweedie::tweedie.profile(), omitted here).
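That omitted profiling step might look roughly like the sketch below – the sample size, formula, and candidate grid are my own assumptions, so treat it as illustrative only:

library(tweedie)

## profile the Tweedie power p on a modest sample to bracket a sensible search range
set.seed(42)
prof_df <- fre_df[sample(nrow(fre_df), 2e4), ]
p_prof <- tweedie.profile(ClaimAmount ~ 1,            # intercept-only is enough to bracket p
                          p.vec = seq(1.5, 1.95, by = 0.05),
                          data = prof_df,
                          link.power = 0,              # log link
                          do.smooth = TRUE, do.plot = FALSE)
p_prof$p.max   # maximum-likelihood estimate of the power; centre the tuning range around it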
# Adapted for Tweedie likelihood from this very good post at https://www.simoncoulombe.com/2019/01/bayesian/
# objective function: we want to minimize the neg log-likelihood by tuning hyperparameters
obj.fun <- makeSingleObjectiveFunction(
  name = "xgb_cv_bayes",
  fn = function(x) {
    set.seed(42)
    cv <- xgb.cv(params = list(
                   booster = "gbtree",
                   eta = x["eta"],
                   max_depth = x["max_depth"],
                   min_child_weight = x["min_child_weight"],
                   gamma = x["gamma"],
                   subsample = x["subsample"],
                   colsample_bytree = x["colsample_bytree"],
                   max_delta_step = x["max_delta_step"],
                   tweedie_variance_power = x["tweedie_variance_power"],
                   objective = 'reg:tweedie',
                   eval_metric = paste0("tweedie-nloglik@", x["tweedie_variance_power"])),
                 data = dm,                  ## must set in global.Env()
                 nround = 7000,              ## Set this large and use early stopping
                 nthread = 26,               ## Adjust based on your machine
                 nfold = 5,
                 prediction = FALSE,
                 showsd = TRUE,
                 early_stopping_rounds = 25, ## If evaluation metric does not improve on out-of-fold sample for 25 rounds, stop
                 verbose = 1,
                 print_every_n = 500)
    cv$evaluation_log %>% pull(4) %>% min    ## column 4 is the eval metric here, tweedie negative log-likelihood
  },
  par.set = makeParamSet(
    makeNumericParam("eta",                    lower = 0.005, upper = 0.01),
    makeNumericParam("gamma",                  lower = 1,     upper = 5),
    makeIntegerParam("max_depth",              lower = 2,     upper = 10),
    makeIntegerParam("min_child_weight",       lower = 300,   upper = 2000),
    makeNumericParam("subsample",              lower = 0.20,  upper = 0.8),
    makeNumericParam("colsample_bytree",       lower = 0.20,  upper = 0.8),
    makeNumericParam("max_delta_step",         lower = 0,     upper = 5),
    makeNumericParam("tweedie_variance_power", lower = 1.75,  upper = 1.85)
  ),
  minimize = TRUE ## negative log likelihood
)
A function which runs the optimization
The core piece here is the call to mbo(). This accepts an initial design – i.e. a set of locations chosen to be "space-filling" within our hyperparameter space (we do not want random generation, which could leave whole areas of the space with no points nearby) – created using ParamHelpers::generateDesign(). The makeMBOControl() method creates an object which simply tells mbo() how many optimization steps to run after the initial design is tested – these are the runs whose locations are proposed probabilistically through Gaussian process smoothing, aka kriging. Finally, I create a plot of the optimization path and return the objects in a list for later use.
The covariance structure used in the Gaussian process is what encodes the Bayesian prior – it defines how much information nearby observed values carry about a new location and how smooth we expect the surface to be. We use a Matern 3/2 kernel – a moderately smooth covariance often used in geospatial applications and well-suited to our own spatial task; it is equivalent to the product of an exponential and a polynomial of degree 1. This is the mbo default for a numerical hyperparameter space. If your hyperparameters include some which are non-numeric (for example, a "method" hyperparameter with a set of methods to choose from), then instead of kriging a random forest is used to estimate the value of the objective function between observed points, and the optimizing proposals are chosen from that surrogate. This would no longer be a strictly "Bayesian" approach, though I think it would still be Bayesian in spirit.
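As a hypothetical illustration of that non-numeric case (the names and ranges are made up, and mlrMBO would normally configure a sensible surrogate for you automatically), a mixed parameter set and a random forest surrogate could be declared like this:

## a mixed numeric + discrete space, for which kriging is not directly applicable
mixed.ps <- makeParamSet(
  makeNumericParam("eta", lower = 0.005, upper = 0.05),
  makeDiscreteParam("booster", values = c("gbtree", "dart"))
)

## a random forest surrogate that can handle the discrete dimension
rf.surrogate <- makeLearner("regr.randomForest", predict.type = "se")
## ...then pass learner = rf.surrogate to mbo(), as in the do_bayes() function below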
The Gaussian process models the result of our objective function as a function of the hyperparameter values, fitted to the initial design samples. For this reason it is referred to (especially in the deep learning community) as a surrogate model – it serves as a cheap surrogate for running another evaluation of our objective function at a new point. For any point not evaluated directly, the estimated/interpolated surface provides an expectation. This benefits us because points that are likely to perform poorly (based on the surrogate model's estimate) are discarded, and we only move on to directly evaluating points in promising regions of the hyperparameter space. Concretely, mlrMBO proposes each new point by optimizing an infill criterion over the surrogate – here the confidence bound, which is why the optimization log below labels the proposals as infill_cb.
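If you want to be explicit about that choice rather than rely on the default, the control object accepts an infill criterion. A sketch, assuming the confidence-bound criterion with its default lambda (expected improvement via makeMBOInfillCritEI() is a common alternative):

## make the infill (acquisition) criterion explicit on the control object
ctrl <- makeMBOControl() %>%
  setMBOControlTermination(iters = 10) %>%
  setMBOControlInfill(crit = makeMBOInfillCritCB())   # or crit = makeMBOInfillCritEI()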
Creating a wrapper function is optional - but to perform multiple runs in an analysis, most of the code here would need to be repeated. To be concise, I write it once so it can be called for subsequent runs (perhaps on other datasets, or if you get back a boundary solution you did not anticipate).
do_bayes <- function(n_design = NULL, opt_steps = NULL, of = obj.fun, seed = 42) {
  set.seed(seed)

  des <- generateDesign(n = n_design,
                        par.set = getParamSet(of),
                        fun = lhs::randomLHS)

  control <- makeMBOControl() %>%
    setMBOControlTermination(., iters = opt_steps)

  ## kriging with a matern(3,2) covariance function is the default surrogate model for numerical domains
  ## but if you wanted to override this you could modify the makeLearner() call below to define your own
  ## GP surrogate model with more or less smoothness, or use an entirely different method
  run <- mbo(fun = of,
             design = des,
             learner = makeLearner("regr.km",
                                   predict.type = "se",
                                   covtype = "matern3_2",
                                   control = list(trace = FALSE)),
             control = control,
             show.info = TRUE)

  ## optimization path plot: best out-of-fold result per round,
  ## distinguishing the initial design from the MBO proposal rounds
  opt_plot <- run$opt.path$env$path %>%
    mutate(Round = row_number()) %>%
    mutate(type = case_when(Round <= n_design ~ "initial design",
                            TRUE ~ "mbo optimization")) %>%
    ggplot(aes(x = Round, y = y, color = type)) +
    geom_point() +
    scale_color_manual(values = c("red", "blue")) +
    labs(title = "mbo run results",
         y = "tweedie negative log-likelihood (out-of-fold)")

  print(run$x)

  return(list(run = run, plot = opt_plot))
}
Number of evaluations
Normally for this problem I would perform more evaluations, in both the initial and optimizing phases. A reasonable rule of thumb is around 5-7 times the number of parameters being tuned for the initial design, and about half of that for the number of optimization steps. You need some points in the space to have something to interpolate between!
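Translated into code, a hedged version of that rule of thumb (not the budget I actually use below, where I keep things small to save time) might be:

## size the design from the parameter set itself
n_params  <- length(getParamSet(obj.fun)$pars)  # 8 parameters here
n_design  <- 6 * n_params                       # initial space-filling evaluations
opt_steps <- ceiling(n_design / 2)              # sequential MBO proposals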
Here's my initial design of 15 points.
des <- generateDesign(n = 15, par.set = getParamSet(obj.fun), fun = lhs::randomLHS)

kable(des, format = "html", digits = 4) %>%
  kable_styling(font_size = 10) %>%
  kable_material_dark()
eta | gamma | max_depth | min_child_weight | subsample | colsample_bytree | max_delta_step | tweedie_variance_power |
---|---|---|---|---|---|---|---|
0.0064 | 1.7337 | 5 | 1259 | 0.3795 | 0.3166 | 4.0727 | 1.8245 |
0.0074 | 3.2956 | 6 | 1073 | 0.4897 | 0.2119 | 1.0432 | 1.7833 |
0.0062 | 1.4628 | 9 | 1451 | 0.7799 | 0.5837 | 3.0967 | 1.8219 |
0.0059 | 4.5578 | 3 | 703 | 0.2135 | 0.3206 | 1.6347 | 1.8346 |
0.0081 | 4.0202 | 8 | 445 | 0.4407 | 0.5495 | 2.2957 | 1.7879 |
0.0067 | 3.5629 | 3 | 330 | 0.4207 | 0.4391 | 0.9399 | 1.7998 |
0.0092 | 4.4374 | 7 | 1937 | 0.6411 | 0.7922 | 3.9315 | 1.7704 |
0.0099 | 1.8265 | 7 | 1595 | 0.2960 | 0.3961 | 2.5070 | 1.8399 |
0.0055 | 2.4464 | 9 | 805 | 0.5314 | 0.2457 | 3.6427 | 1.7624 |
0.0096 | 2.2831 | 10 | 1662 | 0.6873 | 0.6075 | 0.2518 | 1.8457 |
0.0079 | 2.8737 | 4 | 894 | 0.2790 | 0.4954 | 0.4517 | 1.7965 |
0.0087 | 2.8600 | 6 | 599 | 0.5717 | 0.6537 | 4.9145 | 1.7557 |
0.0053 | 1.0082 | 2 | 1863 | 0.6021 | 0.7436 | 2.7048 | 1.7686 |
0.0071 | 4.9817 | 2 | 1380 | 0.3599 | 0.4507 | 4.3572 | 1.8156 |
0.0086 | 3.6847 | 9 | 1124 | 0.7373 | 0.6935 | 1.9759 | 1.8043 |
And here is a view of how well those points fill out three of the eight dimensions.
scatterplot3d(des$eta, des$gamma, des$min_child_weight, type = "h", color = "blue", pch = 16)
We can see large areas with no nearby points – if the global optimum lies there, proposals during the optimization phase may still lead us to find it, but it sure would be helpful to gather some information in those areas up front. Here's a denser design of 42 points – roughly five or six per hyperparameter.
des <- generateDesign(n = 42, par.set = getParamSet(obj.fun), fun = lhs::randomLHS)
scatterplot3d(des$eta, des$gamma, des$min_child_weight, type = "h", color = "blue", pch = 16)
This would take longer to run, but we will rely less heavily on interpolation over long distances during the optimizing phase because we have more information observed through experiments. Choosing your design is about the trade-off between desired accuracy and computational expense. So use as many points in the initial design as you can afford time for (aiming for at least 5-7 per parameter), and maybe half as many for the number of subsequent optimization steps.
Do Bayes!
Now that we are all set up, let's run the procedure using our do_bayes() function above and evaluate the result. As discussed, I recommend sizing your random design and optimization steps according to the size of your hyperparameter space, using 5-7 points per hyperparameter as a rule of thumb. You can also estimate roughly how long a single evaluation takes (which will depend on the hyperparameter values, so treat this as an estimate of the mean time), compare that to the time you can budget, and then choose values that work for you. Here I use 25 total runs – 15 initial evaluations and 10 optimization steps.
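One way to get that mean-time estimate is to time a single cross-validated run before launching the full procedure. The sketch below uses arbitrary mid-range hyperparameter values of my own choosing:

## rough timing of one evaluation at a mid-range point
timing <- system.time(
  xgb.cv(params = list(booster = "gbtree", eta = 0.0075, gamma = 3, max_depth = 6,
                       min_child_weight = 1000, subsample = 0.5, colsample_bytree = 0.5,
                       max_delta_step = 2.5, tweedie_variance_power = 1.8,
                       objective = "reg:tweedie", eval_metric = "tweedie-nloglik@1.8"),
         data = fre_dm, nrounds = 7000, nfold = 5, nthread = 26,
         early_stopping_rounds = 25, verbose = 0)
)
timing["elapsed"] * 25 / 3600  # crude hours estimate for 15 design points + 10 MBO steps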
(Note: The verbose output for each evaluation is shown below for your interest)
dm <- fre_dm runs <- do_bayes(n_design = 15, of = obj.fun, opt_steps = 10, seed = 42) ## Computing y column(s) for design. Not provided. ## [1] train-tweedie-nloglik@1.82975:342.947351+9.068011 test-tweedie-nloglik@1.82975:342.950897+36.281204 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.82975 for early stopping. ## Will train until test_tweedie_nloglik@1.82975 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.82975:54.102369+1.254521 test-tweedie-nloglik@1.82975:54.103011+5.019251 ## [1001] train-tweedie-nloglik@1.82975:17.471831+0.173606 test-tweedie-nloglik@1.82975:17.486700+0.699454 ## [1501] train-tweedie-nloglik@1.82975:15.143893+0.056641 test-tweedie-nloglik@1.82975:15.371196+0.251584 ## [2001] train-tweedie-nloglik@1.82975:14.794773+0.047399 test-tweedie-nloglik@1.82975:15.161293+0.229903 ## [2501] train-tweedie-nloglik@1.82975:14.583392+0.051122 test-tweedie-nloglik@1.82975:15.069248+0.245231 ## [3001] train-tweedie-nloglik@1.82975:14.419826+0.051263 test-tweedie-nloglik@1.82975:14.996046+0.258229 ## [3501] train-tweedie-nloglik@1.82975:14.281899+0.049542 test-tweedie-nloglik@1.82975:14.944954+0.278042 ## [4001] train-tweedie-nloglik@1.82975:14.162422+0.046876 test-tweedie-nloglik@1.82975:14.902480+0.299742 ## [4501] train-tweedie-nloglik@1.82975:14.056329+0.045482 test-tweedie-nloglik@1.82975:14.861813+0.318562 ## Stopping. Best iteration: ## [4597] train-tweedie-nloglik@1.82975:14.037272+0.045602 test-tweedie-nloglik@1.82975:14.852832+0.320272 ## ## [1] train-tweedie-nloglik@1.78053:341.751507+9.223267 test-tweedie-nloglik@1.78053:341.757257+36.900142 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.78053 for early stopping. ## Will train until test_tweedie_nloglik@1.78053 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.78053:20.067214+0.271193 test-tweedie-nloglik@1.78053:20.130995+1.160850 ## [1001] train-tweedie-nloglik@1.78053:15.076158+0.069745 test-tweedie-nloglik@1.78053:15.599540+0.313749 ## [1501] train-tweedie-nloglik@1.78053:14.517137+0.066595 test-tweedie-nloglik@1.78053:15.309150+0.339692 ## [2001] train-tweedie-nloglik@1.78053:14.163563+0.061668 test-tweedie-nloglik@1.78053:15.147976+0.382249 ## [2501] train-tweedie-nloglik@1.78053:13.890958+0.068184 test-tweedie-nloglik@1.78053:15.034968+0.404933 ## [3001] train-tweedie-nloglik@1.78053:13.663806+0.063876 test-tweedie-nloglik@1.78053:14.950204+0.428143 ## [3501] train-tweedie-nloglik@1.78053:13.467250+0.063427 test-tweedie-nloglik@1.78053:14.885284+0.460394 ## [4001] train-tweedie-nloglik@1.78053:13.293906+0.060834 test-tweedie-nloglik@1.78053:14.837956+0.493384 ## Stopping. Best iteration: ## [4190] train-tweedie-nloglik@1.78053:13.231073+0.059948 test-tweedie-nloglik@1.78053:14.818203+0.503775 ## ## [1] train-tweedie-nloglik@1.8352:342.471167+9.032795 test-tweedie-nloglik@1.8352:342.480517+36.144871 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.8352 for early stopping. ## Will train until test_tweedie_nloglik@1.8352 hasn't improved in 25 rounds. 
## ## [501] train-tweedie-nloglik@1.8352:28.357122+0.477149 test-tweedie-nloglik@1.8352:28.895080+2.147141 ## [1001] train-tweedie-nloglik@1.8352:14.938225+0.058305 test-tweedie-nloglik@1.8352:15.482690+0.376248 ## [1501] train-tweedie-nloglik@1.8352:14.236916+0.048579 test-tweedie-nloglik@1.8352:14.920337+0.307540 ## [2001] train-tweedie-nloglik@1.8352:13.941143+0.047951 test-tweedie-nloglik@1.8352:14.796813+0.353247 ## [2501] train-tweedie-nloglik@1.8352:13.719723+0.047280 test-tweedie-nloglik@1.8352:14.722402+0.396149 ## [3001] train-tweedie-nloglik@1.8352:13.536447+0.045626 test-tweedie-nloglik@1.8352:14.666037+0.429260 ## Stopping. Best iteration: ## [3171] train-tweedie-nloglik@1.8352:13.480152+0.046018 test-tweedie-nloglik@1.8352:14.652766+0.442854 ## ## [1] train-tweedie-nloglik@1.80549:340.990906+9.109456 test-tweedie-nloglik@1.80549:340.997974+36.447143 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80549 for early stopping. ## Will train until test_tweedie_nloglik@1.80549 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.80549:17.171271+0.149287 test-tweedie-nloglik@1.80549:17.612757+0.749392 ## [1001] train-tweedie-nloglik@1.80549:14.128140+0.059918 test-tweedie-nloglik@1.80549:14.910039+0.330671 ## [1501] train-tweedie-nloglik@1.80549:13.571417+0.056644 test-tweedie-nloglik@1.80549:14.694386+0.417752 ## Stopping. Best iteration: ## [1847] train-tweedie-nloglik@1.80549:13.292618+0.053193 test-tweedie-nloglik@1.80549:14.617792+0.481956 ## ## [1] train-tweedie-nloglik@1.77453:340.924103+9.220978 test-tweedie-nloglik@1.77453:340.936652+36.904872 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.77453 for early stopping. ## Will train until test_tweedie_nloglik@1.77453 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.77453:15.810031+0.105204 test-tweedie-nloglik@1.77453:16.556760+0.587635 ## [1001] train-tweedie-nloglik@1.77453:13.707083+0.061060 test-tweedie-nloglik@1.77453:14.973095+0.458299 ## [1501] train-tweedie-nloglik@1.77453:13.042243+0.062065 test-tweedie-nloglik@1.77453:14.767002+0.587105 ## Stopping. Best iteration: ## [1888] train-tweedie-nloglik@1.77453:12.664030+0.058594 test-tweedie-nloglik@1.77453:14.693107+0.666969 ## ## [1] train-tweedie-nloglik@1.81211:341.573523+9.099887 test-tweedie-nloglik@1.81211:341.579096+36.409106 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.81211 for early stopping. ## Will train until test_tweedie_nloglik@1.81211 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.81211:21.698555+0.287551 test-tweedie-nloglik@1.81211:21.802563+1.318389 ## [1001] train-tweedie-nloglik@1.81211:15.316608+0.057473 test-tweedie-nloglik@1.81211:15.507609+0.267001 ## [1501] train-tweedie-nloglik@1.81211:15.002261+0.055740 test-tweedie-nloglik@1.81211:15.310712+0.234783 ## [2001] train-tweedie-nloglik@1.81211:14.813123+0.056066 test-tweedie-nloglik@1.81211:15.225331+0.251187 ## [2501] train-tweedie-nloglik@1.81211:14.666221+0.055383 test-tweedie-nloglik@1.81211:15.168456+0.269639 ## [3001] train-tweedie-nloglik@1.81211:14.541463+0.053780 test-tweedie-nloglik@1.81211:15.124595+0.291749 ## [3501] train-tweedie-nloglik@1.81211:14.435767+0.053397 test-tweedie-nloglik@1.81211:15.090134+0.312649 ## [4001] train-tweedie-nloglik@1.81211:14.342162+0.053199 test-tweedie-nloglik@1.81211:15.060593+0.328036 ## [4501] train-tweedie-nloglik@1.81211:14.255980+0.053683 test-tweedie-nloglik@1.81211:15.030732+0.343309 ## Stopping. 
Best iteration: ## [4585] train-tweedie-nloglik@1.81211:14.242503+0.053874 test-tweedie-nloglik@1.81211:15.024390+0.345340 ## ## [1] train-tweedie-nloglik@1.81758:341.854852+9.086251 test-tweedie-nloglik@1.81758:341.863275+36.356180 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.81758 for early stopping. ## Will train until test_tweedie_nloglik@1.81758 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.81758:25.013295+0.397345 test-tweedie-nloglik@1.81758:25.676643+1.808151 ## [1001] train-tweedie-nloglik@1.81758:14.504071+0.055626 test-tweedie-nloglik@1.81758:15.187462+0.380966 ## [1501] train-tweedie-nloglik@1.81758:13.875887+0.049737 test-tweedie-nloglik@1.81758:14.765052+0.371576 ## [2001] train-tweedie-nloglik@1.81758:13.542809+0.049598 test-tweedie-nloglik@1.81758:14.638334+0.423159 ## Stopping. Best iteration: ## [2270] train-tweedie-nloglik@1.81758:13.398557+0.049952 test-tweedie-nloglik@1.81758:14.601097+0.448479 ## ## [1] train-tweedie-nloglik@1.78506:341.348248+9.195115 test-tweedie-nloglik@1.78506:341.357587+36.789846 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.78506 for early stopping. ## Will train until test_tweedie_nloglik@1.78506 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.78506:18.174807+0.195105 test-tweedie-nloglik@1.78506:19.060744+1.022248 ## [1001] train-tweedie-nloglik@1.78506:13.397231+0.063749 test-tweedie-nloglik@1.78506:14.732965+0.451939 ## Stopping. Best iteration: ## [1460] train-tweedie-nloglik@1.78506:12.673491+0.062628 test-tweedie-nloglik@1.78506:14.487809+0.572331 ## ## [1] train-tweedie-nloglik@1.84189:341.976398+8.988454 test-tweedie-nloglik@1.84189:341.984406+35.983697 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.84189 for early stopping. ## Will train until test_tweedie_nloglik@1.84189 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.84189:18.400557+0.165181 test-tweedie-nloglik@1.84189:18.735286+0.850735 ## [1001] train-tweedie-nloglik@1.84189:14.588454+0.047146 test-tweedie-nloglik@1.84189:15.120865+0.271691 ## [1501] train-tweedie-nloglik@1.84189:14.106823+0.046601 test-tweedie-nloglik@1.84189:14.891420+0.316660 ## Stopping. Best iteration: ## [1936] train-tweedie-nloglik@1.84189:13.822906+0.048588 test-tweedie-nloglik@1.84189:14.787369+0.361366 ## ## [1] train-tweedie-nloglik@1.76821:343.904297+9.327030 test-tweedie-nloglik@1.76821:343.909997+37.314570 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.76821 for early stopping. ## Will train until test_tweedie_nloglik@1.76821 hasn't improved in 25 rounds. 
## ## [501] train-tweedie-nloglik@1.76821:153.005731+4.071955 test-tweedie-nloglik@1.76821:153.008331+16.291001 ## [1001] train-tweedie-nloglik@1.76821:70.474121+1.777711 test-tweedie-nloglik@1.76821:70.475147+7.112157 ## [1501] train-tweedie-nloglik@1.76821:35.485048+0.776015 test-tweedie-nloglik@1.76821:35.485494+3.104626 ## [2001] train-tweedie-nloglik@1.76821:21.549418+0.338670 test-tweedie-nloglik@1.76821:21.549704+1.354912 ## [2501] train-tweedie-nloglik@1.76821:17.189798+0.147854 test-tweedie-nloglik@1.76821:17.210399+0.612728 ## [3001] train-tweedie-nloglik@1.76821:16.418288+0.094189 test-tweedie-nloglik@1.76821:16.555264+0.433128 ## [3501] train-tweedie-nloglik@1.76821:16.109250+0.081212 test-tweedie-nloglik@1.76821:16.344176+0.387444 ## [4001] train-tweedie-nloglik@1.76821:15.925080+0.078790 test-tweedie-nloglik@1.76821:16.233377+0.369088 ## [4501] train-tweedie-nloglik@1.76821:15.793141+0.077723 test-tweedie-nloglik@1.76821:16.158620+0.370770 ## [5001] train-tweedie-nloglik@1.76821:15.686279+0.075960 test-tweedie-nloglik@1.76821:16.110775+0.376137 ## [5501] train-tweedie-nloglik@1.76821:15.597672+0.076042 test-tweedie-nloglik@1.76821:16.076505+0.388222 ## Stopping. Best iteration: ## [5707] train-tweedie-nloglik@1.76821:15.563175+0.074569 test-tweedie-nloglik@1.76821:16.059571+0.391720 ## ## [1] train-tweedie-nloglik@1.7599:342.328162+9.312946 test-tweedie-nloglik@1.7599:342.334070+37.259847 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.7599 for early stopping. ## Will train until test_tweedie_nloglik@1.7599 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.7599:21.034394+0.261867 test-tweedie-nloglik@1.7599:21.110715+1.265011 ## [1001] train-tweedie-nloglik@1.7599:16.515162+0.077981 test-tweedie-nloglik@1.7599:16.665819+0.360601 ## [1501] train-tweedie-nloglik@1.7599:16.270615+0.075134 test-tweedie-nloglik@1.7599:16.505939+0.319116 ## [2001] train-tweedie-nloglik@1.7599:16.125761+0.075163 test-tweedie-nloglik@1.7599:16.432678+0.320781 ## [2501] train-tweedie-nloglik@1.7599:16.014767+0.073620 test-tweedie-nloglik@1.7599:16.385424+0.331610 ## [3001] train-tweedie-nloglik@1.7599:15.920065+0.071297 test-tweedie-nloglik@1.7599:16.346298+0.346168 ## Stopping. Best iteration: ## [3228] train-tweedie-nloglik@1.7599:15.881553+0.070426 test-tweedie-nloglik@1.7599:16.329813+0.351423 ## ## [1] train-tweedie-nloglik@1.80226:341.820270+9.143934 test-tweedie-nloglik@1.80226:341.825208+36.586124 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80226 for early stopping. ## Will train until test_tweedie_nloglik@1.80226 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.80226:26.608221+0.443646 test-tweedie-nloglik@1.80226:26.664685+1.977213 ## [1001] train-tweedie-nloglik@1.80226:15.824172+0.068891 test-tweedie-nloglik@1.80226:15.918101+0.371526 ## [1501] train-tweedie-nloglik@1.80226:15.499650+0.061893 test-tweedie-nloglik@1.80226:15.647616+0.269262 ## [2001] train-tweedie-nloglik@1.80226:15.374731+0.061671 test-tweedie-nloglik@1.80226:15.573182+0.258539 ## [2501] train-tweedie-nloglik@1.80226:15.289750+0.061845 test-tweedie-nloglik@1.80226:15.534944+0.261417 ## Stopping. Best iteration: ## [2579] train-tweedie-nloglik@1.80226:15.278433+0.061600 test-tweedie-nloglik@1.80226:15.529791+0.262305 ## ## [1] train-tweedie-nloglik@1.84974:343.394391+8.997226 test-tweedie-nloglik@1.84974:343.399872+36.001197 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.84974 for early stopping. 
## Will train until test_tweedie_nloglik@1.84974 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.84974:35.638326+0.676807 test-tweedie-nloglik@1.84974:35.855040+2.870666 ## [1001] train-tweedie-nloglik@1.84974:15.927161+0.074464 test-tweedie-nloglik@1.84974:16.216283+0.425658 ## [1501] train-tweedie-nloglik@1.84974:14.851506+0.047610 test-tweedie-nloglik@1.84974:15.276635+0.237773 ## [2001] train-tweedie-nloglik@1.84974:14.524470+0.041982 test-tweedie-nloglik@1.84974:15.112846+0.258227 ## [2501] train-tweedie-nloglik@1.84974:14.292380+0.045034 test-tweedie-nloglik@1.84974:15.027395+0.291822 ## [3001] train-tweedie-nloglik@1.84974:14.103252+0.044065 test-tweedie-nloglik@1.84974:14.970938+0.334465 ## Stopping. Best iteration: ## [3384] train-tweedie-nloglik@1.84974:13.977265+0.046385 test-tweedie-nloglik@1.84974:14.943196+0.365468 ## ## [1] train-tweedie-nloglik@1.79106:341.213971+9.169863 test-tweedie-nloglik@1.79106:341.220727+36.687101 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79106 for early stopping. ## Will train until test_tweedie_nloglik@1.79106 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.79106:17.835510+0.165372 test-tweedie-nloglik@1.79106:18.048213+0.847644 ## [1001] train-tweedie-nloglik@1.79106:14.868611+0.068062 test-tweedie-nloglik@1.79106:15.391133+0.315593 ## [1501] train-tweedie-nloglik@1.79106:14.301947+0.058922 test-tweedie-nloglik@1.79106:15.099987+0.329528 ## [2001] train-tweedie-nloglik@1.79106:13.923899+0.058205 test-tweedie-nloglik@1.79106:14.931884+0.371135 ## [2501] train-tweedie-nloglik@1.79106:13.617700+0.062158 test-tweedie-nloglik@1.79106:14.828744+0.421567 ## [3001] train-tweedie-nloglik@1.79106:13.364167+0.063944 test-tweedie-nloglik@1.79106:14.745604+0.455195 ## Stopping. Best iteration: ## [3159] train-tweedie-nloglik@1.79106:13.292696+0.064719 test-tweedie-nloglik@1.79106:14.719766+0.463486 ## ## [1] train-tweedie-nloglik@1.75281:343.292597+9.363249 test-tweedie-nloglik@1.75281:343.301770+37.470216 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.75281 for early stopping. ## Will train until test_tweedie_nloglik@1.75281 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.75281:33.808357+0.652361 test-tweedie-nloglik@1.75281:34.678754+3.054606 ## [1001] train-tweedie-nloglik@1.75281:16.164603+0.097345 test-tweedie-nloglik@1.75281:17.212687+0.750414 ## [1501] train-tweedie-nloglik@1.75281:14.603756+0.074138 test-tweedie-nloglik@1.75281:15.953423+0.574503 ## [2001] train-tweedie-nloglik@1.75281:13.968725+0.071299 test-tweedie-nloglik@1.75281:15.654203+0.618075 ## [2501] train-tweedie-nloglik@1.75281:13.520048+0.068583 test-tweedie-nloglik@1.75281:15.482204+0.654811 ## Stopping. 
Best iteration: ## [2566] train-tweedie-nloglik@1.75281:13.469965+0.068246 test-tweedie-nloglik@1.75281:15.476034+0.661855 ## [mbo] 0: eta=0.00745; gamma=3.51; max_depth=4; min_child_weight=1387; subsample=0.629; colsample_bytree=0.376; max_delta_step=0.64; tweedie_variance_power=1.83 : y = 14.9 : 365.7 secs : initdesign ## [mbo] 0: eta=0.00941; gamma=1.73; max_depth=6; min_child_weight=1729; subsample=0.665; colsample_bytree=0.277; max_delta_step=0.95; tweedie_variance_power=1.78 : y = 14.8 : 401.0 secs : initdesign ## [mbo] 0: eta=0.00594; gamma=2.63; max_depth=9; min_child_weight=1844; subsample=0.406; colsample_bytree=0.723; max_delta_step=4.15; tweedie_variance_power=1.84 : y = 14.7 : 504.6 secs : initdesign ## [mbo] 0: eta=0.00858; gamma=4.15; max_depth=6; min_child_weight=974; subsample=0.772; colsample_bytree=0.686; max_delta_step=3.91; tweedie_variance_power=1.81 : y = 14.6 : 199.2 secs : initdesign ## [mbo] 0: eta=0.00993; gamma=4.57; max_depth=8; min_child_weight=1068; subsample=0.688; colsample_bytree=0.441; max_delta_step=2.05; tweedie_variance_power=1.77 : y = 14.7 : 239.7 secs : initdesign ## [mbo] 0: eta=0.00698; gamma=1.95; max_depth=3; min_child_weight=1116; subsample=0.581; colsample_bytree=0.647; max_delta_step=4.48; tweedie_variance_power=1.81 : y = 15 : 366.1 secs : initdesign ## [mbo] 0: eta=0.00637; gamma=3.72; max_depth=10; min_child_weight=1934; subsample=0.457; colsample_bytree=0.779; max_delta_step=2.48; tweedie_variance_power=1.82 : y = 14.6 : 415.2 secs : initdesign ## [mbo] 0: eta=0.00794; gamma=2.22; max_depth=9; min_child_weight=840; subsample=0.757; colsample_bytree=0.515; max_delta_step=3.11; tweedie_variance_power=1.79 : y = 14.5 : 206.0 secs : initdesign ## [mbo] 0: eta=0.00825; gamma=4.93; max_depth=7; min_child_weight=616; subsample=0.325; colsample_bytree=0.292; max_delta_step=3.4; tweedie_variance_power=1.84 : y = 14.8 : 213.4 secs : initdesign ## [mbo] 0: eta=0.00871; gamma=2.37; max_depth=3; min_child_weight=1553; subsample=0.221; colsample_bytree=0.324; max_delta_step=0.248; tweedie_variance_power=1.77 : y = 16.1 : 359.8 secs : initdesign ## [mbo] 0: eta=0.0073; gamma=4.34; max_depth=2; min_child_weight=1484; subsample=0.526; colsample_bytree=0.618; max_delta_step=1.99; tweedie_variance_power=1.76 : y = 16.3 : 233.8 secs : initdesign ## [mbo] 0: eta=0.00608; gamma=1.07; max_depth=2; min_child_weight=1280; subsample=0.293; colsample_bytree=0.402; max_delta_step=2.82; tweedie_variance_power=1.8 : y = 15.5 : 196.2 secs : initdesign ## [mbo] 0: eta=0.00519; gamma=3.23; max_depth=5; min_child_weight=514; subsample=0.5; colsample_bytree=0.527; max_delta_step=1.51; tweedie_variance_power=1.85 : y = 14.9 : 380.0 secs : initdesign ## [mbo] 0: eta=0.0091; gamma=1.49; max_depth=6; min_child_weight=321; subsample=0.362; colsample_bytree=0.21; max_delta_step=1.12; tweedie_variance_power=1.79 : y = 14.7 : 300.6 secs : initdesign ## [mbo] 0: eta=0.00543; gamma=2.87; max_depth=10; min_child_weight=731; subsample=0.243; colsample_bytree=0.595; max_delta_step=4.85; tweedie_variance_power=1.75 : y = 15.5 : 350.0 secs : initdesign ## [1] train-tweedie-nloglik@1.81205:340.715326+9.076589 test-tweedie-nloglik@1.81205:340.723718+36.318898 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.81205 for early stopping. ## Will train until test_tweedie_nloglik@1.81205 hasn't improved in 25 rounds. 
## ## [501] train-tweedie-nloglik@1.81205:15.918866+0.102574 test-tweedie-nloglik@1.81205:16.461088+0.571469 ## [1001] train-tweedie-nloglik@1.81205:13.785034+0.049437 test-tweedie-nloglik@1.81205:14.672956+0.330847 ## [1501] train-tweedie-nloglik@1.81205:13.252387+0.053019 test-tweedie-nloglik@1.81205:14.476279+0.413990 ## [2001] train-tweedie-nloglik@1.81205:12.879603+0.050442 test-tweedie-nloglik@1.81205:14.391401+0.501251 ## Stopping. Best iteration: ## [2023] train-tweedie-nloglik@1.81205:12.864722+0.050294 test-tweedie-nloglik@1.81205:14.387596+0.505142 ## [mbo] 1: eta=0.00954; gamma=1.9; max_depth=8; min_child_weight=1225; subsample=0.731; colsample_bytree=0.349; max_delta_step=2.2; tweedie_variance_power=1.81 : y = 14.4 : 249.3 secs : infill_cb ## [1] train-tweedie-nloglik@1.80087:343.092389+9.184359 test-tweedie-nloglik@1.80087:343.097986+36.745005 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80087 for early stopping. ## Will train until test_tweedie_nloglik@1.80087 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.80087:113.222479+2.921018 test-tweedie-nloglik@1.80087:113.224371+11.686362 ## [1001] train-tweedie-nloglik@1.80087:41.464624+0.928879 test-tweedie-nloglik@1.80087:41.465188+3.716334 ## [1501] train-tweedie-nloglik@1.80087:20.439788+0.295234 test-tweedie-nloglik@1.80087:20.440313+1.181873 ## [2001] train-tweedie-nloglik@1.80087:15.711758+0.113962 test-tweedie-nloglik@1.80087:15.898547+0.431068 ## [2501] train-tweedie-nloglik@1.80087:14.325745+0.070814 test-tweedie-nloglik@1.80087:15.013139+0.345294 ## [3001] train-tweedie-nloglik@1.80087:13.676998+0.056270 test-tweedie-nloglik@1.80087:14.686207+0.365084 ## [3501] train-tweedie-nloglik@1.80087:13.235437+0.057295 test-tweedie-nloglik@1.80087:14.506423+0.408045 ## [4001] train-tweedie-nloglik@1.80087:12.878366+0.056469 test-tweedie-nloglik@1.80087:14.409738+0.476704 ## Stopping. Best iteration: ## [4439] train-tweedie-nloglik@1.80087:12.602206+0.057227 test-tweedie-nloglik@1.80087:14.353508+0.526717 ## [mbo] 2: eta=0.00911; gamma=1.77; max_depth=8; min_child_weight=868; subsample=0.791; colsample_bytree=0.549; max_delta_step=0.314; tweedie_variance_power=1.8 : y = 14.4 : 422.2 secs : infill_cb ## [1] train-tweedie-nloglik@1.79891:341.616339+9.151591 test-tweedie-nloglik@1.79891:341.621362+36.614048 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79891 for early stopping. ## Will train until test_tweedie_nloglik@1.79891 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.79891:21.260748+0.323534 test-tweedie-nloglik@1.79891:21.276269+1.295378 ## [1001] train-tweedie-nloglik@1.79891:14.082018+0.051754 test-tweedie-nloglik@1.79891:14.875722+0.333678 ## [1501] train-tweedie-nloglik@1.79891:13.233551+0.051804 test-tweedie-nloglik@1.79891:14.484654+0.410128 ## [2001] train-tweedie-nloglik@1.79891:12.712292+0.049849 test-tweedie-nloglik@1.79891:14.331855+0.503634 ## Stopping. Best iteration: ## [2290] train-tweedie-nloglik@1.79891:12.472694+0.053111 test-tweedie-nloglik@1.79891:14.298177+0.575406 ## [mbo] 3: eta=0.00942; gamma=3.15; max_depth=9; min_child_weight=843; subsample=0.783; colsample_bytree=0.271; max_delta_step=0.888; tweedie_variance_power=1.8 : y = 14.3 : 252.1 secs : infill_cb ## [1] train-tweedie-nloglik@1.8023:340.514740+9.107514 test-tweedie-nloglik@1.8023:340.523969+36.446491 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.8023 for early stopping. 
## Will train until test_tweedie_nloglik@1.8023 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.8023:14.909694+0.100825 test-tweedie-nloglik@1.8023:16.029717+0.614204 ## [1001] train-tweedie-nloglik@1.8023:12.276371+0.058075 test-tweedie-nloglik@1.8023:14.299033+0.636373 ## Stopping. Best iteration: ## [1016] train-tweedie-nloglik@1.8023:12.243908+0.059515 test-tweedie-nloglik@1.8023:14.297354+0.646123 ## [mbo] 4: eta=0.00999; gamma=2.8; max_depth=9; min_child_weight=342; subsample=0.683; colsample_bytree=0.49; max_delta_step=1.58; tweedie_variance_power=1.8 : y = 14.3 : 155.0 secs : infill_cb ## [1] train-tweedie-nloglik@1.79965:340.693542+9.123636 test-tweedie-nloglik@1.79965:340.698126+36.502615 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79965 for early stopping. ## Will train until test_tweedie_nloglik@1.79965 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.79965:15.513652+0.107246 test-tweedie-nloglik@1.79965:16.291323+0.599968 ## [1001] train-tweedie-nloglik@1.79965:13.127623+0.057521 test-tweedie-nloglik@1.79965:14.478948+0.468251 ## [1501] train-tweedie-nloglik@1.79965:12.450628+0.057462 test-tweedie-nloglik@1.79965:14.320771+0.610130 ## Stopping. Best iteration: ## [1521] train-tweedie-nloglik@1.79965:12.427706+0.057687 test-tweedie-nloglik@1.79965:14.317007+0.616377 ## [mbo] 5: eta=0.01; gamma=2.63; max_depth=10; min_child_weight=1081; subsample=0.698; colsample_bytree=0.493; max_delta_step=1.18; tweedie_variance_power=1.8 : y = 14.3 : 205.2 secs : infill_cb ## [1] train-tweedie-nloglik@1.79543:340.537445+9.134006 test-tweedie-nloglik@1.79543:340.552142+36.558473 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79543 for early stopping. ## Will train until test_tweedie_nloglik@1.79543 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.79543:15.423422+0.099615 test-tweedie-nloglik@1.79543:16.142575+0.549508 ## [1001] train-tweedie-nloglik@1.79543:13.201343+0.056114 test-tweedie-nloglik@1.79543:14.503582+0.419961 ## [1501] train-tweedie-nloglik@1.79543:12.487963+0.057690 test-tweedie-nloglik@1.79543:14.305727+0.556182 ## Stopping. Best iteration: ## [1697] train-tweedie-nloglik@1.79543:12.274374+0.058514 test-tweedie-nloglik@1.79543:14.272447+0.606904 ## [mbo] 6: eta=0.00999; gamma=1.42; max_depth=9; min_child_weight=584; subsample=0.769; colsample_bytree=0.272; max_delta_step=2.29; tweedie_variance_power=1.8 : y = 14.3 : 217.2 secs : infill_cb ## [1] train-tweedie-nloglik@1.79417:341.503680+9.166254 test-tweedie-nloglik@1.79417:341.508606+36.672458 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.79417 for early stopping. ## Will train until test_tweedie_nloglik@1.79417 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.79417:19.509591+0.261787 test-tweedie-nloglik@1.79417:19.639546+1.096469 ## [1001] train-tweedie-nloglik@1.79417:13.144418+0.052379 test-tweedie-nloglik@1.79417:14.496626+0.423258 ## [1501] train-tweedie-nloglik@1.79417:12.073966+0.057633 test-tweedie-nloglik@1.79417:14.226620+0.590800 ## Stopping. 
Best iteration: ## [1514] train-tweedie-nloglik@1.79417:12.054614+0.057727 test-tweedie-nloglik@1.79417:14.224718+0.595632 ## [mbo] 7: eta=0.00991; gamma=1.58; max_depth=10; min_child_weight=453; subsample=0.77; colsample_bytree=0.352; max_delta_step=0.902; tweedie_variance_power=1.79 : y = 14.2 : 182.2 secs : infill_cb ## [1] train-tweedie-nloglik@1.80093:342.887091+9.178540 test-tweedie-nloglik@1.80093:342.892096+36.721835 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80093 for early stopping. ## Will train until test_tweedie_nloglik@1.80093 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.80093:85.321825+2.153553 test-tweedie-nloglik@1.80093:85.322955+8.616076 ## [1001] train-tweedie-nloglik@1.80093:26.996887+0.505121 test-tweedie-nloglik@1.80093:26.997190+2.021150 ## [1501] train-tweedie-nloglik@1.80093:16.226591+0.120683 test-tweedie-nloglik@1.80093:16.298045+0.500625 ## [2001] train-tweedie-nloglik@1.80093:14.522975+0.055452 test-tweedie-nloglik@1.80093:15.077755+0.326529 ## [2501] train-tweedie-nloglik@1.80093:13.705078+0.062120 test-tweedie-nloglik@1.80093:14.643362+0.337972 ## [3001] train-tweedie-nloglik@1.80093:13.174421+0.062708 test-tweedie-nloglik@1.80093:14.423376+0.381127 ## [3501] train-tweedie-nloglik@1.80093:12.755801+0.063702 test-tweedie-nloglik@1.80093:14.289940+0.439437 ## Stopping. Best iteration: ## [3829] train-tweedie-nloglik@1.80093:12.516629+0.066593 test-tweedie-nloglik@1.80093:14.231650+0.485118 ## [mbo] 8: eta=0.00908; gamma=1.13; max_depth=10; min_child_weight=301; subsample=0.709; colsample_bytree=0.2; max_delta_step=0.399; tweedie_variance_power=1.8 : y = 14.2 : 391.0 secs : infill_cb ## [1] train-tweedie-nloglik@1.80353:340.954138+9.115118 test-tweedie-nloglik@1.80353:340.965655+36.477642 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80353 for early stopping. ## Will train until test_tweedie_nloglik@1.80353 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.80353:16.280642+0.145262 test-tweedie-nloglik@1.80353:17.426506+0.818395 ## [1001] train-tweedie-nloglik@1.80353:12.257854+0.055505 test-tweedie-nloglik@1.80353:14.196385+0.579480 ## Stopping. Best iteration: ## [1075] train-tweedie-nloglik@1.80353:12.083582+0.060371 test-tweedie-nloglik@1.80353:14.174686+0.630563 ## [mbo] 9: eta=0.00868; gamma=1.42; max_depth=10; min_child_weight=302; subsample=0.799; colsample_bytree=0.376; max_delta_step=1.39; tweedie_variance_power=1.8 : y = 14.2 : 165.5 secs : infill_cb ## [1] train-tweedie-nloglik@1.8168:341.102509+9.068973 test-tweedie-nloglik@1.8168:341.113751+36.288728 ## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.8168 for early stopping. ## Will train until test_tweedie_nloglik@1.8168 hasn't improved in 25 rounds. ## ## [501] train-tweedie-nloglik@1.8168:16.397836+0.150262 test-tweedie-nloglik@1.8168:17.661101+0.874430 ## [1001] train-tweedie-nloglik@1.8168:12.316689+0.061103 test-tweedie-nloglik@1.8168:14.303786+0.625352 ## Stopping. 
Best iteration: ## [1019] train-tweedie-nloglik@1.8168:12.277460+0.060180 test-tweedie-nloglik@1.8168:14.299007+0.638777 ## [mbo] 10: eta=0.00858; gamma=1.26; max_depth=10; min_child_weight=467; subsample=0.798; colsample_bytree=0.641; max_delta_step=1.38; tweedie_variance_power=1.82 : y = 14.3 : 170.7 secs : infill_cb ## $eta ## [1] 0.008678109 ## ## $gamma ## [1] 1.423488 ## ## $max_depth ## [1] 10 ## ## $min_child_weight ## [1] 302 ## ## $subsample ## [1] 0.7986374 ## ## $colsample_bytree ## [1] 0.3759013 ## ## $max_delta_step ## [1] 1.391834 ## ## $tweedie_variance_power ## [1] 1.803532
Diagnostics and evaluating result
Results in hand, we want to check some diagnostics starting with the objective function evaluation for all of the runs.
runs$plot
The plot above (see our do_bayes() function for how we extracted this info) shows the best test evaluation result for each run – the initial design is colored red and the optimization runs are in blue. The hyperparameter values behind the blue points are the proposals chosen through kriging. We can see that none of the random evaluations gave a top result, but together they provided solid information on where in the hyperparameter space to focus the search in order to optimize the algorithm – every subsequent proposal was better than all of the random ones.
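If you prefer to inspect the path numerically rather than through the stored plot, the optimization path converts to a data frame – a small sketch, where dob records the round at which each point was proposed (0 marks the initial design):

path_df <- as.data.frame(runs$run$opt.path)
path_df %>%
  mutate(phase = ifelse(dob == 0, "initial design", "mbo proposal")) %>%
  group_by(phase) %>%
  summarise(n = n(), best_y = min(y))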
The "default" viz below comes from the plot() S3 method for MBOSingleObjResult and shows a few useful things, although the formatting could be improved. Most importantly, the top-left panel shows the "scaled" values of each hyperparameter for every run. Use this to confirm that the recommended solution does not include any hyperparameter at the boundary of the range tested – if it does, expand the range of that parameter in your objective function and re-run. In my example below, you can see that the optimal solution (the green line) includes a max_depth at the maximum of 10, and a min_child_weight at or near the minimum (300) of the range I allowed. Unless I were intentionally using these bounds to limit model complexity and improve generalization, I should expand the ranges of these hyperparameters and run again.
class(runs$run) %>% print
## [1] "MBOSingleObjResult" "MBOResult"

plot(runs$run)
If you print the result object you can confirm the recommended solution included these boundary values:
print(runs$run) ## Recommended parameters: ## eta=0.00868; gamma=1.42; max_depth=10; min_child_weight=302; subsample=0.799; colsample_bytree=0.376; max_delta_step=1.39; tweedie_variance_power=1.8 ## Objective: y = 14.175 ## ## Optimization path ## 15 + 10 entries in total, displaying last 10 (or less): ## eta gamma max_depth min_child_weight subsample colsample_bytree max_delta_step tweedie_variance_power ## 16 0.009543908 1.904103 8 1225 0.7305367 0.3488211 2.1958369 1.812051 ## 17 0.009114593 1.773073 8 868 0.7911231 0.5494522 0.3138641 1.800874 ## 18 0.009421640 3.150876 9 843 0.7830561 0.2705329 0.8878380 1.798907 ## 19 0.009990520 2.801814 9 342 0.6832292 0.4898784 1.5775174 1.802298 ## 20 0.009996788 2.626826 10 1081 0.6979587 0.4934878 1.1786900 1.799649 ## 21 0.009986297 1.419945 9 584 0.7690373 0.2716777 2.2940200 1.795434 ## 22 0.009913078 1.579795 10 453 0.7699498 0.3517838 0.9017585 1.794166 ## 23 0.009082225 1.130765 10 301 0.7085142 0.2004467 0.3985858 1.800927 ## 24 0.008678109 1.423488 10 302 0.7986374 0.3759013 1.3918335 1.803532 ## 25 0.008579770 1.258053 10 467 0.7982686 0.6407762 1.3812183 1.816803 ## y dob eol error.message exec.time cb error.model train.time prop.type propose.time se mean ## 16 14.38760 1 NA 249.294 14.08443 0.417 infill_cb 1.041 0.26275330 14.34718 ## 17 14.35351 2 NA 422.219 14.20226 0.150 infill_cb 1.771 0.21492201 14.41718 ## 18 14.29818 3 NA 252.086 14.20628 0.123 infill_cb 1.537 0.19145391 14.39773 ## 19 14.29735 4 NA 155.010 14.19001 0.174 infill_cb 1.474 0.15554575 14.34556 ## 20 14.31701 5 NA 205.205 14.19583 0.333 infill_cb 1.299 0.15011713 14.34595 ## 21 14.27245 6 NA 217.224 14.20543 0.122 infill_cb 1.356 0.12818859 14.33362 ## 22 14.22472 7 NA 182.158 14.19185 0.101 infill_cb 1.510 0.10787970 14.29973 ## 23 14.23165 8 NA 390.982 14.16640 0.180 infill_cb 1.438 0.12182032 14.28822 ## 24 14.17469 9 NA 165.500 14.16323 0.113 infill_cb 1.142 0.08824657 14.25147 ## 25 14.29901 10 NA 170.738 14.12875 0.097 infill_cb 1.179 0.14723428 14.27598 ## lambda ## 16 1 ## 17 1 ## 18 1 ## 19 1 ## 20 1 ## 21 1 ## 22 1 ## 23 1 ## 24 1 ## 25 1
Using the result
Assuming we are happy with the result, we should then have what we need to proceed with model training. However, since xgb.cv() uses early stopping and nrounds is therefore not a tuning parameter, we did not capture the optimal number of rounds in our MBO result. So we need to run one more evaluation the old-fashioned way, calling xgb.cv() directly with the best hyperparameters we found.
best.params <- runs$run$x
print(best.params)
## $eta
## [1] 0.008678109
##
## $gamma
## [1] 1.423488
##
## $max_depth
## [1] 10
##
## $min_child_weight
## [1] 302
##
## $subsample
## [1] 0.7986374
##
## $colsample_bytree
## [1] 0.3759013
##
## $max_delta_step
## [1] 1.391834
##
## $tweedie_variance_power
## [1] 1.803532
We add the model parameters which were fixed during optimization to this list:
best.params$booster <- "gbtree" best.params$objective <- "reg:tweedie"
Now we cross-validate the number of rounds to use, fixing our best hyperparameters:
optimal.cv <- xgb.cv(params = best.params,
                     data = fre_dm,
                     nrounds = 6000,
                     nthread = 26,
                     nfold = 5,
                     prediction = FALSE,
                     showsd = TRUE,
                     early_stopping_rounds = 25,
                     verbose = 1,
                     print_every_n = 500)
## [1] train-tweedie-nloglik@1.80353:340.950989+7.692406 test-tweedie-nloglik@1.80353:340.929736+30.743659
## Multiple eval metrics are present. Will use test_tweedie_nloglik@1.80353 for early stopping.
## Will train until test_tweedie_nloglik@1.80353 hasn't improved in 25 rounds.
##
## [501] train-tweedie-nloglik@1.80353:16.281491+0.131605 test-tweedie-nloglik@1.80353:17.436350+0.607008
## [1001] train-tweedie-nloglik@1.80353:12.254035+0.041286 test-tweedie-nloglik@1.80353:14.250821+0.548260
## Stopping. Best iteration:
## [1060] train-tweedie-nloglik@1.80353:12.114213+0.035136 test-tweedie-nloglik@1.80353:14.242658+0.593849
Obtain the best number of rounds…
best.params$nrounds <- optimal.cv$best_ntreelimit
best.params[[11]] %>% print
## [1] 1060
…and finally, train the final learner:
final.model <- xgboost(params = best.params[-11], ## do not include nrounds here
                       data = fre_dm,
                       nrounds = best.params$nrounds,
                       verbose = 1,
                       print_every_n = 500)
## [1] train-tweedie-nloglik@1.80353:340.952393
## [501] train-tweedie-nloglik@1.80353:16.252325
## [1001] train-tweedie-nloglik@1.80353:12.189503
## [1060] train-tweedie-nloglik@1.80353:12.041411

xgb.importance(model = final.model) %>%
  xgb.plot.importance()
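As one last optional sanity check – my own addition, not part of the original workflow – you could score the training matrix and confirm that total predicted loss is in the right ballpark relative to observed loss. Because the log(Exposure) base margin is already baked into fre_dm, the predictions come out on the claim-amount scale:

## aggregate actual-vs-expected check on the training data
preds <- predict(final.model, newdata = fre_dm)
c(predicted_total = sum(preds),
  observed_total  = sum(fre_df$ClaimAmount))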
Conclusion
Bayesian optimization is a smart approach for tuning more complex learning algorithms with many hyperparameters when compute resources are slowing down the analysis. It is commonly used in deep learning, but it can also be useful when working with machine learning algorithms like GBMs (shown here), random forests, and support vector machines – really anything that would take too long to tune with a naive grid search. Even if you are working with a relatively simple algorithm – say a lasso regression, which involves a single hyperparameter \(\lambda\) to control the shrinkage/penalty – you may have only a small amount of compute available. If so, it could still make sense to use MBO to cut down the number of evaluations needed to find the optimum.