Bayesian Optimization of Machine Learning Models
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Max Kuhn: Director, Nonclinical Statistics, Pfizer
Many predictive and machine learning models have structural or tuning parameters that cannot be directly estimated from the data. For example, when using K-nearest neighbor model, there is no analytical estimator for K (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of values for K and the one associated with the best results is used. This is basically a grid search procedure. However, there are other approaches that can be used. I’ll demonstrate how Bayesian optimization and Gaussian process models can be used as an alternative.
To demonstrate, I’ll use the regression simulation system of Sapp et al. (2014) where the predictors (i.e. x’s) are independent Gaussian random variables with mean zero and a variance of 9. The prediction equation is:
x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 + I(x_7*x_8*x_9 < 0) + I(x_10 > 0) + x_11*I(x_11 > 0) + sqrt(abs(x_12)) + cos(x_13) + 2*x_14 + abs(x_15) + I(x_16 < -1) + x_17*I(x_17 < -1) - 2 * x_18 - x_19*x_20
The random error here is also Gaussian with mean zero and a variance of 9. This simulation is available in the caret package via a function called SLC14_1. First, we’ll simulate a training set of 250 data points and also a larger set that we will use to elucidate the true parameter surface:
> library(caret) > set.seed(7210) > train_dat <- SLC14_1(250) > large_dat <- SLC14_1(10000)
We will use a radial basis function support vector machine to model these data. For a fixed epsilon, the model will be tuned over the cost value and the radial basis kernel parameter, commonly denotes as sigma. Since we are simulating the data, we can figure out a good approximation to the relationship between these parameters and the root mean squared error (RMSE) or the model. Given our specific training set and the larger simulated sample, here is the RMSE surface for a wide range of values:
There is a wide range of parameter values that are associated with very low RMSE values in the northwest.
A simple way to get an initial assessment is to use random search where a set of random tuning parameter values are generated across a “wide range”. For a RBF SVM, caret’s train function defines wide as cost values between 2^c(-5, 10) and sigma values inside the range produced by the sigest function in the kernlab package. This code will do 20 random sub-models in this range:
> rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+                           search = "random")
> 
> set.seed(308) 
> rand_search <- train(y ~ ., data = train_dat,
+                      method = "svmRadial",
+                      ## Create 20 random parameter values
+                      tuneLength = 20,
+                      metric = "RMSE",
+                      preProc = c("center", "scale"),
+                      trControl = rand_ctrl)
> rand_search
Support Vector Machines with Radial Basis Function Kernel 
250 samples
 20 predictor
Pre-processing: centered (20), scaled (20) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 226, 224, 224, 225, 226, 224, ... 
Resampling results across tuning parameters:
  sigma       C             RMSE      Rsquared 
  0.01161955   42.75789360  10.50838  0.7299837
  0.01357777   67.97672171  10.71276  0.7212605
  0.01392676  828.08072944  10.75235  0.7195869
  0.01394119    0.18386619  18.56921  0.2109284
  0.01538656    0.05224914  19.33310  0.1890599
  0.01711920  228.59215128  11.09522  0.7047713
  0.01790202    0.78835920  16.78597  0.3217203
  0.01936110    0.91401289  16.45485  0.3492278
  0.02023763    0.07658831  19.03987  0.2081059
  0.02690269    0.04128731  19.33974  0.2126950
  0.02780880    0.64865483  16.52497  0.3545042
  0.02920113  974.08943821  12.22906  0.6508754
  0.02963586    1.19350198  15.46690  0.4407725
  0.03370625   31.45179445  12.60653  0.6314384
  0.03561750    0.04970422  19.23564  0.2306298
  0.03752561    0.06592800  19.07130  0.2375616
  0.03783570  398.44599747  12.92958  0.6143790
  0.04534046    3.91017571  13.56612  0.5798001
  0.05171719  296.65916049  13.88865  0.5622445
  0.06482201   47.31716568  14.66904  0.5192667
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were sigma = 0.01161955 and C
 = 42.75789. 
> ggplot(rand_search) + scale_x_log10() + scale_y_log10()
 
> getTrainPerf(rand_search)
  TrainRMSE TrainRsquared    method
1  10.50838     0.7299837 svmRadial
There are other approaches that we can take, including a more comprehensive grid search or using a nonlinear optimizer to find better values of cost and sigma. Another approach is to use Bayesian optimization to find good values for these parameters. This is an optimization scheme that uses Bayesian models based on Gaussian processes to predict good tuning parameters.
Gaussian Process (GP) regression is used to facilitate the Bayesian analysis. If creates a regression model to formalize the relationship between the outcome (RMSE, in this application) and the SVM tuning parameters. The standard assumption regarding normality of the residuals is used and, being a Bayesian model, the regression parameters also gain a prior distribution that is multivariate normal. The GP regression model uses a kernel basis expansion (much like the SVM model does) in order to allow the model to be nonlinear in the SVM tuning parameters. To do this, a radial basis function kernel is used for the covariance function of the multivariate normal prior and maximum likelihood is used to estimate the kernel parameters of the GP.
In the end, the GP regression model can take the current set of resampled RMSE values and make predictions over the entire space of potential cost and sigma parameters. The Bayesian machinery allows of this prediction to have a distribution; for a given set of tuning parameters, we can obtain the estimated mean RMSE values as well as an estimate of the corresponding prediction variance. For example, if we were to use our data from the random search to build a GP model, the predicted mean RMSE would look like:
The darker regions indicate smaller RMSE values given the current resampling results. The predicted standard deviation of the RMSE is:
The prediction noise becomes larger (e.g. darker) as we move away from the current set of observed values.
(The GPfit package was used to create these models.)
To find good parameters to test, there are several approaches. This paper (pdf) outlines several but we will use the confidence bound approach. For any combination of cost and sigma, we can compute the lower confidence bound of the predicted RMSE. Since this takes the uncertainty of prediction into account it has the potential to produce better directions to take the optimization. Here is a plot of the confidence bound using a single standard deviation of the predicted mean:
Darker values indicate better conditions to explore. Since we know the true RMSE surface, we can see that the best region (the northwest) is estimated to be an interesting location to take the optimization. The optimizer would pick a good location based on this model and evaluate this as the next parameter value. This most recent configuration is added to the GP’s training set and the process continues for a pre-specified number of iterations.
Yachen Yan created an R package for Bayesian optimization. He also made a modification so that we can use our initial random search as the substrate to the first GP used. To search a much wider parameter space, our code looks like:
> ## Define the resampling method
> ctrl <- trainControl(method = "repeatedcv", repeats = 5)
> 
> ## Use this function to optimize the model. The two parameters are 
> ## evaluated on the log scale given their range and scope. 
> svm_fit_bayes <- function(logC, logSigma) {
+   ## Use the same model code but for a single (C, sigma) pair. 
+   txt <- capture.output(
+     mod <- train(y ~ ., data = train_dat,
+                  method = "svmRadial",
+                  preProc = c("center", "scale"),
+                  metric = "RMSE",
+                  trControl = ctrl,
+                  tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
+   )
+   ## The function wants to _maximize_ the outcome so we return 
+   ## the negative of the resampled RMSE value. `Pred` can be used
+   ## to return predicted values but we'll avoid that and use zero
+   list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
+ }
> 
> ## Define the bounds of the search. 
> lower_bounds <- c(logC = -5, logSigma = -9)
> upper_bounds <- c(logC = 20, logSigma = -0.75)
> bounds <- list(logC = c(lower_bounds[1], upper_bounds[1]),
+                logSigma = c(lower_bounds[2], upper_bounds[2]))
> 
> ## Create a grid of values as the input into the BO code
> initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")]
> initial_grid$C <- log(initial_grid$C)
> initial_grid$sigma <- log(initial_grid$sigma)
> initial_grid$RMSE <- -initial_grid$RMSE
> names(initial_grid) <- c("logC", "logSigma", "Value")
> 
> ## Run the optimization with the initial grid and do
> ## 30 iterations. We will choose new parameter values
> ## using the upper confidence bound using 1 std. dev. 
> 
> library(rBayesianOptimization)
> 
> set.seed(8606)
> ba_search <- BayesianOptimization(svm_fit_bayes,
+                                   bounds = bounds,
+                                   init_grid_dt = initial_grid, 
+                                   init_points = 0, 
+                                   n_iter = 30,
+                                   acq = "ucb", 
+                                   kappa = 1, 
+                                   eps = 0.0,
+                                   verbose = TRUE)
20 points in hyperparameter space were pre-sampled
elapsed = 1.53  	Round = 21  logC = 5.4014   logSigma = -5.8974  Value = -10.8148 
elapsed = 1.54  	Round = 22  logC = 4.9757   logSigma = -5.0449  Value = -9.7936 
elapsed = 1.42  	Round = 23  logC = 5.7551   logSigma = -5.0244  Value = -9.8128 
elapsed = 1.30  	Round = 24  logC = 5.2754   logSigma = -4.9678  Value = -9.7530 
elapsed = 1.39  	Round = 25  logC = 5.3009   logSigma = -5.0921  Value = -9.5516 
elapsed = 1.48  	Round = 26  logC = 5.3240   logSigma = -5.2313  Value = -9.6571 
elapsed = 1.39  	Round = 27  logC = 5.3750   logSigma = -5.1152  Value = -9.6619 
elapsed = 1.44  	Round = 28  logC = 5.2356   logSigma = -5.0969  Value = -9.4167 
elapsed = 1.38  	Round = 29  logC = 11.8347  logSigma = -5.1074  Value = -9.6351 
elapsed = 1.42  	Round = 30  logC = 15.7494  logSigma = -5.1232  Value = -9.4243 
elapsed = 25.24 	Round = 31  logC = 14.6657  logSigma = -7.9164  Value = -8.8410 
elapsed = 32.60 	Round = 32  logC = 18.3793  logSigma = -8.1083  Value = -8.7139 
elapsed = 1.86  	Round = 33  logC = 20.0000  logSigma = -5.6297  Value = -9.0580 
elapsed = 0.97  	Round = 34  logC = 20.0000  logSigma = -1.5768  Value = -19.2183 
elapsed = 5.92  	Round = 35  logC = 17.3827  logSigma = -6.6880  Value = -9.0224 
elapsed = 18.01 	Round = 36  logC = 20.0000  logSigma = -7.6071  Value = -8.5728 
elapsed = 114.49	Round = 37  logC = 16.0079  logSigma = -9.0000  Value = -8.7058 
elapsed = 89.31		Round = 38  logC = 12.8319  logSigma = -9.0000  Value = -8.6799 
elapsed = 99.29		Round = 39  logC = 20.0000  logSigma = -9.0000  Value = -8.5596 
elapsed = 106.88	Round = 40  logC = 14.1190  logSigma = -9.0000  Value = -8.5150 
elapsed = 4.84		Round = 41  logC = 13.4694  logSigma = -6.5271  Value = -8.9728 
elapsed = 108.37	Round = 42  logC = 19.0216  logSigma = -9.0000  Value = -8.7461 
elapsed = 52.43 	Round = 43  logC = 13.5273  logSigma = -8.5130  Value = -8.8728 
elapsed = 39.69 	Round = 44  logC = 20.0000  logSigma = -8.3288  Value = -8.4956 
elapsed = 5.99  	Round = 45  logC = 20.0000  logSigma = -6.7208  Value = -8.9455 
elapsed = 113.01	Round = 46  logC = 14.9611  logSigma = -9.0000  Value = -8.7576 
elapsed = 27.45 	Round = 47  logC = 19.6181  logSigma = -7.9872  Value = -8.6186 
elapsed = 116.00	Round = 48  logC = 17.3060  logSigma = -9.0000  Value = -8.6820 
elapsed = 2.26  	Round = 49  logC = 14.2698  logSigma = -5.8297  Value = -9.1837 
elapsed = 64.50 	Round = 50  logC = 20.0000  logSigma = -8.6438  Value = -8.6914 
Best Parameters Found: 
Round = 44  logC = 20.0000  logSigma = -8.3288  Value = -8.4956 
Animate the search!
The final settings were found at iteration 44 with a cost setting of 485,165,195 and sigma=0.0002043. I would have never thought to evaluate a cost parameter so large and the algorithm wants to make it even larger. Does it really work?
We can fit a model based on the new configuration and compare it to random search in terms of the resampled RMSE and the RMSE on the test set:
> set.seed(308)
> final_search <- train(y ~ ., data = train_dat,
+                       method = "svmRadial",
+                       tuneGrid = data.frame(C = exp(ba_search$Best_Par["logC"]), 
+                                             sigma = exp(ba_search$Best_Par["logSigma"])),
+                       metric = "RMSE",
+                       preProc = c("center", "scale"),
+                       trControl = ctrl)
> compare_models(final_search, rand_search)
    One Sample t-test
data:  x
t = -9.0833, df = 49, p-value = 4.431e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -2.34640 -1.49626
sample estimates:
mean of x 
 -1.92133 
> postResample(predict(rand_search, large_dat), large_dat$y)
      RMSE   Rsquared 
10.1112280  0.7648765 
> postResample(predict(final_search, large_dat), large_dat$y)
     RMSE  Rsquared 
8.2668843 0.8343405 
Much Better!
Thanks to Yachen Yan for making the rBayesianOptimization package.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
