My first post in 2022! A very happy new year to anyone reading this.
I was looking for a simple and effective way to tune xgboost models in R and came across the ParBayesianOptimization package. Here’s a quick tutorial on how to use it to tune an xgboost model.
# Pacman is a package management tool
install.packages("pacman")
library(pacman)
# p_load automatically installs packages if needed
p_load(xgboost, ParBayesianOptimization, mlbench, dplyr, skimr, recipes, rsample)
Data prep
# Load up some data
data("BostonHousing2")
# Data summary
skim(BostonHousing2)
Table 1: Data summary

| Name                   | BostonHousing2 |
|------------------------|----------------|
| Number of rows         | 506            |
| Number of columns      | 19             |
| Column type frequency: |                |
| factor                 | 2              |
| numeric                | 17             |
| Group variables        | None           |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| town | 0 | 1 | FALSE | 92 | Cam: 30, Bos: 23, Lyn: 22, Bos: 19 |
| chas | 0 | 1 | FALSE | 2 | 0: 471, 1: 35 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| tract | 0 | 1 | 2700.36 | 1380.04 | 1.00 | 1303.25 | 3393.50 | 3739.75 | 5082.00 | ▅▂▂▇▂ |
| lon | 0 | 1 | -71.06 | 0.08 | -71.29 | -71.09 | -71.05 | -71.02 | -70.81 | ▁▂▇▂▁ |
| lat | 0 | 1 | 42.22 | 0.06 | 42.03 | 42.18 | 42.22 | 42.25 | 42.38 | ▁▃▇▃▁ |
| medv | 0 | 1 | 22.53 | 9.20 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
| cmedv | 0 | 1 | 22.53 | 9.18 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
| crim | 0 | 1 | 3.61 | 8.60 | 0.01 | 0.08 | 0.26 | 3.68 | 88.98 | ▇▁▁▁▁ |
| zn | 0 | 1 | 11.36 | 23.32 | 0.00 | 0.00 | 0.00 | 12.50 | 100.00 | ▇▁▁▁▁ |
| indus | 0 | 1 | 11.14 | 6.86 | 0.46 | 5.19 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |
| nox | 0 | 1 | 0.55 | 0.12 | 0.38 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▆▅▁ |
| rm | 0 | 1 | 6.28 | 0.70 | 3.56 | 5.89 | 6.21 | 6.62 | 8.78 | ▁▂▇▂▁ |
| age | 0 | 1 | 68.57 | 28.15 | 2.90 | 45.02 | 77.50 | 94.07 | 100.00 | ▂▂▂▃▇ |
| dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.21 | 5.19 | 12.13 | ▇▅▂▁▁ |
| rad | 0 | 1 | 9.55 | 8.71 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |
| tax | 0 | 1 | 408.24 | 168.54 | 187.00 | 279.00 | 330.00 | 666.00 | 711.00 | ▇▇▃▁▇ |
| ptratio | 0 | 1 | 18.46 | 2.16 | 12.60 | 17.40 | 19.05 | 20.20 | 22.00 | ▁▃▅▅▇ |
| b | 0 | 1 | 356.67 | 91.29 | 0.32 | 375.38 | 391.44 | 396.22 | 396.90 | ▁▁▁▁▇ |
| lstat | 0 | 1 | 12.65 | 7.14 | 1.73 | 6.95 | 11.36 | 16.96 | 37.97 | ▇▇▅▂▁ |
Looks like there are two factor variables. We’ll need to convert them into numeric variables before we proceed. I’ll use the recipes package to one-hot encode them.
# Predicting median house prices
rec <- recipe(cmedv ~ ., data = BostonHousing2) %>%
  # Collapse categories where population is < 3%
  step_other(town, chas, threshold = .03, other = "Other") %>%
  # Create dummy variables for all factor variables
  step_dummy(all_nominal_predictors())

# Train the recipe on the data set
prep <- prep(rec, training = BostonHousing2)

# Create the final model matrix
model_df <- bake(prep, new_data = BostonHousing2)

# All levels have been one-hot encoded and separate columns have been appended to the model matrix
colnames(model_df)

##  [1] "tract"                  "lon"                    "lat"
##  [4] "medv"                   "crim"                   "zn"
##  [7] "indus"                  "nox"                    "rm"
## [10] "age"                    "dis"                    "rad"
## [13] "tax"                    "ptratio"                "b"
## [16] "lstat"                  "cmedv"                  "town_Boston.Savin.Hill"
## [19] "town_Cambridge"         "town_Lynn"              "town_Newton"
## [22] "town_Other"             "chas_X1"
Next, we can use the rsample package to create test/train splits.
splits <- rsample::initial_split(model_df, prop = 0.7)

# Training set
train_df <- rsample::training(splits)

# Test set
test_df <- rsample::testing(splits)

dim(train_df)
## [1] 354 23

dim(test_df)
## [1] 152 23
Finding optimal parameters
Now we can start to run some optimisations using the ParBayesianOptimization package.
# The xgboost interface accepts matrices
X <- train_df %>%
  # Remove the target variables
  select(-c(medv, cmedv)) %>%
as.matrix()
# Get the target variable
y <- train_df %>%
pull(cmedv)
# Cross validation folds
folds <- list(fold1 = as.integer(seq(1, nrow(X), by = 5)),
fold2 = as.integer(seq(2, nrow(X), by = 5)))
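The folds here are just two simple index vectors, which keeps the cross validation quick. If you'd rather use proper k-fold splits, the same list-of-indices format can be built with rsample; a minimal sketch, assuming vfold_cv() and complement() behave as documented:

# Alternative: k-fold hold-out indices via rsample (sketch only)
cv <- rsample::vfold_cv(train_df, v = 5)
folds_alt <- lapply(cv$splits, rsample::complement)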
We’ll need an objective function which can be fed to the optimiser. We’ll use the value of the evaluation metric from xgb.cv() as the value that needs to be optimised.
# Function must take the hyper-parameters as inputs
obj_func <- function(eta, max_depth, min_child_weight, subsample, lambda, alpha) {
param <- list(
# Hyper-parameters
eta = eta,
max_depth = max_depth,
min_child_weight = min_child_weight,
subsample = subsample,
lambda = lambda,
alpha = alpha,
# Tree model
booster = "gbtree",
# Regression problem
objective = "reg:squarederror",
# Use the Mean Absolute Percentage Error
eval_metric = "mape")
xgbcv <- xgb.cv(params = param,
data = X,
label = y,
nrounds = 50,
folds = folds,
prediction = TRUE,
early_stopping_rounds = 5,
verbose = 0,
maximize = F)
lst <- list(
# First argument must be named as "Score"
# Function finds maxima so inverting the output
Score = -min(xgbcv$evaluation_log$test_mape_mean),
# Get number of trees for the best performing model
nrounds = xgbcv$best_iteration
)
return(lst)
}
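Before handing this function to the optimiser, it's worth checking that it runs end to end and returns a list with a Score element. A quick sanity check with arbitrary parameter values (the numbers below are purely for illustration):

# Sanity check: arbitrary parameter values, just to confirm the function
# returns a Score and an nrounds value
obj_func(eta = 0.1, max_depth = 5L, min_child_weight = 10,
         subsample = 0.8, lambda = 2, alpha = 2)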
Once we have the objective function, we’ll need to define some bounds for the optimiser to search within.
bounds <- list(eta = c(0.001, 0.2),
max_depth = c(1L, 10L),
min_child_weight = c(1, 50),
subsample = c(0.1, 1),
lambda = c(1, 10),
alpha = c(1, 10))
We can now run the optimiser to find a set of optimal hyper-parameters.
set.seed(1234)
bayes_out <- bayesOpt(FUN = obj_func,
                      bounds = bounds,
                      initPoints = length(bounds) + 2,
                      iters.n = 3)

# Show relevant columns from the summary object
bayes_out$scoreSummary[1:5, c(3:8, 13)]

##           eta max_depth min_child_weight subsample   lambda    alpha      Score
## 1: 0.13392137         8         4.913332 0.2105925 4.721124 3.887629 -0.0902970
## 2: 0.19400811         2        25.454160 0.9594105 9.329695 3.173695 -0.1402720
## 3: 0.16079775         2        14.035652 0.5118349 1.229953 5.093530 -0.1475580
## 4: 0.08957707         4        12.534842 0.3844404 4.358837 1.788342 -0.1410245
## 5: 0.02876388         4        36.586761 0.8107181 6.137100 6.039125 -0.3061535

# Get best parameters
data.frame(getBestPars(bayes_out))

##         eta max_depth min_child_weight subsample lambda alpha
## 1 0.1905414         8         1.541476 0.8729207      1     1
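The optimiser object keeps the full search history, so you can dig into it further. Two things I find handy (a sketch, assuming the plot() method and addIterations() from ParBayesianOptimization):

# Visualise how the score and acquisition utility evolved over iterations
plot(bayes_out)

# If the search doesn't look settled, more iterations can be added to the
# same object instead of starting from scratch
bayes_out <- addIterations(bayes_out, iters.n = 2)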
Fitting the model
We can now fit a model and check how well these parameters work.
# Combine best params with base params
opt_params <- append(list(booster = "gbtree",
objective = "reg:squarederror",
eval_metric = "mae"),
getBestPars(bayes_out))
# Run cross validation
xgbcv <- xgb.cv(params = opt_params,
data = X,
label = y,
nrounds = 100,
folds = folds,
prediction = TRUE,
early_stopping_rounds = 5,
verbose = 0,
maximize = F)
# Get the optimal number of boosting rounds
nrounds <- xgbcv$best_iteration

# Fit an xgboost model
mdl <- xgboost(data = X, label = y,
params = opt_params,
maximize = F,
early_stopping_rounds = 5,
nrounds = nrounds,
verbose = 0)
# Evaluate performance
actuals <- test_df$cmedv
predicted <- test_df %>%
select_at(mdl$feature_names) %>%
as.matrix %>%
predict(mdl, newdata = .)
# Compute MAPE
mean(abs(actuals - predicted)/actuals)
## [1] 0.008391282
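The same actuals/predicted vectors can feed other error metrics, and the fitted booster can tell us which features mattered most. A short sketch using xgb.importance() and xgb.plot.importance() from the xgboost package:

# RMSE on the test set
sqrt(mean((actuals - predicted)^2))

# Feature importance from the fitted booster
imp <- xgb.importance(model = mdl)
xgb.plot.importance(imp, top_n = 10)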
Compare with grid search
grd <- expand.grid(
eta = seq(0.001, 0.2, length.out = 5),
max_depth = seq(2L, 10L, by = 1),
min_child_weight = seq(1, 25, length.out = 3),
subsample = c(0.25, 0.5, 0.75, 1),
lambda = c(1, 5, 10),
alpha = c(1, 5, 10))
dim(grd)
grd_out <- apply(grd, 1, function(par) {
  par <- append(par, list(booster = "gbtree",
                          objective = "reg:squarederror",
                          eval_metric = "mae"))
  mdl <- xgboost(data = X, label = y, params = par, nrounds = 50,
                 early_stopping_rounds = 5, maximize = F, verbose = 0)
  lst <- data.frame(par, score = mdl$best_score)
  return(lst)
})
grd_out <- do.call(rbind, grd_out)
best_par <- grd_out %>%
data.frame() %>%
arrange(score) %>%
.[1,]
# Fit final model
params <- as.list(best_par[-length(best_par)])
xgbcv <- xgb.cv(params = params,
data = X,
label = y,
nrounds = 100,
folds = folds,
prediction = TRUE,
early_stopping_rounds = 5,
verbose = 0,
maximize = F)
nrounds <- xgbcv$best_iteration
mdl <- xgboost(data = X,
label = y,
params = params,
maximize = F,
early_stopping_rounds = 5,
nrounds = nrounds,
verbose = 0)
# Evaluate on test set
act <- test_df$cmedv
pred <- test_df %>%
select_at(mdl$feature_names) %>%
as.matrix %>%
predict(mdl, newdata = .)
mean(abs(act - pred)/act)
## [1] 0.009407723
While both methods give similar final results, the Bayesian optimiser completed its search in under a minute, whereas the grid search took over seven minutes. In practice, Bayesian optimisation also lets me search a much larger parameter space in the same amount of time than a traditional grid search would allow.
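If you want to reproduce the timing comparison on your own machine, wrapping each search in system.time() is enough; a rough sketch (exact numbers will obviously vary with hardware):

# Rough wall-clock comparison; results will differ from machine to machine
bayes_time <- system.time(
  bayesOpt(FUN = obj_func, bounds = bounds,
           initPoints = length(bounds) + 2, iters.n = 3)
)

grid_time <- system.time(
  apply(grd, 1, function(par) {
    par <- append(par, list(booster = "gbtree",
                            objective = "reg:squarederror",
                            eval_metric = "mae"))
    xgboost(data = X, label = y, params = par, nrounds = 50,
            early_stopping_rounds = 5, maximize = F, verbose = 0)$best_score
  })
)

rbind(bayes = bayes_time, grid = grid_time)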
Thoughts? Comments? Helpful? Not helpful? Like to see anything else added in here? Let me know!
