Using Bayesian optimisation to tune an XGBoost model in R
My first post in 2022! A very happy new year to anyone reading this.
I was looking for a simple and effective way to tune xgboost models in R and came across a package called ParBayesianOptimization. Here's a quick tutorial on how to use it to tune an xgboost model.
# Pacman is a package management tool
install.packages("pacman")

library(pacman)

# p_load automatically installs packages if needed
p_load(xgboost, ParBayesianOptimization, mlbench, dplyr, skimr, recipes, rsample)
Data prep
# Load up some data
data("BostonHousing2")

# Data summary
skim(BostonHousing2)
Table: Table 1: Data summary
Name | BostonHousing2 |
---|---|
Number of rows | 506 |
Number of columns | 19 |
Column type frequency: | |
factor | 2 |
numeric | 17 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
town | 0 | 1 | FALSE | 92 | Cam: 30, Bos: 23, Lyn: 22, Bos: 19 |
chas | 0 | 1 | FALSE | 2 | 0: 471, 1: 35 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
tract | 0 | 1 | 2700.36 | 1380.04 | 1.00 | 1303.25 | 3393.50 | 3739.75 | 5082.00 | ▅▂▂▇▂ |
lon | 0 | 1 | -71.06 | 0.08 | -71.29 | -71.09 | -71.05 | -71.02 | -70.81 | ▁▂▇▂▁ |
lat | 0 | 1 | 42.22 | 0.06 | 42.03 | 42.18 | 42.22 | 42.25 | 42.38 | ▁▃▇▃▁ |
medv | 0 | 1 | 22.53 | 9.20 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
cmedv | 0 | 1 | 22.53 | 9.18 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
crim | 0 | 1 | 3.61 | 8.60 | 0.01 | 0.08 | 0.26 | 3.68 | 88.98 | ▇▁▁▁▁ |
zn | 0 | 1 | 11.36 | 23.32 | 0.00 | 0.00 | 0.00 | 12.50 | 100.00 | ▇▁▁▁▁ |
indus | 0 | 1 | 11.14 | 6.86 | 0.46 | 5.19 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |
nox | 0 | 1 | 0.55 | 0.12 | 0.38 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▆▅▁ |
rm | 0 | 1 | 6.28 | 0.70 | 3.56 | 5.89 | 6.21 | 6.62 | 8.78 | ▁▂▇▂▁ |
age | 0 | 1 | 68.57 | 28.15 | 2.90 | 45.02 | 77.50 | 94.07 | 100.00 | ▂▂▂▃▇ |
dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.21 | 5.19 | 12.13 | ▇▅▂▁▁ |
rad | 0 | 1 | 9.55 | 8.71 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |
tax | 0 | 1 | 408.24 | 168.54 | 187.00 | 279.00 | 330.00 | 666.00 | 711.00 | ▇▇▃▁▇ |
ptratio | 0 | 1 | 18.46 | 2.16 | 12.60 | 17.40 | 19.05 | 20.20 | 22.00 | ▁▃▅▅▇ |
b | 0 | 1 | 356.67 | 91.29 | 0.32 | 375.38 | 391.44 | 396.22 | 396.90 | ▁▁▁▁▇ |
lstat | 0 | 1 | 12.65 | 7.14 | 1.73 | 6.95 | 11.36 | 16.96 | 37.97 | ▇▇▅▂▁ |
Looks like there are two factor variables. We'll need to convert them into numeric variables before we proceed. I'll use the recipes package to one-hot encode them.
# Predicting median house prices
rec <- recipe(cmedv ~ ., data = BostonHousing2) %>%
  # Collapse categories where population is < 3%
  step_other(town, chas, threshold = .03, other = "Other") %>%
  # Create dummy variables for all factor variables
  step_dummy(all_nominal_predictors())

# Train the recipe on the data set
prep <- prep(rec, training = BostonHousing2)

# Create the final model matrix
model_df <- bake(prep, new_data = BostonHousing2)

# All levels have been one hot encoded and separate columns have been appended to the model matrix
colnames(model_df)

##  [1] "tract"                  "lon"                    "lat"
##  [4] "medv"                   "crim"                   "zn"
##  [7] "indus"                  "nox"                    "rm"
## [10] "age"                    "dis"                    "rad"
## [13] "tax"                    "ptratio"                "b"
## [16] "lstat"                  "cmedv"                  "town_Boston.Savin.Hill"
## [19] "town_Cambridge"         "town_Lynn"              "town_Newton"
## [22] "town_Other"             "chas_X1"
Next, we can use the rsample package to create test/train splits.
splits <- rsample::initial_split(model_df, prop = 0.7)

# Training set
train_df <- rsample::training(splits)

# Test set
test_df <- rsample::testing(splits)

dim(train_df)
## [1] 354 23

dim(test_df)
## [1] 152 23
Finding optimal parameters
Now we can start to run some optimisations using the ParBayesianOptimization package.
# The xgboost interface accepts matrices
X <- train_df %>%
  # Remove the target variable and the near-duplicate medv column
  select(-c(medv, cmedv)) %>%
  as.matrix()

# Get the target variable
y <- train_df %>%
  pull(cmedv)

# Cross validation folds
folds <- list(fold1 = as.integer(seq(1, nrow(X), by = 5)),
              fold2 = as.integer(seq(2, nrow(X), by = 5)))
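Note that these two hand-rolled folds only ever use about 40% of the training rows as test observations, which is fine for a quick tutorial. If you prefer conventional k-fold indices that partition every row, something like the sketch below would also work with xgb.cv's folds argument (this is an alternative I'm suggesting, not what was used for the results that follow).

# Hypothetical alternative: five folds that together cover every row of X
set.seed(42)
folds_full <- split(sample(nrow(X)), rep(1:5, length.out = nrow(X)))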
We'll need an objective function which can be fed to the optimiser. We'll use the value of the evaluation metric from xgb.cv() as the value that needs to be optimised.
# Function must take the hyper-parameters as inputs
obj_func <- function(eta, max_depth, min_child_weight, subsample, lambda, alpha) {

  param <- list(
    # Hyper-parameters
    eta = eta,
    max_depth = max_depth,
    min_child_weight = min_child_weight,
    subsample = subsample,
    lambda = lambda,
    alpha = alpha,

    # Tree model
    booster = "gbtree",

    # Regression problem
    objective = "reg:squarederror",

    # Use the Mean Absolute Percentage Error
    eval_metric = "mape")

  xgbcv <- xgb.cv(params = param,
                  data = X,
                  label = y,
                  nrounds = 50,
                  folds = folds,
                  prediction = TRUE,
                  early_stopping_rounds = 5,
                  verbose = 0,
                  maximize = FALSE)

  lst <- list(
    # First argument must be named as "Score"
    # Function finds maxima so inverting the output
    Score = -min(xgbcv$evaluation_log$test_mape_mean),

    # Get number of trees for the best performing model
    nrounds = xgbcv$best_iteration
  )

  return(lst)
}
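Before handing this function to the optimiser, it can be worth calling it once by hand to confirm it returns a Score and an nrounds value. The parameter values below are arbitrary and only there to exercise the function.

# Quick sanity check with arbitrary hyper-parameter values
obj_func(eta = 0.1, max_depth = 5, min_child_weight = 10,
         subsample = 0.8, lambda = 2, alpha = 2)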
Once we have the objective function, we’ll need to define some bounds for the optimiser to search within.
bounds <- list(eta = c(0.001, 0.2),
               max_depth = c(1L, 10L),
               min_child_weight = c(1, 50),
               subsample = c(0.1, 1),
               lambda = c(1, 10),
               alpha = c(1, 10))
We can now run the optimiser to find a set of optimal hyper-parameters.
set.seed(1234)
bayes_out <- bayesOpt(FUN = obj_func,
                      bounds = bounds,
                      initPoints = length(bounds) + 2,
                      iters.n = 3)

# Show relevant columns from the summary object
bayes_out$scoreSummary[1:5, c(3:8, 13)]

##           eta max_depth min_child_weight subsample   lambda    alpha      Score
## 1: 0.13392137         8         4.913332 0.2105925 4.721124 3.887629 -0.0902970
## 2: 0.19400811         2        25.454160 0.9594105 9.329695 3.173695 -0.1402720
## 3: 0.16079775         2        14.035652 0.5118349 1.229953 5.093530 -0.1475580
## 4: 0.08957707         4        12.534842 0.3844404 4.358837 1.788342 -0.1410245
## 5: 0.02876388         4        36.586761 0.8107181 6.137100 6.039125 -0.3061535

# Get best parameters
data.frame(getBestPars(bayes_out))

##         eta max_depth min_child_weight subsample lambda alpha
## 1 0.1905414         8         1.541476 0.8729207      1     1
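If you want to see how the search behaved across epochs, ParBayesianOptimization also provides a plot method for bayesOpt objects. This is an optional extra for inspection; nothing below depends on it.

# Visualise the progress of the Bayesian optimisation
plot(bayes_out)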
Fitting the model
We can now fit a model and check how well these parameters work.
# Combine best params with base params
opt_params <- append(list(booster = "gbtree",
                          objective = "reg:squarederror",
                          eval_metric = "mae"),
                     getBestPars(bayes_out))

# Run cross validation
xgbcv <- xgb.cv(params = opt_params,
                data = X,
                label = y,
                nrounds = 100,
                folds = folds,
                prediction = TRUE,
                early_stopping_rounds = 5,
                verbose = 0,
                maximize = FALSE)

# Get optimal number of rounds
nrounds = xgbcv$best_iteration

# Fit an xgb model
mdl <- xgboost(data = X,
               label = y,
               params = opt_params,
               maximize = FALSE,
               early_stopping_rounds = 5,
               nrounds = nrounds,
               verbose = 0)

# Evaluate performance
actuals <- test_df$cmedv
predicted <- test_df %>%
  select_at(mdl$feature_names) %>%
  as.matrix %>%
  predict(mdl, newdata = .)

# Compute MAPE
mean(abs(actuals - predicted)/actuals)

## [1] 0.008391282
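As a quick check on what the tuned model actually relies on, xgboost's built-in importance functions can be run on the fitted model. This is an optional inspection step, not part of the tuning workflow itself.

# Gain-based feature importance for the tuned model
imp <- xgb.importance(model = mdl)
head(imp)

# Optional bar chart of the top features
xgb.plot.importance(imp, top_n = 10)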
Compare with grid search
grd <- expand.grid(
  eta = seq(0.001, 0.2, length.out = 5),
  max_depth = seq(2L, 10L, by = 1),
  min_child_weight = seq(1, 25, length.out = 3),
  subsample = c(0.25, 0.5, 0.75, 1),
  lambda = c(1, 5, 10),
  alpha = c(1, 5, 10))

dim(grd)

grd_out <- apply(grd, 1, function(par) {

  par <- append(par, list(booster = "gbtree",
                          objective = "reg:squarederror",
                          eval_metric = "mae"))

  mdl <- xgboost(data = X,
                 label = y,
                 params = par,
                 nrounds = 50,
                 early_stopping_rounds = 5,
                 maximize = FALSE,
                 verbose = 0)

  lst <- data.frame(par, score = mdl$best_score)

  return(lst)
})

grd_out <- do.call(rbind, grd_out)

best_par <- grd_out %>%
  data.frame() %>%
  arrange(score) %>%
  .[1, ]

# Fit final model
params <- as.list(best_par[-length(best_par)])

xgbcv <- xgb.cv(params = params,
                data = X,
                label = y,
                nrounds = 100,
                folds = folds,
                prediction = TRUE,
                early_stopping_rounds = 5,
                verbose = 0,
                maximize = FALSE)

nrounds = xgbcv$best_iteration

mdl <- xgboost(data = X,
               label = y,
               params = params,
               maximize = FALSE,
               early_stopping_rounds = 5,
               nrounds = nrounds,
               verbose = 0)

# Evaluate on test set
act <- test_df$cmedv
pred <- test_df %>%
  select_at(mdl$feature_names) %>%
  as.matrix %>%
  predict(mdl, newdata = .)

mean(abs(act - pred)/act)

## [1] 0.009407723
While both methods produce similar final results, the Bayesian optimiser completed its search in less than a minute, whereas the grid search took over seven minutes. I also find that Bayesian optimisation lets me search a larger parameter space more quickly than a traditional grid search.
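If you want to check the timing comparison on your own machine, wrapping each search in system.time() is one simple way to do it. This is just a sketch that reuses the objects defined earlier; the exact numbers will differ by machine, seed and grid size.

# Rough timing comparison (results will vary)
system.time({
  bayes_out <- bayesOpt(FUN = obj_func, bounds = bounds,
                        initPoints = length(bounds) + 2, iters.n = 3)
})

system.time({
  grd_out <- apply(grd, 1, function(par) {
    par <- append(par, list(booster = "gbtree",
                            objective = "reg:squarederror",
                            eval_metric = "mae"))
    mdl <- xgboost(data = X, label = y, params = par, nrounds = 50,
                   early_stopping_rounds = 5, maximize = FALSE, verbose = 0)
    data.frame(par, score = mdl$best_score)
  })
})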
Thoughts? Comments? Helpful? Not helpful? Like to see anything else added in here? Let me know!