My first post in 2022! A very happy new year to anyone reading this.
I was looking for a simple and effective way to tune xgboost models in R and came across the ParBayesianOptimization package. Here’s a quick tutorial on how to use it to tune an xgboost model.
# Pacman is a package management tool
install.packages("pacman")
library(pacman)
# p_load automatically installs packages if needed
p_load(xgboost, ParBayesianOptimization, mlbench, dplyr, skimr, recipes, rsample)
Data prep
# Load up some data
data("BostonHousing2")
# Data summary
skim(BostonHousing2)
Table 1: Data summary

| Name                   | BostonHousing2 |
|------------------------|----------------|
| Number of rows         | 506            |
| Number of columns      | 19             |
| Column type frequency: |                |
| factor                 | 2              |
| numeric                | 17             |
| Group variables        | None           |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| town | 0 | 1 | FALSE | 92 | Cam: 30, Bos: 23, Lyn: 22, Bos: 19 |
| chas | 0 | 1 | FALSE | 2 | 0: 471, 1: 35 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| tract | 0 | 1 | 2700.36 | 1380.04 | 1.00 | 1303.25 | 3393.50 | 3739.75 | 5082.00 | ▅▂▂▇▂ |
| lon | 0 | 1 | -71.06 | 0.08 | -71.29 | -71.09 | -71.05 | -71.02 | -70.81 | ▁▂▇▂▁ |
| lat | 0 | 1 | 42.22 | 0.06 | 42.03 | 42.18 | 42.22 | 42.25 | 42.38 | ▁▃▇▃▁ |
| medv | 0 | 1 | 22.53 | 9.20 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
| cmedv | 0 | 1 | 22.53 | 9.18 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
| crim | 0 | 1 | 3.61 | 8.60 | 0.01 | 0.08 | 0.26 | 3.68 | 88.98 | ▇▁▁▁▁ |
| zn | 0 | 1 | 11.36 | 23.32 | 0.00 | 0.00 | 0.00 | 12.50 | 100.00 | ▇▁▁▁▁ |
| indus | 0 | 1 | 11.14 | 6.86 | 0.46 | 5.19 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |
| nox | 0 | 1 | 0.55 | 0.12 | 0.38 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▆▅▁ |
| rm | 0 | 1 | 6.28 | 0.70 | 3.56 | 5.89 | 6.21 | 6.62 | 8.78 | ▁▂▇▂▁ |
| age | 0 | 1 | 68.57 | 28.15 | 2.90 | 45.02 | 77.50 | 94.07 | 100.00 | ▂▂▂▃▇ |
| dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.21 | 5.19 | 12.13 | ▇▅▂▁▁ |
| rad | 0 | 1 | 9.55 | 8.71 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |
| tax | 0 | 1 | 408.24 | 168.54 | 187.00 | 279.00 | 330.00 | 666.00 | 711.00 | ▇▇▃▁▇ |
| ptratio | 0 | 1 | 18.46 | 2.16 | 12.60 | 17.40 | 19.05 | 20.20 | 22.00 | ▁▃▅▅▇ |
| b | 0 | 1 | 356.67 | 91.29 | 0.32 | 375.38 | 391.44 | 396.22 | 396.90 | ▁▁▁▁▇ |
| lstat | 0 | 1 | 12.65 | 7.14 | 1.73 | 6.95 | 11.36 | 16.96 | 37.97 | ▇▇▅▂▁ |
Looks like there are two factor variables. We’ll need to convert them into numeric variables before we proceed. I’ll use the recipes package to one-hot encode them.
# Predicting median house prices
rec <- recipe(cmedv ~ ., data = BostonHousing2) %>%
  # Collapse categories where population is < 3%
  step_other(town, chas, threshold = .03, other = "Other") %>%
  # Create dummy variables for all factor variables
  step_dummy(all_nominal_predictors())

# Train the recipe on the data set
prep <- prep(rec, training = BostonHousing2)

# Create the final model matrix
model_df <- bake(prep, new_data = BostonHousing2)

# All levels have been one-hot encoded and separate columns have been appended to the model matrix
colnames(model_df)

##  [1] "tract"                  "lon"                    "lat"
##  [4] "medv"                   "crim"                   "zn"
##  [7] "indus"                  "nox"                    "rm"
## [10] "age"                    "dis"                    "rad"
## [13] "tax"                    "ptratio"                "b"
## [16] "lstat"                  "cmedv"                  "town_Boston.Savin.Hill"
## [19] "town_Cambridge"         "town_Lynn"              "town_Newton"
## [22] "town_Other"             "chas_X1"
Next, we can use the rsample package to create test/train splits.
splits <- rsample::initial_split(model_df, prop = 0.7)

# Training set
train_df <- rsample::training(splits)

# Test set
test_df <- rsample::testing(splits)

dim(train_df)
## [1] 354 23

dim(test_df)
## [1] 152 23
Finding optimal parameters
Now we can start to run some optimisations using the ParBayesianOptimization package.
# The xgboost interface accepts matrices
X <- train_df %>%
  # Remove the target variables
  select(-c(medv, cmedv)) %>%
as.matrix()
# Get the target variable
y <- train_df %>%
pull(cmedv)
# Cross validation folds
folds <- list(fold1 = as.integer(seq(1, nrow(X), by = 5)),
fold2 = as.integer(seq(2, nrow(X), by = 5)))
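The folds here are just two simple index vectors, which keeps the cross validation quick. If you'd rather use proper k-fold splits, the same list-of-indices format can be built with rsample; a minimal sketch, assuming vfold_cv() and complement() behave as documented:

# Alternative: k-fold hold-out indices via rsample (sketch only)
cv <- rsample::vfold_cv(train_df, v = 5)
folds_alt <- lapply(cv$splits, rsample::complement)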
We’ll need an objective function which can be fed to the optimiser. We’ll use the value of the evaluation metric from xgb.cv() as the value that needs to be optimised.
# Function must take the hyper-parameters as inputs
obj_func <- function(eta, max_depth, min_child_weight, subsample, lambda, alpha) {
param <- list(
# Hyper-parameters
eta = eta,
max_depth = max_depth,
min_child_weight = min_child_weight,
subsample = subsample,
lambda = lambda,
alpha = alpha,
# Tree model
booster = "gbtree",
# Regression problem
objective = "reg:squarederror",
# Use the Mean Absolute Percentage Error
eval_metric = "mape")
xgbcv <- xgb.cv(params = param,
data = X,
label = y,
nrounds = 50,
folds = folds,
prediction = TRUE,
early_stopping_rounds = 5,
verbose = 0,
maximize = F)
lst <- list(
# First argument must be named as "Score"
# Function finds maxima so inverting the output
Score = -min(xgbcv$evaluation_log$test_mape_mean),
# Get number of trees for the best performing model
nrounds = xgbcv$best_iteration
)
return(lst)
}
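Before handing this function to the optimiser, it's worth checking that it runs end to end and returns a list with a Score element. A quick sanity check with arbitrary parameter values (the numbers below are purely for illustration):

# Sanity check: arbitrary parameter values, just to confirm the function
# returns a Score and an nrounds value
obj_func(eta = 0.1, max_depth = 5L, min_child_weight = 10,
         subsample = 0.8, lambda = 2, alpha = 2)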
Once we have the objective function, we’ll need to define some bounds for the optimiser to search within.
bounds <- list(eta = c(0.001, 0.2),
max_depth = c(1L, 10L),
min_child_weight = c(1, 50),
subsample = c(0.1, 1),
lambda = c(1, 10),
alpha = c(1, 10))
We can now run the optimiser to find a set of optimal hyper-parameters.
set.seed(1234)
bayes_out <- bayesOpt(FUN = obj_func,
                      bounds = bounds,
                      initPoints = length(bounds) + 2,
                      iters.n = 3)

# Show relevant columns from the summary object
bayes_out$scoreSummary[1:5, c(3:8, 13)]

##           eta max_depth min_child_weight subsample   lambda    alpha      Score
## 1: 0.13392137         8         4.913332 0.2105925 4.721124 3.887629 -0.0902970
## 2: 0.19400811         2        25.454160 0.9594105 9.329695 3.173695 -0.1402720
## 3: 0.16079775         2        14.035652 0.5118349 1.229953 5.093530 -0.1475580
## 4: 0.08957707         4        12.534842 0.3844404 4.358837 1.788342 -0.1410245
## 5: 0.02876388         4        36.586761 0.8107181 6.137100 6.039125 -0.3061535

# Get best parameters
data.frame(getBestPars(bayes_out))

##         eta max_depth min_child_weight subsample lambda alpha
## 1 0.1905414         8         1.541476 0.8729207      1     1
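The optimiser object keeps the full search history, so you can dig into it further. Two things I find handy (a sketch, assuming the plot() method and addIterations() from ParBayesianOptimization):

# Visualise how the score and acquisition utility evolved over iterations
plot(bayes_out)

# If the search doesn't look settled, more iterations can be added to the
# same object instead of starting from scratch
bayes_out <- addIterations(bayes_out, iters.n = 2)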
Fitting the model
We can now fit a model and check how well these parameters work.
# Combine best params with base params
opt_params <- append(list(booster = "gbtree",
objective = "reg:squarederror",
eval_metric = "mae"),
getBestPars(bayes_out))
# Run cross validation
xgbcv <- xgb.cv(params = opt_params,
data = X,
label = y,
nrounds = 100,
folds = folds,
prediction = TRUE,
early_stopping_rounds = 5,
verbose = 0,
maximize = F)
# Get the optimal number of boosting rounds
nrounds <- xgbcv$best_iteration

# Fit an xgboost model
mdl <- xgboost(data = X, label = y,
params = opt_params,
maximize = F,
early_stopping_rounds = 5,
nrounds = nrounds,
verbose = 0)
# Evaluate performance
actuals <- test_df$cmedv
predicted <- test_df %>%
select_at(mdl$feature_names) %>%
as.matrix %>%
predict(mdl, newdata = .)
# Compute MAPE
mean(abs(actuals - predicted)/actuals)
## [1] 0.008391282
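The same actuals/predicted vectors can feed other error metrics, and the fitted booster can tell us which features mattered most. A short sketch using xgb.importance() and xgb.plot.importance() from the xgboost package:

# RMSE on the test set
sqrt(mean((actuals - predicted)^2))

# Feature importance from the fitted booster
imp <- xgb.importance(model = mdl)
xgb.plot.importance(imp, top_n = 10)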
Compare with grid search
grd <- expand.grid(
eta = seq(0.001, 0.2, length.out = 5),
max_depth = seq(2L, 10L, by = 1),
min_child_weight = seq(1, 25, length.out = 3),
subsample = c(0.25, 0.5, 0.75, 1),
lambda = c(1, 5, 10),
alpha = c(1, 5, 10))
dim(grd)
grd_out <- apply(grd, 1, function(par) {
  par <- append(par, list(booster = "gbtree",
                          objective = "reg:squarederror",
                          eval_metric = "mae"))
  mdl <- xgboost(data = X, label = y, params = par, nrounds = 50,
                 early_stopping_rounds = 5, maximize = F, verbose = 0)
  lst <- data.frame(par, score = mdl$best_score)
  return(lst)
})
grd_out <- do.call(rbind, grd_out)
best_par <- grd_out %>%
data.frame() %>%
arrange(score) %>%
.[1,]
# Fit final model
params <- as.list(best_par[-length(best_par)])
xgbcv <- xgb.cv(params = params,
data = X,
label = y,
nrounds = 100,
folds = folds,
prediction = TRUE,
early_stopping_rounds = 5,
verbose = 0,
maximize = F)
nrounds <- xgbcv$best_iteration
mdl <- xgboost(data = X,
label = y,
params = params,
maximize = F,
early_stopping_rounds = 5,
nrounds = nrounds,
verbose = 0)
# Evaluate on test set
act <- test_df$cmedv
pred <- test_df %>%
select_at(mdl$feature_names) %>%
as.matrix %>%
predict(mdl, newdata = .)
mean(abs(act - pred)/act)
## [1] 0.009407723
While both methods give similar final results, the Bayesian optimiser completed its search in under a minute, whereas the grid search took over seven minutes. In practice, Bayesian optimisation also lets me search a much larger parameter space in the same amount of time than a traditional grid search would allow.
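If you want to reproduce the timing comparison on your own machine, wrapping each search in system.time() is enough; a rough sketch (exact numbers will obviously vary with hardware):

# Rough wall-clock comparison; results will differ from machine to machine
bayes_time <- system.time(
  bayesOpt(FUN = obj_func, bounds = bounds,
           initPoints = length(bounds) + 2, iters.n = 3)
)

grid_time <- system.time(
  apply(grd, 1, function(par) {
    par <- append(par, list(booster = "gbtree",
                            objective = "reg:squarederror",
                            eval_metric = "mae"))
    xgboost(data = X, label = y, params = par, nrounds = 50,
            early_stopping_rounds = 5, maximize = F, verbose = 0)$best_score
  })
)

rbind(bayes = bayes_time, grid = grid_time)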
Thoughts? Comments? Helpful? Not helpful? Like to see anything else added in here? Let me know!
