
Boilerplate XGBoost with {healthyR.ai}


Introduction

XGBoost, short for “eXtreme Gradient Boosting,” is a powerful and popular machine learning library that is specifically designed for gradient boosting. It is an open-source library and is available in many programming languages, including R.

Gradient boosting is a technique that combines the predictions of multiple weak models to create a strong, more accurate model. XGBoost is an optimized version of gradient boosting that is designed to run faster and more efficiently than other implementations.
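
To make the idea concrete, here is a tiny sketch of the boosting loop itself. This is an illustration only, not how xgboost is implemented internally: each round fits a shallow {rpart} tree to the residuals of the current prediction and adds a scaled copy of it to the ensemble (all values below are illustrative).

# A toy gradient-boosting loop for squared-error loss, using shallow
# rpart trees as the weak learners
library(rpart)

set.seed(123)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
df <- data.frame(x = x, y = y)

pred <- rep(mean(y), nrow(df))   # start from a constant prediction
eta  <- 0.1                      # learning rate

for (m in 1:50) {
  df$resid <- y - pred                               # pseudo-residuals for squared error
  tree <- rpart(resid ~ x, data = df,
                control = rpart.control(maxdepth = 2))
  pred <- pred + eta * predict(tree, df)             # add the scaled weak learner
}

mean((y - pred)^2)   # training error shrinks as rounds are added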

Let’s take a look at a simple example of how to use XGBoost in R. We will use the iris dataset, a well-known dataset that contains 150 observations of iris flowers, each with four features (sepal length, sepal width, petal length, and petal width) and one target variable (the species of iris). Our goal is to train a model to predict the species of an iris flower based on its features.

First, we need to install the “xgboost” package in R:

install.packages("xgboost")

Next, we load the iris dataset and split it into training and test sets:

data(iris)
set.seed(123)
indices <- sample(1:nrow(iris), 0.8*nrow(iris))
train_data <- iris[indices, 1:4]
train_label <- iris[indices, 5]
test_data <- iris[-indices, 1:4]
test_label <- iris[-indices, 5]

Now we can train our XGBoost model:

library(xgboost)
xgb_model <- xgboost(
  data = as.matrix(train_data),          # xgboost expects a numeric matrix
  label = as.integer(train_label) - 1,   # and zero-based integer class labels
  nrounds = 100, 
  objective = "multi:softmax", 
  num_class = 3
  )

Here, we specified the training data (converted to a numeric matrix), the labels (converted to zero-based integers, which xgboost expects), the number of rounds (iterations) to run, the objective (multiclass classification), and the number of classes.

Finally, we can use the trained model to make predictions on the test set:

predictions <- predict(xgb_model, as.matrix(test_data))

We can also evaluate the performance of our model by comparing the predicted labels to the true labels using metrics such as accuracy:

accuracy <- mean(predictions == as.integer(test_label) - 1)  # compare on the same zero-based encoding
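
Beyond a single accuracy number, a confusion matrix shows which classes the model mixes up. Here is a quick sketch with base R's table(), mapping the zero-based numeric predictions back to the original species labels:

# Rows are predicted species, columns are the true species
pred_species <- factor(levels(iris$Species)[predictions + 1],
                       levels = levels(iris$Species))
table(predicted = pred_species, actual = test_label)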

In this example, we used XGBoost to train a model to predict the species of iris flowers based on their features. We saw that XGBoost is a powerful and efficient library for gradient boosting, and it can be easily integrated into an R script.

Keep in mind that this is a simple example; in real-world scenarios, more preprocessing and parameter tuning are necessary to achieve optimal performance. The dataset here is also small and the number of rounds modest, which is not representative of real-world work, but the example shows the basic usage of XGBoost in R.
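
As one example of that kind of tuning, xgboost ships with xgb.cv() for cross-validation. The sketch below (parameter values are illustrative, not tuned) uses 5-fold cross-validation with early stopping to pick the number of rounds:

# 5-fold cross-validation to choose nrounds via early stopping
cv <- xgboost::xgb.cv(
  params = list(objective = "multi:softmax", num_class = 3,
                eta = 0.1, max_depth = 3),
  data = as.matrix(train_data),
  label = as.integer(train_label) - 1,
  nrounds = 200,
  nfold = 5,
  early_stopping_rounds = 10,
  verbose = 0
)
cv$best_iteration   # the round count to use when refitting the final model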

Ok, so, what’s the point? Is there a possibly easier way to do this? Yes! You can use the boilerplate function hai_auto_xgboost() and its data prep helper hai_xgboost_data_prepper() from the {healthyR.ai} library. Let’s see how that works.


Function

Here is the data prepper function and its arguments.

hai_xgboost_data_prepper(.data, .recipe_formula)

Here is the boilerplate function

hai_auto_xgboost(
  .data,
  .rec_obj,
  .splits_obj = NULL,
  .rsamp_obj = NULL,
  .tune = TRUE,
  .grid_size = 10,
  .num_cores = 1,
  .best_metric = "f_meas",
  .model_type = "classification"
)

Its arguments are shown in the function signature above; see the package documentation for details on each.


Example

Let’s take a look at an example and its output. This is using {parsnip} under the hood.

library(healthyR.ai)

data <- iris

rec_obj <- hai_xgboost_data_prepper(data, Species ~ .)

auto_xgb <- hai_auto_xgboost(
  .data = data,
  .rec_obj = rec_obj,
  .best_metric = "f_meas",
  .num_cores = 1
)

There are three main outputs to this function, which are:

  1. recipe_info
  2. model_info
  3. tuned_info

Let’s take a look at each. First, the recipe_info.

auto_xgb$recipe_info

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          4

Operations:

Factor variables from tidyselect::vars_select_helpers$where(is.charac...
Novel factor level assignment for recipes::all_nominal_predictors()
Dummy variables from recipes::all_nominal_predictors()
Zero variance filter on recipes::all_predictors()
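
As a side note, this is a plain {recipes} object, so you can prep and bake it yourself to see exactly what it does to the data (this step is not part of the hai_auto_xgboost() output):

# Prep the recipe and bake it to inspect the processed training data
library(recipes)
baked <- bake(prep(rec_obj), new_data = NULL)
head(baked)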

Now the model_info

auto_xgb$model_info

$model_spec
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = tune::tune()
  min_n = tune::tune()
  tree_depth = tune::tune()
  learn_rate = tune::tune()
  loss_reduction = tune::tune()
  sample_size = tune::tune()

Computational engine: xgboost 


$wflw
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_string2factor()
• step_novel()
• step_dummy()
• step_zv()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = tune::tune()
  min_n = tune::tune()
  tree_depth = tune::tune()
  learn_rate = tune::tune()
  loss_reduction = tune::tune()
  sample_size = tune::tune()

Computational engine: xgboost 


$fitted_wflw
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_string2factor()
• step_novel()
• step_dummy()
• step_zv()

── Model ───────────────────────────────────────────────────────────────────────
##### xgb.Booster
raw: 2.5 Mb 
call:
  xgboost::xgb.train(params = list(eta = 0.10962507492329, max_depth = 13L, 
    gamma = 0.000498577409120534, colsample_bytree = 1, colsample_bynode = 1, 
    min_child_weight = 3L, subsample = 0.594320066112559), data = x$data, 
    nrounds = 1240L, watchlist = x$watchlist, verbose = 0, nthread = 1, 
    objective = "multi:softprob", num_class = 3L)
params (as set within xgb.train):
  eta = "0.10962507492329", max_depth = "13", gamma = "0.000498577409120534", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "3", subsample = "0.594320066112559", nthread = "1", objective = "multi:softprob", num_class = "3", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 4 
niter: 1240
nfeatures : 4 
evaluation_log:
    iter training_mlogloss
       1        0.96929822
       2        0.85785438
---                       
    1239        0.07815044
    1240        0.07808817

$was_tuned
[1] "tuned"

Now the tuned_info

auto_xgb$tuned_info

$tuning_grid
# A tibble: 10 × 6
   trees min_n tree_depth learn_rate loss_reduction sample_size
   <int> <int>      <int>      <dbl>          <dbl>       <dbl>
 1   926     6          2    0.0246        2.21e- 1       0.952
 2  1510    25         14    0.00189       1.01e+ 1       0.424
 3  1077    29          9    0.195         1.34e- 5       0.319
 4   795    32          3    0.00102       1.64e- 3       0.686
 5   368    22          4    0.00549       2.97e- 7       0.735
 6  1240     3         13    0.110         4.99e- 4       0.594
 7  1839    18          5    0.0501        1.67e- 7       0.273
 8   139    11         10    0.0153        1.17e- 2       0.483
 9   470    40          8    0.0906        6.79e-10       0.168
10  1732    16         11    0.00667       9.19e- 9       0.883

$cv_obj
# Monte Carlo cross-validation (0.75/0.25) with 25 resamples  
# A tibble: 25 × 2
   splits          id        
   <list>          <chr>     
 1 <split [84/28]> Resample01
 2 <split [84/28]> Resample02
 3 <split [84/28]> Resample03
 4 <split [84/28]> Resample04
 5 <split [84/28]> Resample05
 6 <split [84/28]> Resample06
 7 <split [84/28]> Resample07
 8 <split [84/28]> Resample08
 9 <split [84/28]> Resample09
10 <split [84/28]> Resample10
# … with 15 more rows

$tuned_results
# Tuning results
# Monte Carlo cross-validation (0.75/0.25) with 25 resamples  
# A tibble: 25 × 4
   splits          id         .metrics            .notes          
   <list>          <chr>      <list>              <list>          
 1 <split [84/28]> Resample01 <tibble [110 × 10]> <tibble [1 × 3]>
 2 <split [84/28]> Resample02 <tibble [110 × 10]> <tibble [1 × 3]>
 3 <split [84/28]> Resample03 <tibble [110 × 10]> <tibble [1 × 3]>
 4 <split [84/28]> Resample04 <tibble [110 × 10]> <tibble [1 × 3]>
 5 <split [84/28]> Resample05 <tibble [110 × 10]> <tibble [1 × 3]>
 6 <split [84/28]> Resample06 <tibble [110 × 10]> <tibble [1 × 3]>
 7 <split [84/28]> Resample07 <tibble [110 × 10]> <tibble [1 × 3]>
 8 <split [84/28]> Resample08 <tibble [110 × 10]> <tibble [1 × 3]>
 9 <split [84/28]> Resample09 <tibble [110 × 10]> <tibble [1 × 3]>
10 <split [84/28]> Resample10 <tibble [110 × 10]> <tibble [1 × 3]>
# … with 15 more rows

There were issues with some computations:

  - Warning(s) x1: While computing multiclass `precision()`, some levels had no pred...
    (the same warning was repeated once per resample)

Run `show_notes(.Last.tune.result)` for more information.

$grid_size
[1] 10

$best_metric
[1] "f_meas"

$best_result_set
# A tibble: 1 × 12
  trees min_n tree_depth learn_rate loss_r…¹ sampl…² .metric .esti…³  mean     n
  <int> <int>      <int>      <dbl>    <dbl>   <dbl> <chr>   <chr>   <dbl> <int>
1  1240     3         13      0.110 0.000499   0.594 f_meas  macro   0.944    25
# … with 2 more variables: std_err <dbl>, .config <chr>, and abbreviated
#   variable names ¹​loss_reduction, ²​sample_size, ³​.estimator

$tuning_grid_plot

[Tuning Grid plot]

Voila!
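
From here, the fitted workflow stored in model_info can be used for predictions with the standard {workflows} predict() method. A quick sketch, scoring the iris data we trained on:

# The trained workflow handles the recipe preprocessing automatically
fitted_wflw <- auto_xgb$model_info$fitted_wflw

preds <- predict(fitted_wflw, new_data = iris)   # tibble with a .pred_class column
table(preds$.pred_class, iris$Species)           # compare against the known species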
