Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
@drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.
Why a post on xgboost and pipelearner?
xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-fold validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!
The only problem – out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.
Setup
To follow this post you’ll need the following packages:
# Install (if necessary) install.packages(c("xgboost", "tidyverse", "devtools")) devtools::install_github("drsimonj/pipelearner") # Attach library(tidyverse) library(xgboost) library(pipelearner) library(lazyeval)
Our example will be to try and predict whether tumours are cancerous or not using the Breast Cancer Wisconsin (Diagnostic) Data Set. Set up as follows:
data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data' d <- read_csv( data_url, col_names = c('id', 'thinkness', 'size_uniformity', 'shape_uniformity', 'adhesion', 'epith_size', 'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>% select(-id) %>% # Remove id; not useful here filter(nuclei != '?') %>% # Remove records with missing data mutate(cancer = cancer == 4) %>% # one-hot encode 'cancer' as 1=malignant;0=benign mutate_all(as.numeric) # All to numeric; needed for XGBoost d #> # A tibble: 683 × 10 #> thinkness size_uniformity shape_uniformity adhesion epith_size nuclei #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 5 1 1 1 2 1 #> 2 5 4 4 5 7 10 #> 3 3 1 1 1 2 2 #> 4 6 8 8 1 3 4 #> 5 4 1 1 3 2 1 #> 6 8 10 10 8 7 10 #> 7 1 1 1 1 2 10 #> 8 2 1 2 1 2 1 #> 9 2 1 1 1 2 1 #> 10 4 2 1 1 2 1 #> # ... with 673 more rows, and 4 more variables: chromatin <dbl>, #> # nucleoli <dbl>, mitoses <dbl>, cancer <dbl>
pipelearner
pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.
Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit
and maxdepth
for an rpart decision tree:
pipelearner(d, rpart::rpart, cancer ~ ., minsplit = c(2, 4, 6, 8, 10), maxdepth = c(2, 3, 4, 5))
The challenge for xgboost:
pipelearner expects a model function that has two arguments:
data
andformula
xgboost
Here’s an xgboost model:
# Prep data (X) and labels (y) X <- select(d, -cancer) %>% as.matrix() y <- d$cancer # Fit the model fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic") #> [1] train-rmse:0.372184 #> [2] train-rmse:0.288560 #> [3] train-rmse:0.230171 #> [4] train-rmse:0.188965 #> [5] train-rmse:0.158858 # Examine accuracy predicted <- as.numeric(predict(fit, X) >= .5) mean(predicted == y) #> [1] 0.9838946
Look like we have a model with 98.39% accuracy on the training data!
Regardless, notice that first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!
Wrapper function to parse data
and formula
To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data
and formula
, and uses these to pass a feature matrix and label vector to xgboost
:
pl_xgboost <- function(data, formula, ...) { data <- as.data.frame(data) X_names <- as.character(f_rhs(formula)) y_name <- as.character(f_lhs(formula)) if (X_names == '.') { X_names <- names(data)[names(data) != y_name] } X <- data.matrix(data[, X_names]) y <- data[[y_name]] xgboost(data = X, label = y, ...) }
Let’s try it out:
pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic") #> [1] train-rmse:0.372184 #> [2] train-rmse:0.288560 #> [3] train-rmse:0.230171 #> [4] train-rmse:0.188965 #> [5] train-rmse:0.158858 # Examine accuracy pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5) mean(pl_predicted == y) #> [1] 0.9838946
Perfect!
Bringing it all together
We can now use pipelearner
and pl_xgboost()
for easy grid searching:
pl <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = c(5, 10, 25), eta = c(.1, .3), max_depth = c(4, 6)) fits <- pl %>% learn() #> [1] train-rmse:0.453832 #> [2] train-rmse:0.412548 #> ... fits #> # A tibble: 12 × 9 #> models.id cv_pairs.id train_p fit target model #> <chr> <chr> <dbl> <list> <chr> <chr> #> 1 1 1 1 <S3: xgb.Booster> cancer pl_xgboost #> 2 10 1 1 <S3: xgb.Booster> cancer pl_xgboost #> 3 11 1 1 <S3: xgb.Booster> cancer pl_xgboost #> 4 12 1 1 <S3: xgb.Booster> cancer pl_xgboost #> 5 2 1 1 <S3: xgb.Booster> cancer pl_xgboost #> 6 3 1 1 <S3: xgb.Booster> cancer pl_xgboost #> 7 4 1 1 <S3: xgb.Booster> cancer pl_xgboost #> 8 5 1 1 <S3: xgb.Booster> cancer pl_xgboost #> 9 6 1 1 <S3: xgb.Booster> cancer pl_xgboost #> 10 7 1 1 <S3: xgb.Booster> cancer pl_xgboost #> 11 8 1 1 <S3: xgb.Booster> cancer pl_xgboost #> 12 9 1 1 <S3: xgb.Booster> cancer pl_xgboost #> # ... with 3 more variables: params <list>, train <list>, test <list>
Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:
accuracy <- function(fit, data, target_var) { # Convert resample object to data frame data <- as.data.frame(data) # Get feature matrix and labels X <- data %>% select(-matches(target_var)) %>% as.matrix() y <- data[[target_var]] # Obtain predicted class y_hat <- as.numeric(predict(fit, X) > .5) # Return accuracy mean(y_hat == y) } results <- fits %>% mutate( # hyperparameters nrounds = map_dbl(params, "nrounds"), eta = map_dbl(params, "eta"), max_depth = map_dbl(params, "max_depth"), # Accuracy accuracy_train = pmap_dbl(list(fit, train, target), accuracy), accuracy_test = pmap_dbl(list(fit, test, target), accuracy) ) %>% # Select columns and order rows select(nrounds, eta, max_depth, contains("accuracy")) %>% arrange(desc(accuracy_test), desc(accuracy_train)) results #> # A tibble: 12 × 5 #> nrounds eta max_depth accuracy_train accuracy_test #> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 25 0.3 6 1.0000000 0.9489051 #> 2 25 0.3 4 1.0000000 0.9489051 #> 3 10 0.3 6 0.9981685 0.9489051 #> 4 5 0.3 6 0.9945055 0.9489051 #> 5 10 0.1 6 0.9945055 0.9489051 #> 6 25 0.1 6 0.9945055 0.9489051 #> 7 5 0.1 6 0.9926740 0.9489051 #> 8 25 0.1 4 0.9890110 0.9489051 #> 9 10 0.3 4 0.9871795 0.9489051 #> 10 5 0.3 4 0.9853480 0.9489051 #> 11 10 0.1 4 0.9853480 0.9416058 #> 12 5 0.1 4 0.9835165 0.9416058
Our top model, which got 94.89% on a test set, had nrounds
= 25, eta
= 0.3, and max_depth
= 6.
Either way, the trick was the wrapper function pl_xgboost()
that let us bridge xgboost and pipelearner. Note that this same principle can be used for any other machine learning functions that don’t play nice with pipelearner.
Bonus: bootstrapped cross validation
For those of you who are comfortable, below is a bonus example of using 100 boostrapped cross validation samples to examine consistency in the accuracy. It doesn’t get much easier than using pipelearner!
results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>% learn_cvpairs(n = 100) %>% learn() %>% mutate( test_accuracy = pmap_dbl(list(fit, test, target), accuracy) ) #> [1] train-rmse:0.357471 #> [2] train-rmse:0.256735 #> ... results %>% ggplot(aes(test_accuracy)) + geom_histogram(bins = 30) + scale_x_continuous(labels = scales::percent) + theme_minimal() + labs(x = "Accuracy", y = "Number of samples", title = "Test accuracy distribution for\n100 bootstrapped samples")
Sign off
Thanks for reading and I hope this was useful for you.
For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.
If you’d like the code that produced this blog, check out the blogR GitHub repository.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.