Monotonic binning using XGBoost
When developing credit risk scorecards, it is generally a good idea to discretise (bin) numeric variables in a manner that ensures monotonically increasing or decreasing event rates as the variable increases or decreases. While discretising individual variables adds stability to the model, monotonic bins ensure that the model output is consistent and interpretable (i.e. if variable ‘x’ increases, the computed score increases across each bin). We’ll explore how to create monotonic bins in R using xgboost.
Libraries
# Pacman is a package management tool
install.packages("pacman")
library(pacman)

# p_load automatically installs packages if needed
p_load(recipes, dplyr, ggplot2, xgboost, gridExtra)
Sample dataset
Here’s a small sample of the Lending Club dataset available on Kaggle.
sample <- read.csv("credit_sample.csv")

dim(sample)
## [1] 10000 153

class(sample)
## [1] "data.frame"
Create a target
Like in my previous post, I’ll use the loan_status column as the target variable.
# Specific values to be tagged as 'bad'
codes <- c("Charged Off",
           "Does not meet the credit policy. Status:Charged Off")

model_data <- sample %>%
  mutate(bad_flag = ifelse(loan_status %in% codes, 1, 0))
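As a quick sanity check (a minimal sketch, not part of the original workflow), we can look at the distribution of the new flag and the overall event rate:

# Counts of good (0) vs bad (1) loans
table(model_data$bad_flag)

# Overall event rate
mean(model_data$bad_flag)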
Data prep
We’ll use the recipes package to remove non-numeric variables and impute missing values using the median. For additional details, see the documentation for recipes. Note that the formula inside the recipe() function decides which columns are predictors and which column is the target.
# Specify basic recipe
rec <- recipe(bad_flag ~ ., data = model_data) %>%
  step_select(where(is.numeric)) %>%
  step_impute_median(all_predictors())

rec <- prep(rec, training = model_data)

# Not doing a test/train split
train <- bake(rec, new_data = model_data)
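Before moving on, it’s worth verifying that the imputation step worked; a one-line check (a small sketch, assuming the recipe ran as expected):

# Should return FALSE now that medians have been imputed
anyNA(train)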
Analysing directional trend
Now that we have a clean training dataset, it’s important to ascertain how the event rate should change when a particular variable changes. This is important since this directional trend will dictate how we constrain the xgboost model.
A good way to do this is to use both data and intuition. As an example, consider the variable inq_last_6mths (number of inquiries in the last 6 months). Intuitively, as the number of inquiries increases, one would expect the event rate (chance of default) to increase. We can validate this using a simple bar chart like the one shown below.
data.frame(x = model_data$inq_last_6mths,
           y = model_data$bad_flag) %>%
  filter(x <= 5) %>%
  group_by(x) %>%
  summarise(count = n(), events = sum(y)) %>%
  mutate(pct = events/count) %>%
  ggplot(aes(x = factor(x), y = pct)) +
  geom_col() +
  theme_minimal() +
  labs(x = "# of inquiries in past 6 months",
       y = "Default rate",
       title = "Default rate vs number of inquiries in past 6 months",
       subtitle = "Positive relationship")
This confirms our hypothesis and also tells us that we need to constrain the xgboost model such that the predicted probability increases as the value of the variable inq_last_6mths increases.
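The same directional check can be done programmatically. Spearman correlation measures the strength of a monotonic relationship, which is exactly the check used inside the helper function built later in this post; a minimal sketch:

# Spearman correlation measures the monotonic relationship;
# its sign suggests the constraint direction (+1 or -1)
corr <- cor(train$inq_last_6mths, train$bad_flag, method = "spearman")
ifelse(corr > 0, 1, -1) # expect +1 for this variable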
xgboost model
We’ll create an xgb model with the following specs:
- One boosting iteration (nrounds = 1)
- monotone_constraints = 1 (i.e. splits which only increase the probability outcome)
- max_depth = 10 (as an example, can be deeper if one needs additional bins)
mdl <- xgboost(
  data = train %>%
    select(inq_last_6mths) %>% ## Select only inq_last_6mths
    as.matrix(),               ## convert to matrix since the xgboost() interface only accepts matrices
  label = train[["bad_flag"]], ## Target variable
  nrounds = 1,                 ## Only one boosting iteration
  params = list(objective = "binary:logistic", ## Binary outcome
                monotone_constraints = 1,
                max_depth = 10))

## [07:07:34] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
## [1] train-logloss:0.541928
Retrieving splits
Now that we have a model, we need to retrieve the split points and evaluate whether the binning scheme is intuitive (or not).
# Convert model into a dataframe like output
splits <- xgb.model.dt.tree(model = mdl)

# Add +/- Inf to provide coverage for values not observed
# in the training dataset
cuts <- c(-Inf, sort(splits$Split), Inf)

# Plot bins and event rates
data.frame(target = train$bad_flag,
           buckets = cut(train$inq_last_6mths,
                         breaks = cuts,
                         include.lowest = T,
                         right = T,
                         ordered_result = T)) %>%
  group_by(buckets) %>%
  summarise(total = n(), events = sum(target == 1)) %>%
  mutate(pct = events/total) %>%
  ggplot(aes(x = buckets, y = pct)) +
  geom_col() +
  theme_minimal() +
  labs(x = "Bins",
       y = "Default rate",
       title = "Monotonic binning for number of inquiries in past 6 months",
       subtitle = "Monotonically increasing event rate")
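It can also help to inspect the raw split points before plotting. A small sketch: leaf nodes carry NA in the Split column (which sort() drops by default), and unique() guards against the same value appearing in more than one node:

# Leaf rows have NA in 'Split'; sort() removes NAs by default
sort(unique(splits$Split))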
Creating a function
Finally, we can encapsulate everything we have done so far inside a function for better usability.
create_bins <- function(var, outcome, max_depth = 10, plot = T){

  # Check if relationship is positive or negative
  # Using spearman since it measures strength of monotonic relationship
  corr <- cor(var, outcome, method = "spearman")
  direction <- ifelse(corr > 0, 1, -1)

  # Build XGB model
  mdl <- xgboost(
    verbose = 0,
    data = as.matrix(var),
    label = outcome,
    nrounds = 1,
    params = list(objective = "binary:logistic", ## Binary outcome
                  monotone_constraints = direction,
                  max_depth = max_depth,
                  eval_metric = "auc"))

  # Retrieve splits
  splits <- xgb.model.dt.tree(model = mdl)
  cuts <- c(-Inf, sort(splits$Split), Inf)

  binned <- cut(var,
                breaks = cuts,
                include.lowest = T,
                right = T,
                ordered_result = T)

  # Create an event rate plot
  plt <- data.frame(outcome, binned) %>%
    group_by(binned) %>%
    summarise(total = n(), events = sum(outcome == 1)) %>%
    mutate(pct = events/total) %>%
    ggplot(aes(x = binned, y = pct)) +
    geom_col() +
    theme_minimal() +
    labs(x = "Bins",
         y = "Event Rate",
         title = "Monotonic binning output")

  if(plot == T){
    print(plt)
  }

  # List to be returned
  lst <- list(
    var = var,
    binned_var = binned,
    cor = corr,
    plot = plt
  )

  return(lst)
}

# Test function
v1 <- create_bins(train$fico_range_high, train$bad_flag, max_depth = 10, plot = F)
v2 <- create_bins(train$delinq_amnt, train$bad_flag, max_depth = 10, plot = F)
v3 <- create_bins(train$int_rate, train$bad_flag, max_depth = 10, plot = F)
v4 <- create_bins(train$annual_inc, train$bad_flag, max_depth = 10, plot = F)

grid.arrange(v1$plot + labs(subtitle = "Fico Range High"),
             v2$plot + labs(subtitle = "Delinq Amount"),
             v3$plot + labs(subtitle = "Interest Rate"),
             v4$plot + labs(subtitle = "Annual Income"),
             ncol = 2)
And that’s it! We can use what we just built to discretise the variables we need, perform one-hot-encoding or WOE-transformations, and feed the appropriate model matrix to our choice of statistical routine.
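To make the WOE step concrete, here is a minimal sketch of a WOE transformation applied to one of the binned variables from above. The compute_woe() helper is hypothetical (not part of this post’s code) and uses the standard formula WOE = ln(% of goods in bin / % of bads in bin):

# Hypothetical helper: Weight of Evidence per bin
# WOE = ln( (goods in bin / total goods) / (bads in bin / total bads) )
compute_woe <- function(binned, outcome){
  data.frame(bin = binned, bad = outcome) %>%
    group_by(bin) %>%
    summarise(goods = sum(bad == 0), bads = sum(bad == 1)) %>%
    mutate(woe = log((goods / sum(goods)) / (bads / sum(bads))))
}

# Note: bins with zero goods or bads produce infinite WOE;
# a small count adjustment is typically applied in practice
compute_woe(v3$binned_var, train$bad_flag)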
Parting notes
Check out this package called MonotonicOptimalBinning by Wensui Liu, which offers multiple binning strategies like isotonic binning, quantile binning and k-means binning.
Thoughts? Comments? Helpful? Not helpful? Like to see anything else added in here? Let me know!