Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Here’s a short post on why one should consider developing a segmented model when building credit risk scorecards to help optimize approval rates. While segmentation offers a variety of benefits, this post aims to offer a different perspective.
I will be reusing code from a previous post. Feel free to navigate to this post for additional information. Note that sample datasets are available for download from (here)
Packages
#install.packages("pacman") ## Install if needed library(pacman) # p_load automatically installs packages if needed p_load(dplyr, magrittr, scales, pROC)
Sample data
smple <- read.csv("../../../download/credit_sample.csv") # Define target codes <- c("Charged Off", "Does not meet the credit policy. Status:Charged Off") smple %<>% mutate(bad_flag = ifelse(loan_status %in% codes, 1, 0)) # Some basic data cleaning smple[is.na(smple)] <- -1 smple %<>% # Remove cases where home ownership and payment plan are not reported filter(! home_ownership %in% c("", "NONE"), pymnt_plan != "") %>% # Convert these two variables into factors mutate(home_ownership = factor(home_ownership), pymnt_plan = factor(pymnt_plan))
Segments
Typically, credit risk scorecards would have segments like known goods
, known bads
, ever chaged off
etc. For simplicity, I will use the available FICO grade
variable in the dataset to segment known goods
and known bads
. For the purposes of this post, let’s focus on the known goods
customers (mostly because policy filters would remove the known bad customers).
smple %>% group_by(grade) %>% summarise(total = n(), bads = sum(bad_flag == 1)) %>% mutate(event_rate = percent(bads / total)) ## # A tibble: 7 × 4 ## grade total bads event_rate ## <chr> <int> <int> <chr> ## 1 A 1882 68 3.6% ## 2 B 3056 239 7.8% ## 3 C 2805 373 13.3% ## 4 D 1435 257 17.9% ## 5 E 590 144 24.4% ## 6 F 182 65 35.7% ## 7 G 49 16 32.7%
Since the event rates are significantly higher, let’s keep borrowers who have a FICO grade
of "E", "F" & "G"
in the known bads
bucket and rest in the known goods
bucket.
smple %<>% mutate(segment = ifelse(grade %in% c("E", "F", "G"), "KB", "KG")) smple %>% group_by(segment) %>% tally() ## # A tibble: 2 × 2 ## segment n ## <chr> <int> ## 1 KB 821 ## 2 KG 9178
Model training
Let’s train two separate sets models like so:
- One to the entire population
- And one model to the
known goods
# Create a formula object to be used across models form <- as.formula( "bad_flag ~ mths_since_last_delinq + total_pymnt + acc_now_delinq + inq_last_6mths + delinq_amnt + mths_since_last_record + mths_since_recent_revol_delinq + mths_since_last_major_derog + mths_since_recent_inq + mths_since_recent_bc + num_accts_ever_120_pd" )
Model on the entire population
set.seed(1234) # Train Test split idx <- sample(1:nrow(smple), size = 0.7 * nrow(smple), replace = F) train <- smple[idx,] test <- smple[-idx,] # Using a GLM model for simplicity mdl_pop <- glm( formula = form, family = "binomial", data = train ) # Get performance on entire sample auc(test$bad_flag, predict(mdl_pop, test)) ## Setting levels: control = 0, case = 1 ## Setting direction: controls < cases ## Area under the curve: 0.6602
Now that we have a model, let’s assume an expected default propensity
target of 2%
. That is to say I can only approve those customers who I believe have an expected default propensity of 2%
or less. Given this, what would be my expected approval rate
? Note that I need to maximise my approval rate (to better utilise my applicant funnel).
# Output scaling function scaling_func <- function(vec, PDO = 30, OddsAtAnchor = 5, Anchor = 700){ beta <- PDO / log(2) alpha <- Anchor - PDO * OddsAtAnchor # Simple linear scaling of the log odds scr <- alpha - beta * vec # Round off return(round(scr, 0)) } # Find the number of customers that can be approved such that # the cumulative bad rate is <= target smple %>% filter(segment == "KG") %>% ## Filter on known goods only mutate(pred = predict(mdl_pop, newdata = .), ## Generate predictions score = scaling_func(pred), ## Scale output total = n()) %>% arrange(pred) %>% mutate(rn = row_number(), c_bad_rate = cumsum(bad_flag)/rn) %>% filter(c_bad_rate <= 0.02) %>% summarise(approve_count = n(), total = mean(total), score_cutoff = min(score), bad_rate = sum(bad_flag)/n()) %>% mutate(approval_rate = approve_count / total) ## approve_count total score_cutoff bad_rate approval_rate ## 1 818 9178 689 0.01833741 0.08912617
Based on the unsegmented
model, if we choose a score cutoff of >=689
, we should expect a bad rate of ~2%
at an approval rate of ~9%
.
Model on the known good population
set.seed(1234) # Filter on KG sample_kg <- smple %>% filter(segment == "KG") idx <- sample(1:nrow(sample_kg), size = 0.7 * nrow(sample_kg), replace = F) train <- sample_kg[idx,] test <- sample_kg[-idx,] mdl_kg <- glm( formula = form, family = "binomial", data = train ) # Get performance on entire sample auc(test$bad_flag, predict(mdl_kg, test)) ## Setting levels: control = 0, case = 1 ## Setting direction: controls < cases ## Area under the curve: 0.6496
Simulations
To evaluate the through the door population better, let’s run some simulations. Let’s assume that only known good
customers will be assessed through the risk scorecard.
# Function to generate simulations generate_sim <- function(mdl, nSim = 500, sample_size = 5000, cutoff = 650){ # Vectors to store output bad_rates <- c() approval_rates <- c() for(i in 1:nSim){ out <- smple %>% # Filter on KG segment filter(segment == "KG") %>% # Randomly sample with replacement sample_n(sample_size, replace = T) %>% # Generate model output mutate(pred = predict(mdl, newdata = .), score = scaling_func(pred)) %>% filter(score >= cutoff) %>% summarise(bad_rate = sum(bad_flag)/n(), app_rate = n() / sample_size) # Store output bad_rates[i] <- out$bad_rate approval_rates[i] <- out$app_rate } # Plot output par(mfrow = c(1, 2)) plot(density(bad_rates), main = "Event Rate") abline(v = mean(bad_rates)) plot(density(approval_rates), main = "Approval Rate") abline(v = mean(approval_rates)) par(mfrow = c(1, 1)) } # Simulate using the model built on the entire population generate_sim(mdl_pop, cutoff = 689)
Based on the above simulations, when using the model trained on the entire population and using a threshold of >=689
the average simulated event rates and approval rates are close to the expected rates from before. But what if we use the model developed only on the known good population
?
smple %>% filter(segment == "KG") %>% ## Filter on known goods only mutate(pred = predict(mdl_kg, newdata = .), ## Generate predictions score = scaling_func(pred), ## Scale output total = n()) %>% arrange(pred) %>% mutate(rn = row_number(), c_bad_rate = cumsum(bad_flag)/rn) %>% filter(c_bad_rate <= 0.02) %>% summarise(approve_count = n(), total = mean(total), score_cutoff = min(score), bad_rate = sum(bad_flag)/n()) %>% mutate(approval_rate = approve_count / total) ## approve_count total score_cutoff bad_rate approval_rate ## 1 904 9178 696 0.0199115 0.0984964 # Simulate using the model built only on the known good population generate_sim(mdl_kg, cutoff = 696)
When using the known-good model, we can achieve a slightly higher approval rate keeping the event rate more or less the same. While the difference is not significant, a ~1% higher approval rate on a large funnel could be significant (psst. monthly targets anyone?).
While most modelers/data scientists understand this, explaining the need to build segmented models to a business owner can be easier if the impact can be linked to business outcomes 😉.
Thoughts? Comments? Helpful? Not helpful? Like to see anything else added in here? Let me know!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.