Monotonic Binning with GBM
In addition to the monotonic binning algorithms introduced in my previous post (https://statcompute.wordpress.com/2019/03/10/a-summary-of-my-home-brew-binning-algorithms-for-scorecard-development), two more functions based on Generalized Boosted Regression Models (GBM) have been added to my GitHub repository: gbm_bin() and gbmcv_bin().
The function gbm_bin() estimates a GBM model without cross-validation and therefore tends to generate a more granular binning outcome.
gbm_bin(df, bad, tot_derog)
# $df
#  bin              rule freq   dist mv_cnt bad_freq bad_rate     woe     iv      ks
#   00         is.na($X)  213 0.0365    213       70   0.3286  0.6416 0.0178  2.7716
#   01           $X <= 1 3741 0.6409      0      560   0.1497 -0.3811 0.0828 18.9469
#   02  $X > 1 & $X <= 2  478 0.0819      0      121   0.2531  0.2740 0.0066 16.5222
#   03  $X > 2 & $X <= 3  332 0.0569      0       86   0.2590  0.3050 0.0058 14.6321
#   04  $X > 3 & $X <= 9  848 0.1453      0      282   0.3325  0.6593 0.0750  3.2492
#   05            $X > 9  225 0.0385      0       77   0.3422  0.7025 0.0228  0.0000
# $cuts
# [1] 1 2 3 9
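The woe and iv columns follow the standard scorecard definitions: a bin's WoE is the log ratio of its share of bads to its share of goods, and its IV contribution is the difference between those two shares times the WoE. As a quick sketch (in Python rather than R, using only the bin counts copied from the gbm_bin() output above), the two columns can be reproduced from freq and bad_freq alone:

```python
import math

# Bin-level counts copied from the gbm_bin() output above.
freq = [213, 3741, 478, 332, 848, 225]   # records per bin
bad  = [70, 560, 121, 86, 282, 77]       # bads per bin

tot_bad = sum(bad)                 # 1196
tot_good = sum(freq) - tot_bad     # 4641

woes, ivs = [], []
for f, b in zip(freq, bad):
    good = f - b
    # WoE: log ratio of the bin's bad share to its good share
    w = math.log((b / tot_bad) / (good / tot_good))
    woes.append(round(w, 4))
    # IV contribution of the bin
    ivs.append(round((b / tot_bad - good / tot_good) * w, 4))

print(woes)  # [0.6416, -0.3811, 0.274, 0.305, 0.6593, 0.7025]
print(ivs)   # [0.0178, 0.0828, 0.0066, 0.0058, 0.075, 0.0228]
```

Note that the WoE values increase monotonically across the numeric bins, which is exactly the property the binning algorithms are designed to enforce.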
The function gbmcv_bin() estimates a GBM model with cross-validation (CV) and therefore generates a more stable but coarser binning outcome. However, the computation is also more expensive due to the CV, especially for large datasets.
gbmcv_bin(df, bad, tot_derog)
### OUTPUT ###
# $df
#  bin              rule freq   dist mv_cnt bad_freq bad_rate     woe     iv      ks
#   00         is.na($X)  213 0.0365    213       70   0.3286  0.6416 0.0178  2.7716
#   01           $X <= 1 3741 0.6409      0      560   0.1497 -0.3811 0.0828 18.9469
#   02  $X > 1 & $X <= 2  478 0.0819      0      121   0.2531  0.2740 0.0066 16.5222
#   03            $X > 2 1405 0.2407      0      445   0.3167  0.5871 0.0970  0.0000
# $cuts
# [1] 1 2
Motivated by an idea from my friend Talbot (https://www.linkedin.com/in/talbot-katz-b76785), I also drafted a function pava_bin() based upon the Pool Adjacent Violators Algorithm (PAVA) and compared it with the iso_bin() function based on isotonic regression. As shown in the comparison below, there is no difference in the binning outcome. However, the computing cost of the pava_bin() function is higher, given that PAVA is an iterative algorithm that solves for monotonicity.
pava_bin(df, bad, tot_derog)$df
#  bin               rule freq   dist mv_cnt bad_freq bad_rate     woe     iv      ks
#   00          is.na($X)  213 0.0365    213       70   0.3286  0.6416 0.0178  2.7716
#   01            $X <= 1 3741 0.6409      0      560   0.1497 -0.3811 0.0828 18.9469
#   02   $X > 1 & $X <= 2  478 0.0819      0      121   0.2531  0.2740 0.0066 16.5222
#   03   $X > 2 & $X <= 3  332 0.0569      0       86   0.2590  0.3050 0.0058 14.6321
#   04  $X > 3 & $X <= 23 1064 0.1823      0      353   0.3318  0.6557 0.0931  0.4370
#   05            $X > 23    9 0.0015      0        6   0.6667  2.0491 0.0090  0.0000
iso_bin(df, bad, tot_derog)$df
#  bin               rule freq   dist mv_cnt bad_freq bad_rate     woe     iv      ks
#   00          is.na($X)  213 0.0365    213       70   0.3286  0.6416 0.0178  2.7716
#   01            $X <= 1 3741 0.6409      0      560   0.1497 -0.3811 0.0828 18.9469
#   02   $X > 1 & $X <= 2  478 0.0819      0      121   0.2531  0.2740 0.0066 16.5222
#   03   $X > 2 & $X <= 3  332 0.0569      0       86   0.2590  0.3050 0.0058 14.6321
#   04  $X > 3 & $X <= 23 1064 0.1823      0      353   0.3318  0.6557 0.0931  0.4370
#   05            $X > 23    9 0.0015      0        6   0.6667  2.0491 0.0090  0.0000
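The identical outcomes are no coincidence: the least-squares monotone fit that isotonic regression produces is exactly what PAVA computes, so both paths lead to the same step function and hence the same cuts. For readers unfamiliar with the algorithm, here is a minimal pure-Python sketch of the pooling step (illustrative only, not the author's pava_bin()):

```python
def pava(y, w=None):
    """Pool Adjacent Violators: least-squares nondecreasing fit to y."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block: [weighted mean, total weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # pool backwards while the last two blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    # expand each block's pooled mean back to its member positions
    out = []
    for m, _, n in blocks:
        out.extend([m] * n)
    return out

print(pava([1, 3, 2, 4]))  # [1, 2.5, 2.5, 4]
```

Applied to per-value bad rates sorted by the predictor, the boundaries between pooled blocks are the monotone bin cuts.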