Binning with Weights
After working on the MOB package, I received requests from multiple users asking whether I could write a binning function that takes a weighting scheme into consideration. It is a legitimate request from a practical standpoint. For instance, in the development of fraud detection models, we often down-sample non-fraud cases given the extremely low frequency of fraud instances. After down-sampling, a weight value > 1 should be assigned to all non-fraud cases to reflect the fraud rate in the pre-sample data.
While accommodating the request for weighting cases is trivial, I’d like to run a simple experiment showing what impact weighting might have.
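As a rough illustration of how such a weight could be derived, the sketch below uses made-up counts (every number in it is hypothetical and not taken from any real portfolio). It assumes all fraud cases are kept and non-fraud cases are sampled at random, so each sampled non-fraud case stands in for a fixed number of original ones.

```r
# Hypothetical counts, chosen only for illustration
n_fraud        <- 1000     # fraud cases, all kept in the sample
n_nonfraud_all <- 999000   # non-fraud cases before down-sampling
n_nonfraud_smp <- 9000     # non-fraud cases kept after down-sampling

# Each sampled non-fraud case represents this many original cases
w_nonfraud <- n_nonfraud_all / n_nonfraud_smp
w_fraud    <- 1

# Fraud rate before sampling, after sampling, and after re-weighting
rate_population <- n_fraud / (n_fraud + n_nonfraud_all)
rate_sample     <- n_fraud / (n_fraud + n_nonfraud_smp)
rate_weighted   <- (w_fraud * n_fraud) /
  (w_fraud * n_fraud + w_nonfraud * n_nonfraud_smp)

print(c(population = rate_population,
        sample     = rate_sample,
        weighted   = rate_weighted))
```

With these assumed counts the weighted fraud rate matches the pre-sample rate, while the raw rate in the sample is much higher.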
– First of all, let’s apply the monotonic binning to a variable named “tot_derog”. In this unweighted binning output, KS = 18.95, IV = 0.21, and WoE values range from -0.38 to 0.64.
– In the first trial, a weight of 5 is assigned to cases with Y = 0 and a weight of 1 to cases with Y = 1, which corresponds to the call wts_bin(derog_bin$df, c(5, 1)). As expected, the frequency, distribution, bad frequency, and bad rate all change. However, KS, IV, and WoE remain identical.
– In the second trial, a weight of 1 is assigned to cases with Y = 0 and a weight of 5 to cases with Y = 1, which corresponds to the call wts_bin(derog_bin$df, c(1, 5)). Once again, KS, IV, and WoE are the same as in the unweighted output.
The conclusion from this demonstration is clear. When a two-value weight is assigned according to the binary Y, the variable importance reflected by IV and KS, as well as the WoE values, remains identical with or without weights, because a weight that is constant within each class cancels out of the within-class distributions on which these statistics are built. However, if you are concerned about the binning distribution and the bad rate in each bin, the function wts_bin() makes the correction and is available in the project repository (https://github.com/statcompute/MonotonicBinning).
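To see why a class-level weight leaves these statistics unchanged, the short derivation below may help. The notation is introduced here for illustration and is not from the package: $b_i$ and $g_i$ are the unweighted counts of Y = 1 and Y = 0 cases in bin $i$, $B$ and $G$ are their totals, and $w_1$ and $w_0$ are the two weights. The sign convention for WoE follows the output in this post, where bins with a higher bad rate have a positive WoE.

```latex
\begin{align*}
\mathrm{WoE}_i^{(w)}
  &= \ln\frac{w_1 b_i / (w_1 B)}{w_0 g_i / (w_0 G)}
   = \ln\frac{b_i / B}{g_i / G}
   = \mathrm{WoE}_i \\
\mathrm{IV}^{(w)}
  &= \sum_i \left( \frac{w_1 b_i}{w_1 B} - \frac{w_0 g_i}{w_0 G} \right) \mathrm{WoE}_i^{(w)}
   = \sum_i \left( \frac{b_i}{B} - \frac{g_i}{G} \right) \mathrm{WoE}_i
   = \mathrm{IV} \\
\mathrm{KS}_k^{(w)}
  &= 100 \left| \sum_{i \le k} \frac{w_1 b_i}{w_1 B} - \sum_{i \le k} \frac{w_0 g_i}{w_0 G} \right|
   = \mathrm{KS}_k \\
\text{bad rate}_i^{(w)}
  &= \frac{w_1 b_i}{w_1 b_i + w_0 g_i}
   \ne \frac{b_i}{b_i + g_i} \quad \text{unless } w_1 = w_0
\end{align*}
```

The bad rate mixes the two classes in a single ratio, so the weights do not cancel there, which is why the frequency-based columns move while WoE, IV, and KS do not.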
# Unweighted monotonic binning of tot_derog against the binary outcome bad
derog_bin <- qtl_bin(df, bad, tot_derog)
derog_bin
# $df
# bin rule              freq   dist mv_cnt bad_freq bad_rate     woe     iv      ks
# 00  is.na($X)          213 0.0365    213       70   0.3286  0.6416 0.0178  2.7716
# 01  $X <= 1           3741 0.6409      0      560   0.1497 -0.3811 0.0828 18.9469
# 02  $X > 1 & $X <= 2   478 0.0819      0      121   0.2531  0.2740 0.0066 16.5222
# 03  $X > 2 & $X <= 4   587 0.1006      0      176   0.2998  0.5078 0.0298 10.6623
# 04  $X > 4             818 0.1401      0      269   0.3289  0.6426 0.0685  0.0000
# $cuts
# [1] 1 2 4

# Weight = 1 for cases with Y = 0 and weight = 5 for cases with Y = 1
wts_bin(derog_bin$df, c(1, 5))
# bin rule             wt_freq wt_dist wt_bads wt_badrate  wt_woe  wt_iv   wt_ks
# 00  is.na($X)            493  0.0464     350     0.7099  0.6416 0.0178  2.7716
# 01  $X <= 1             5981  0.5631    2800     0.4681 -0.3811 0.0828 18.9469
# 02  $X > 1 & $X <= 2     962  0.0906     605     0.6289  0.2740 0.0066 16.5222
# 03  $X > 2 & $X <= 4    1291  0.1216     880     0.6816  0.5078 0.0298 10.6623
# 04  $X > 4              1894  0.1783    1345     0.7101  0.6426 0.0685  0.0000

# Weight = 5 for cases with Y = 0 and weight = 1 for cases with Y = 1
wts_bin(derog_bin$df, c(5, 1))
# bin rule             wt_freq wt_dist wt_bads wt_badrate  wt_woe  wt_iv   wt_ks
# 00  is.na($X)            785  0.0322      70     0.0892  0.6416 0.0178  2.7716
# 01  $X <= 1            16465  0.6748     560     0.0340 -0.3811 0.0828 18.9469
# 02  $X > 1 & $X <= 2    1906  0.0781     121     0.0635  0.2740 0.0066 16.5222
# 03  $X > 2 & $X <= 4    2231  0.0914     176     0.0789  0.5078 0.0298 10.6623
# 04  $X > 4              3014  0.1235     269     0.0893  0.6426 0.0685  0.0000
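For readers who want to check the weighted columns by hand, here is a minimal base-R sketch that recomputes them from the unweighted freq and bad_freq counts shown above. The helper reweight_bins() is a hypothetical name made up for this illustration; it is not the source of wts_bin() and only assumes that the first weight applies to Y = 0 and the second to Y = 1, as the outputs above suggest.

```r
# Unweighted bin counts copied from the qtl_bin() output above
bins <- data.frame(
  bin      = c("00", "01", "02", "03", "04"),
  freq     = c(213, 3741, 478, 587, 818),
  bad_freq = c(70, 560, 121, 176, 269)
)

# Hypothetical helper, NOT the actual wts_bin() implementation.
# w[1] is the weight for Y = 0 cases, w[2] the weight for Y = 1 cases.
reweight_bins <- function(d, w) {
  good <- (d$freq - d$bad_freq) * w[1]  # weighted Y = 0 count per bin
  bad  <- d$bad_freq * w[2]             # weighted Y = 1 count per bin
  pg   <- good / sum(good)              # share of all Y = 0 cases in each bin
  pb   <- bad / sum(bad)                # share of all Y = 1 cases in each bin
  woe  <- log(pb / pg)
  data.frame(
    bin        = d$bin,
    wt_freq    = good + bad,
    wt_dist    = round((good + bad) / sum(good + bad), 4),
    wt_bads    = bad,
    wt_badrate = round(bad / (good + bad), 4),
    wt_woe     = round(woe, 4),
    wt_iv      = round((pb - pg) * woe, 4),
    wt_ks      = round(abs(cumsum(pb) - cumsum(pg)) * 100, 4)
  )
}

unweighted <- reweight_bins(bins, c(1, 1))
bads_x5    <- reweight_bins(bins, c(1, 5))
goods_x5   <- reweight_bins(bins, c(5, 1))

print(bads_x5)
print(goods_x5)
```

Comparing the wt_woe, wt_iv, and wt_ks columns across the three data frames is a quick way to confirm that only the frequency-based columns respond to the weights.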