Improving Binning by Bootstrap Bumping
In a previous post (https://statcompute.wordpress.com/2018/11/23/more-robust-monotonic-binning-based-on-isotonic-regression), a more robust version of monotonic binning based on isotonic regression was introduced. Nonetheless, due to the loss of granularity, the predictability was somewhat compromised, which is a typical dilemma in data science. On one hand, we don't want a learning algorithm that is too greedy and therefore over-fits the data at the cost of simplicity and generality. On the other hand, we'd also like to squeeze the most predictive power out of the data for better business results.
It is worth mentioning that, although there is a consensus that advanced ensemble algorithms can significantly improve prediction outcomes, both bagging and boosting would destroy the simple structure of binning outputs and therefore are not directly applicable in this case.
In light of the above considerations, the bumping (Bootstrap Umbrella of Model Parameters) procedure, detailed in "Model Search and Inference by Bootstrap Bumping" by Tibshirani and Knight (1997), should serve our dual purposes. The idea is straightforward: fit a candidate model to each bootstrap sample, evaluate every candidate on the original data, and keep the single best performer. First of all, since the final binning structure is derived from an isotonic regression on a bootstrap sample rather than on the original training data itself, the concern about over-fitting can be addressed. Secondly, by searching across all bootstrap samples, bumping stands a good chance of landing on a closer-to-optimal solution. Moreover, since the original sample is always included among the candidates, the binning outcome with bumping is guaranteed to be at least as good as the one without.
The R function bump_bin() is my attempt to implement the bumping procedure on top of the monotonic binning function based on isotonic regression. Because binning runs are mutually independent across bootstrap samples, bumping is a perfect use case for parallelism, as demonstrated in the function.
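For illustration, below is a rough sketch of the general shape of such a function, not the actual implementation. It assumes a hypothetical monotonic binning helper iso_bin(data, y, x) that returns its cut points in $cut, and a hypothetical calc_iv() that applies a given set of cut points to a dataset and returns the resulting information value.

# An illustrative sketch only, NOT the actual bump_bin(); iso_bin() and
# calc_iv() are hypothetical stand-ins for the monotonic binning function
# from the earlier post and for an information-value calculator.
bump_bin <- function(data, y, x, n = 20) {
  # bootstrap indices, with the original sample always included as
  # candidate #1 so that bumping can never do worse than no bumping
  boots <- c(list(seq_len(nrow(data))),
             lapply(seq_len(n), function(i) sample(nrow(data), replace = TRUE)))
  # binning runs are mutually independent across bootstrap samples
  # and can therefore be distributed over multiple cores
  cuts <- parallel::mclapply(boots,
                             function(i) iso_bin(data[i, ], y, x)$cut,
                             mc.cores = parallel::detectCores())
  # each candidate set of cut points is evaluated on the ORIGINAL data;
  # the set maximizing the information value wins
  ivs <- sapply(cuts, function(cut) calc_iv(data, y, x, cut))
  cuts[[which.max(ivs)]]
}

Evaluating every candidate on the original sample, rather than on its own bootstrap sample, is what distinguishes bumping from bagging: a single winning structure is kept instead of an averaged ensemble. Note that mclapply() relies on forking and runs sequentially on Windows; parallel::parLapply() with an explicit cluster would be the portable alternative.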
The output below shows the bumping result based on 20 bootstrap samples. There is a small improvement in the information value, e.g. 0.8055 vs. 0.8021 without bumping, achieved with a simpler binning structure, e.g. 12 bins vs. 20 bins, implying that bumping can deliver more predictive power at a lower risk of over-fitting.
   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
 1   <= 565     92      41     51        92         41        51 0.0158   0.4457  0.5543  0.8039 -0.2183 -1.5742 0.0532
 2   <= 620    470     269    201       562        310       252 0.0805   0.5723  0.4277  1.3383  0.2914 -1.0645 0.1172
 3   <= 653    831     531    300      1393        841       552 0.1424   0.6390  0.3610  1.7700  0.5710 -0.7850 0.1071
 4   <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
 5   <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
 6   <= 675    366     290     76      2154       1421       733 0.0627   0.7923  0.2077  3.8158  1.3391 -0.0168 0.0000
 7   <= 699    805     649    156      2959       2070       889 0.1379   0.8062  0.1938  4.1603  1.4256  0.0696 0.0007
 8   <= 707    312     268     44      3271       2338       933 0.0535   0.8590  0.1410  6.0909  1.8068  0.4509 0.0094
 9   <= 716    321     278     43      3592       2616       976 0.0550   0.8660  0.1340  6.4651  1.8664  0.5105 0.0122
10   <= 721    181     159     22      3773       2775       998 0.0310   0.8785  0.1215  7.2273  1.9779  0.6219 0.0099
11   <= 755    851     789     62      4624       3564      1060 0.1458   0.9271  0.0729 12.7258  2.5436  1.1877 0.1403
12    > 755    898     867     31      5522       4431      1091 0.1538   0.9655  0.0345 27.9677  3.3311  1.9751 0.3178
13  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
14    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8055
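As a quick sanity check on how the WoE and IV columns above are derived (these are the standard definitions, not anything specific to bumping), the first bin can be reproduced by hand:

# WoE_i = ln(Odds_i) - ln(total Odds)
log(0.8039) - log(3.8804)            # -1.5742, the WoE of the first bin
# IV_i = (CntGood_i / TotalGood - CntBad_i / TotalBad) * WoE_i
(41 / 4641 - 51 / 1196) * -1.5742    # 0.0532, the IV of the first bin

The overall information value of 0.8055 is simply the sum of the IV column across all bins, including the Missing bin.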
The output below is based on bumping with 200 bootstrap samples. The information value has been improved by about 2%, e.g. 0.8174 vs. 0.8021, again with a lower risk of over-fitting, e.g. 14 bins vs. 20 bins.
   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
 1   <= 559     79      34     45        79         34        45 0.0135   0.4304  0.5696  0.7556 -0.2803 -1.6362 0.0496
 2   <= 633    735     428    307       814        462       352 0.1259   0.5823  0.4177  1.3941  0.3323 -1.0237 0.1684
 3   <= 637     86      53     33       900        515       385 0.0147   0.6163  0.3837  1.6061  0.4738 -0.8822 0.0143
 4   <= 653    493     326    167      1393        841       552 0.0845   0.6613  0.3387  1.9521  0.6689 -0.6870 0.0477
 5   <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
 6   <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
 7   <= 679    504     397    107      2292       1528       764 0.0863   0.7877  0.2123  3.7103  1.3111 -0.0448 0.0002
 8   <= 683    160     129     31      2452       1657       795 0.0274   0.8062  0.1938  4.1613  1.4258  0.0699 0.0001
 9   <= 699    507     413     94      2959       2070       889 0.0869   0.8146  0.1854  4.3936  1.4802  0.1242 0.0013
10   <= 716    633     546     87      3592       2616       976 0.1084   0.8626  0.1374  6.2759  1.8367  0.4808 0.0216
11   <= 722    202     178     24      3794       2794      1000 0.0346   0.8812  0.1188  7.4167  2.0037  0.6478 0.0118
12   <= 746    619     573     46      4413       3367      1046 0.1060   0.9257  0.0743 12.4565  2.5222  1.1663 0.0991
13   <= 761    344     322     22      4757       3689      1068 0.0589   0.9360  0.0640 14.6364  2.6835  1.3276 0.0677
14    > 761    765     742     23      5522       4431      1091 0.1311   0.9699  0.0301 32.2609  3.4739  2.1179 0.2979
15  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
16    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8174