Monotonic Binning with Smbinning Package

statcompute

5 years ago

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The R package smbinning (http://www.scoringmodeling.com/rpackage/smbinning) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in the scorecard development. However, there are several improvement opportunities in my view:

1. First of all, the underlying algorithm in the smbinning() function utilizes the recursive partitioning, which does not necessarily guarantee the monotonicity.
2. Secondly, the density in each generated bin is not even. The frequency in some bins could be much higher than the one in others.
3. At last, the function might not provide the binning outcome for some variables due to the lack of statistical significance.

In light of the above, I wrote an enhanced version by utilizing the smbinning.custom() function, shown as below. The idea is very simple. Within the repeat loop, we would bin the variable iteratively until a certain criterion is met and then feed the list of cut points into the smbinning.custom() function. As a result, we are able to achieve a set of monotonic bins with similar frequencies regardless of the so-called “statistical significance”, which is a premature step for the variable transformation in my mind.

monobin <- function(data, y, x) {
  d1 <- data[c(y, x)]
  n <- min(20, nrow(unique(d1[x])))
  repeat {
    d1$bin <- Hmisc::cut2(d1[, x], g = n)
    d2 <- aggregate(d1[-3], d1[3], mean)
    c <- cor(d2[-1], method = "spearman")
    if(abs(c[1, 2]) == 1 | n == 2) break
    n <- n - 1
  }
  d3 <- aggregate(d1[-3], d1[3], max)
  cuts <- d3[-length(d3[, 3]), 3]
  return(smbinning::smbinning.custom(d1, y, x, cuts))
}

Below are a couple comparisons between the generic smbinning() and the home-brew monobin() functions with the use of a toy data.

In the first example, we applied the smbinning() function to a variable named “rev_util”. As shown in the highlighted rows in the column “BadRate”, the binning outcome is not monotonic.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 0    965     716    249       965        716       249 0.1653   0.7420  0.2580  2.8755 1.0562 -0.2997 0.0162
2     <= 5    522     496     26      1487       1212       275 0.0894   0.9502  0.0498 19.0769 2.9485  1.5925 0.1356
3    <= 24   1166    1027    139      2653       2239       414 0.1998   0.8808  0.1192  7.3885 1.9999  0.6440 0.0677
4    <= 40    779     651    128      3432       2890       542 0.1335   0.8357  0.1643  5.0859 1.6265  0.2705 0.0090
5    <= 73   1188     932    256      4620       3822       798 0.2035   0.7845  0.2155  3.6406 1.2922 -0.0638 0.0008
6    <= 96    684     482    202      5304       4304      1000 0.1172   0.7047  0.2953  2.3861 0.8697 -0.4863 0.0316
7     > 96    533     337    196      5837       4641      1196 0.0913   0.6323  0.3677  1.7194 0.5420 -0.8140 0.0743
8  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN     NaN    NaN     NaN    NaN
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.3352

Next, we did the same with the monobin() function. As shown below, the algorithm provided a monotonic binning at the cost of granularity. Albeit coarse, the result is directionally correct with no inversion.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
1    <= 30   2962    2495    467      2962       2495       467 0.5075   0.8423  0.1577 5.3426 1.6757  0.3198 0.0471
2     > 30   2875    2146    729      5837       4641      1196 0.4925   0.7464  0.2536 2.9438 1.0797 -0.2763 0.0407
3  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN    NaN    NaN     NaN    NaN
4    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049 3.8804 1.3559  0.0000 0.0878

In the second example, we applied the smbinning() function to a variable named “bureau_score”. As shown in the highlighted rows, the frequencies in these two bins are much higher than the rest.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1   <= 605    324     167    157       324        167       157 0.0555   0.5154  0.4846  1.0637 0.0617 -1.2942 0.1233
2   <= 632    468     279    189       792        446       346 0.0802   0.5962  0.4038  1.4762 0.3895 -0.9665 0.0946
3   <= 662    896     608    288      1688       1054       634 0.1535   0.6786  0.3214  2.1111 0.7472 -0.6087 0.0668
4   <= 699   1271    1016    255      2959       2070       889 0.2177   0.7994  0.2006  3.9843 1.3824  0.0264 0.0002
5   <= 717    680     586     94      3639       2656       983 0.1165   0.8618  0.1382  6.2340 1.8300  0.4741 0.0226
6   <= 761   1118    1033     85      4757       3689      1068 0.1915   0.9240  0.0760 12.1529 2.4976  1.1416 0.1730
7    > 761    765     742     23      5522       4431      1091 0.1311   0.9699  0.0301 32.2609 3.4739  2.1179 0.2979
8  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.8066

With the monobin() function applied to the same variable, we were able to get a set of more granular bins with similar frequencies.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <= 617    513     284    229       513        284       229 0.0879   0.5536  0.4464  1.2402 0.2153 -1.1407 0.1486
2    <= 642    515     317    198      1028        601       427 0.0882   0.6155  0.3845  1.6010 0.4706 -0.8853 0.0861
3    <= 657    512     349    163      1540        950       590 0.0877   0.6816  0.3184  2.1411 0.7613 -0.5946 0.0363
4    <= 672    487     371    116      2027       1321       706 0.0834   0.7618  0.2382  3.1983 1.1626 -0.1933 0.0033
5    <= 685    494     396     98      2521       1717       804 0.0846   0.8016  0.1984  4.0408 1.3964  0.0405 0.0001
6    <= 701    521     428     93      3042       2145       897 0.0893   0.8215  0.1785  4.6022 1.5265  0.1706 0.0025
7    <= 714    487     418     69      3529       2563       966 0.0834   0.8583  0.1417  6.0580 1.8014  0.4454 0.0144
8    <= 730    489     441     48      4018       3004      1014 0.0838   0.9018  0.0982  9.1875 2.2178  0.8619 0.0473
9    <= 751    513     476     37      4531       3480      1051 0.0879   0.9279  0.0721 12.8649 2.5545  1.1986 0.0859
10   <= 775    492     465     27      5023       3945      1078 0.0843   0.9451  0.0549 17.2222 2.8462  1.4903 0.1157
11    > 775    499     486     13      5522       4431      1091 0.0855   0.9739  0.0261 37.3846 3.6213  2.2653 0.2126
12  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
13    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7810

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.