A Prototype of Monotonic Binning Algorithm with R
[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers.]
I’ve been asked many times whether I have a piece of R code implementing the monotonic binning algorithm, similar to the ones that I developed with SAS (http://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) and with Python (http://statcompute.wordpress.com/2012/12/08/monotonic-binning-with-python). Today, I finally had time to draft a quick prototype with 20 lines of R code, although it is barely usable without further polish. The idea is to start from a fine quantile binning of the predictor and keep reducing the number of bins until the bin-level means of the predictor and of the outcome are perfectly rank-correlated, i.e. the Spearman correlation is exactly 1 or -1. It is still a little surprising to me how efficient R can be for algorithm prototyping; the result is much sleeker than a SAS macro.
library(sas7bdat)
library(Hmisc)

bin <- function(x, y){
  # n is decremented before each attempt, so the first try uses at most 49 groups
  n <- min(50, length(unique(x)))
  repeat {
    n <- n - 1
    # cut x into n quantile groups and average x and y within each group
    d1 <- data.frame(x, y, bin = cut2(x, g = n))
    d2 <- aggregate(d1[-3], d1[3], mean)
    # stop once the bin means of x and y are perfectly rank-correlated
    cor <- cor(d2[-1], method = "spearman")
    if(abs(cor[1, 2]) == 1) break
  }
  # drop the mean of x; keep the bin label and the mean of y
  d2[2] <- NULL
  colnames(d2) <- c('LEVEL', 'RATE')
  # print a boxed header followed by the binning table
  head <- paste(toupper(substitute(y)), " RATE by ", toupper(substitute(x)), sep = '')
  cat("+-", rep("-", nchar(head)), "-+\n", sep = '')
  cat("| ", head, ' |\n', sep = '')
  cat("+-", rep("-", nchar(head)), "-+\n", sep = '')
  print(d2)
  cat("\n")
}

data <- read.sas7bdat("C:\\Users\\liuwensui\\Downloads\\accepts.sas7bdat")
attach(data)

bin(bureau_score, bad)
bin(age_oldest_tr, bad)
bin(tot_income, bad)
bin(tot_tr, bad)
R output:
+--------------------------+
| BAD RATE by BUREAU_SCORE |
+--------------------------+
       LEVEL       RATE
1  [443,618) 0.44639376
2  [618,643) 0.38446602
3  [643,658) 0.31835938
4  [658,673) 0.23819302
5  [673,686) 0.19838057
6  [686,702) 0.17850288
7  [702,715) 0.14168378
8  [715,731) 0.09815951
9  [731,752) 0.07212476
10 [752,776) 0.05487805
11 [776,848] 0.02605210

+---------------------------+
| BAD RATE by AGE_OLDEST_TR |
+---------------------------+
       LEVEL       RATE
1  [  1, 34) 0.33333333
2  [ 34, 62) 0.30560928
3  [ 62, 87) 0.25145068
4  [ 87,113) 0.23346304
5  [113,130) 0.21616162
6  [130,149) 0.20036101
7  [149,168) 0.19361702
8  [168,198) 0.15530303
9  [198,245) 0.11111111
10 [245,308) 0.10700389
11 [308,588] 0.08730159

+------------------------+
| BAD RATE by TOT_INCOME |
+------------------------+
           LEVEL      RATE
1 [   0,   2570) 0.2498715
2 [2570,   4510) 0.2034068
3 [4510,8147167] 0.1602327

+--------------------+
| BAD RATE by TOT_TR |
+--------------------+
    LEVEL      RATE
1 [ 0,12) 0.2672370
2 [12,22) 0.1827676
3 [22,77] 0.1422764
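As one possible direction for the "further polish" mentioned above, here is a minimal sketch of the same idea in base R only. It is not the code from this post: `mono_bin()` and `max_bins` are hypothetical names, `quantile()` plus `cut()` stand in for `Hmisc::cut2()` (so the bin edges can differ slightly), and the Spearman test is replaced by a direct check that the bin rates strictly increase or strictly decrease, which avoids comparing a floating-point correlation to exactly 1. The sketch assumes `x` and `y` contain no missing values, and it returns `NULL` instead of looping forever when no monotonic binning with at least two bins exists.

```r
# Sketch only: base-R variant of the binning loop with a termination guard.
# mono_bin() and max_bins are hypothetical names, not part of the original post.
mono_bin <- function(x, y, max_bins = 20) {
  n <- min(max_bins, length(unique(x)))
  while (n >= 2) {
    # quantile-based cut points; unique() drops duplicates caused by ties
    brk <- unique(quantile(x, probs = seq(0, 1, length.out = n + 1), names = FALSE))
    if (length(brk) >= 3) {
      # left-closed bins; include.lowest = TRUE keeps the maximum in the last bin
      b <- cut(x, breaks = brk, include.lowest = TRUE, right = FALSE)
      # aggregate() returns one row per non-empty bin, ordered by bin level
      d <- aggregate(data.frame(x = x, y = y), list(LEVEL = b), mean)
      step <- diff(d$y)
      # accept the binning only if the rate is strictly monotonic across bins
      if (nrow(d) >= 2 && (all(step > 0) || all(step < 0))) {
        return(data.frame(LEVEL = d$LEVEL, RATE = d$y))
      }
    }
    n <- n - 1
  }
  NULL  # no monotonic binning with at least two bins was found
}

# Usage on simulated data (made up for illustration, not the accepts data set)
set.seed(1)
score <- round(runif(2000, 450, 850))
bad <- rbinom(2000, 1, plogis((650 - score) / 60))
print(mono_bin(score, bad))
```

Because the bins are ordered by `x`, a strictly monotonic sequence of rates corresponds to the Spearman correlation of 1 or -1 that the prototype tests for, so the stopping rule is intended to be equivalent.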