[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers.]
I’ve been asked many times whether I have a piece of R code implementing the monotonic binning algorithm, similar to the ones that I developed in SAS (http://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) and in Python (http://statcompute.wordpress.com/2012/12/08/monotonic-binning-with-python). Today, I finally had time to draft a quick prototype in about 20 lines of R code, which is, however, barely usable without further polish. Still, it is a little surprising to me how efficient R can be for algorithm prototyping, and how much sleeker it is than a SAS macro.
library(sas7bdat)
library(Hmisc)
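bin <- function(x, y){
  # Start from at most 50 candidate groups and shrink until the rates are monotonic
  n <- min(50, length(unique(x)))
  repeat {
    n <- n - 1
    if (n < 2) stop("no monotonic binning found")
    # cut2() from Hmisc splits x into n quantile groups
    d1 <- data.frame(x, y, bin = cut2(x, g = n))
    # Mean of x and mean of y (the rate) within each bin
    d2 <- aggregate(d1[-3], d1[3], mean)
    # Spearman correlation between the bin means of x and the bin rates of y
    rho <- cor(d2[-1], method = "spearman")
    # A correlation of exactly +1 or -1 means the rates are monotonic across bins
    if (isTRUE(abs(rho[1, 2]) == 1)) break
  }
  # Drop the bin mean of x and keep only the bin label and the rate
  d2[2] <- NULL
  colnames(d2) <- c('LEVEL', 'RATE')
  # Print a boxed title followed by the bin-level rates
  title <- paste(toupper(substitute(y)), " RATE by ", toupper(substitute(x)), sep = '')
  cat("+-", rep("-", nchar(title)), "-+\n", sep = '')
  cat("| ", title, ' |\n', sep = '')
  cat("+-", rep("-", nchar(title)), "-+\n", sep = '')
  print(d2)
  cat("\n")
}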
data <- read.sas7bdat("C:\\Users\\liuwensui\\Downloads\\accepts.sas7bdat")
attach(data)
bin(bureau_score, bad)
bin(age_oldest_tr, bad)
bin(tot_income, bad)
bin(tot_tr, bad)
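The stopping rule in bin() rests on one fact: a Spearman correlation of exactly +1 or -1 between the bin means of x and the bin rates of y means the rates move in a single direction across the bins. A minimal sketch of that check, using made-up bin summaries (the vectors below are illustrative and are not taken from accepts.sas7bdat):

```r
# Hypothetical bin-level summaries: mean of x per bin and event rate per bin.
x_mean        <- c(10, 20, 30, 40, 50)
rate_monotone <- c(0.40, 0.31, 0.25, 0.18, 0.09)  # strictly decreasing
rate_bumpy    <- c(0.40, 0.25, 0.31, 0.18, 0.09)  # bins 2 and 3 are out of order

rho_monotone <- cor(x_mean, rate_monotone, method = "spearman")
rho_bumpy    <- cor(x_mean, rate_bumpy, method = "spearman")

print(rho_monotone)  # -1: the loop would stop with this binning
print(rho_bumpy)     # -0.9: the loop would retry with one fewer group
```

Because the Spearman coefficient only looks at ranks, the size of the gaps between rates does not matter, only their order.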
R output from the four bin() calls:
+--------------------------+
| BAD RATE by BUREAU_SCORE |
+--------------------------+
       LEVEL       RATE
1  [443,618) 0.44639376
2  [618,643) 0.38446602
3  [643,658) 0.31835938
4  [658,673) 0.23819302
5  [673,686) 0.19838057
6  [686,702) 0.17850288
7  [702,715) 0.14168378
8  [715,731) 0.09815951
9  [731,752) 0.07212476
10 [752,776) 0.05487805
11 [776,848] 0.02605210

+---------------------------+
| BAD RATE by AGE_OLDEST_TR |
+---------------------------+
       LEVEL       RATE
1  [  1, 34) 0.33333333
2  [ 34, 62) 0.30560928
3  [ 62, 87) 0.25145068
4  [ 87,113) 0.23346304
5  [113,130) 0.21616162
6  [130,149) 0.20036101
7  [149,168) 0.19361702
8  [168,198) 0.15530303
9  [198,245) 0.11111111
10 [245,308) 0.10700389
11 [308,588] 0.08730159

+------------------------+
| BAD RATE by TOT_INCOME |
+------------------------+
           LEVEL      RATE
1 [   0,   2570) 0.2498715
2 [2570,   4510) 0.2034068
3 [4510,8147167] 0.1602327

+--------------------+
| BAD RATE by TOT_TR |
+--------------------+
    LEVEL      RATE
1 [ 0,12) 0.2672370
2 [12,22) 0.1827676
3 [22,77] 0.1422764
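For readers without access to accepts.sas7bdat, the same idea can be tried on simulated data. The sketch below is an assumption-laden stand-in rather than the function above: bin_sketch(), the max_bins default of 20, and the simulated score and bad columns are all hypothetical, and base R's quantile() and cut() are swapped in for Hmisc's cut2(), so the bin edges will not match what cut2() would produce.

```r
# Simulated data: a score and a default flag whose probability falls as the score rises.
set.seed(2013)
n_obs <- 5000
score <- round(rnorm(n_obs, mean = 680, sd = 60))
bad   <- rbinom(n_obs, size = 1, prob = plogis(-(score - 650) / 40))

# Hypothetical helper: shrink the number of quantile groups until the
# bin-level rates are monotonic (Spearman correlation of +1 or -1).
bin_sketch <- function(x, y, max_bins = 20) {
  n <- min(max_bins, length(unique(x)))
  while (n >= 2) {
    # Quantile-based cut points; unique() guards against tied quantiles.
    brk <- unique(quantile(x, probs = seq(0, 1, length.out = n + 1)))
    if (length(brk) >= 3) {
      grp    <- cut(x, breaks = brk, include.lowest = TRUE, right = FALSE)
      x_mean <- tapply(x, grp, mean)  # mean of x within each bin
      rate   <- tapply(y, grp, mean)  # event rate within each bin
      rho    <- cor(x_mean, rate, method = "spearman")
      if (isTRUE(all.equal(abs(rho), 1))) {
        return(data.frame(LEVEL = names(rate), RATE = as.vector(rate)))
      }
    }
    n <- n - 1
  }
  stop("no monotonic binning found")
}

result <- bin_sketch(score, bad)
print(result)
```

Unlike bin(), this version returns the table instead of only printing it, which makes the result easier to reuse in later steps such as a WoE transformation.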
To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.
