R Credit Scoring – WoE & Information Value in woe Package

Posted on July 23, 2013 by Tomáš Greif in R bloggers

[This article was first published on R (en) - Analytik dat, and kindly contributed to R-bloggers.]

In credit scoring, Information Value (IV) is frequently used to compare the predictive power of variables. When developing new scorecards using logistic regression, variables are often binned and recoded using the Weight of Evidence (WoE) concept. The riv package helps you assess the predictive power of variables, inspect WoE patterns, and recode raw variables to WoE.

Introduction

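I assume that the reader has some basic experience in credit scoring. One of our goals when binning variables is to maximize Information Value. Weight of Evidence (WoE) for a single bin $i$ is defined as:

$$\mathrm{WoE}_i = \ln\left(\frac{\%\mathrm{Good}_i}{\%\mathrm{Bad}_i}\right)$$

where $\%\mathrm{Good}_i$ is the share of all good observations that fall into bin $i$, and $\%\mathrm{Bad}_i$ is the corresponding share of all bad observations. Information Value for a variable is defined as:

$$\mathrm{IV} = \sum_{i=1}^{n} \left(\%\mathrm{Good}_i - \%\mathrm{Bad}_i\right) \cdot \mathrm{WoE}_i$$

where n is the number of bins of the variable.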

To illustrate the concept, here is an example for the variable age from the German credit scoring dataset:

Class	Good	Bad	%Good	%Bad	Odds	WoE	MIV
(;25.5)	110	80	15.7%	26.7%	0.59	-0.53	0.06
<25.5;27.5)	74	27	10.6%	9.0%	1.17	0.16	0.00
<27.5;34.5)	172	85	24.6%	28.3%	0.87	-0.14	0.01
<34.5;38.5)	108	24	15.4%	8.0%	1.93	0.66	0.05
<38.5;)	236	84	33.7%	28.0%	1.20	0.19	0.01
						IV:	0.13

The total Information Value of 0.13 indicates medium predictive power.
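
The figures in the table can be reproduced with a few lines of base R. This is a minimal sketch that uses only the Good and Bad counts shown above; the variable names are illustrative and the code is not taken from the riv package.

```r
# Good/bad counts per age bin, copied from the table above
good <- c(110, 74, 172, 108, 236)
bad  <- c(80, 27, 85, 24, 84)

pct_good <- good / sum(good)   # share of all goods falling into each bin
pct_bad  <- bad / sum(bad)     # share of all bads falling into each bin

woe <- log(pct_good / pct_bad)       # Weight of Evidence per bin
miv <- (pct_good - pct_bad) * woe    # contribution of each bin to IV
iv  <- sum(miv)                      # Information Value of the variable

round(data.frame(good, bad, pct_good, pct_bad, woe, miv), 2)
round(iv, 2)   # 0.13
```

Bins where goods are over-represented get a positive WoE, bins where bads are over-represented get a negative one, and every bin contributes a non-negative amount to IV.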

riv Package

The riv package helps you analyze WoE patterns and Information Value for a whole modeling dataset. Its main features are:

calculate Information Value for variable(s)
recode original variables to WoE
plot WoE patterns for variable(s)
plot Information Value for variable(s)

One of the best features of the riv package is automated binning of numeric variables. It uses the rpart package and lets the user pass specific rpart.control() values. The “German Credit Data” dataset, which ships with the package, is used for testing.
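
To give an idea of how a tree can produce bins, here is a rough sketch of tree-based binning with rpart on synthetic data. It is not necessarily how riv implements its binning; the data, the variable names and the way cut points are read from the fitted tree are all assumptions made for illustration.

```r
library(rpart)

# Synthetic data (illustrative only): the bad rate rises with x in three steps
set.seed(1)
x   <- runif(1000, min = 0, max = 60)
bad <- rbinom(1000, size = 1, prob = ifelse(x < 20, 0.1, ifelse(x < 40, 0.4, 0.8)))
d   <- data.frame(x = x, y = factor(bad, levels = c(0, 1), labels = c("good", "bad")))

# Fit a classification tree on the single numeric predictor.
# maxcompete/maxsurrogate = 0 keeps only primary splits in fit$splits.
fit <- rpart(y ~ x, data = d, method = "class",
             control = rpart.control(cp = 0.01, maxcompete = 0, maxsurrogate = 0))

# Split points of the tree become bin boundaries (none if the tree has no split)
cuts <- if (is.null(fit$splits)) numeric(0) else sort(unique(fit$splits[, "index"]))
d$bin <- cut(d$x, breaks = c(-Inf, cuts, Inf))

# Good/bad counts per bin, the input needed for WoE
table(d$bin, d$y)
```

Lowering cp in rpart.control() allows the tree to grow more splits, which yields more and narrower bins.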

Install package

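riv is hosted on GitHub, and I prefer to use devtools for installation:

library(devtools)
# current devtools versions expect the repository as "user/repo"
install_github("tomasgreif/riv")
# the package itself is loaded under the name woe
library(woe)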

Calculate Information Value

We can use the function iv.mult() to calculate Information Value for all variables in a data frame:

iv.mult(german_data,"gb",TRUE)

This will print the following table:

                    Variable InformationValue Bins ZeroBins    Strength
1                  ca_status      0.666011503    4        0 Very strong
2             credit_history      0.293233547    5        0      Strong
3                   duration      0.259146834    5        0      Strong
4              credit_amount      0.207970035    5        0      Strong
5                    savings      0.196009557    5        0     Average
6                    purpose      0.169195066   10        0     Average
7                        age      0.125210683    5        0     Average
8                   property      0.112638262    4        0     Average
9   present_employment_since      0.086433631    5        0        Weak
10                   housing      0.083293434    3        0        Weak
11         other_installment      0.057614542    3        0        Weak
12                status_sex      0.044670678    5        1        Weak
13            foreign_worker      0.043877412    2        0        Weak
14             other_debtors      0.032019322    3        0        Weak
15   installment_rate_income      0.023858552    2        0        Weak
16          existing_credits      0.010083557    2        0   Wery weak
17                       job      0.008762766    4        0   Wery weak
18                 telephone      0.006377605    2        0   Wery weak
19 liable_maintenance_people      0.000000000    1        0   Wery weak
20   present_residence_since      0.000000000    1        0   Wery weak

The output has five columns: variable name, Information Value, number of bins, number of bins where the count of either good or bad observations is zero, and an overall assessment of predictive strength. The variables duration, credit_amount and age are numeric, so riv fitted an rpart model to each of them to find their binning.
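
The Strength column appears to follow fixed IV cut-offs. The thresholds in the sketch below (0.02, 0.1, 0.2 and 0.5) are an assumption inferred from the printed output, not taken from the package source, and iv_strength() is a hypothetical helper; with these cut-offs the labels match those shown for the listed variables.

```r
# Assumed cut-offs, inferred from the printed table (not from the package source)
iv_strength <- function(iv) {
  cut(iv,
      breaks = c(-Inf, 0.02, 0.1, 0.2, 0.5, Inf),
      labels = c("Very weak", "Weak", "Average", "Strong", "Very strong"))
}

# A few IV values copied from the table above
iv <- c(ca_status = 0.666, credit_amount = 0.208, savings = 0.196,
        housing = 0.083, installment_rate_income = 0.024, job = 0.009)
data.frame(iv, strength = iv_strength(iv))
```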

Plot Results

There is a simple function, iv.plot.summary(), that we will use to plot the results of iv.mult():

iv.plot.summary(iv.mult(german_data,"gb",TRUE))

This produces a plot of Information Value by variable (figure not reproduced).

Analyze individual variables

In scorecard development it is important for WoE to have a logical trend across bins. With riv you can analyze WoE patterns for one or more variables. If you need only specific variables, you can use the vars parameter:

options(digits=2)
iv.mult(german_data,"gb",vars=c("housing","duration"))

This results in:

[[1]]
  variable    class outcome_0 outcome_1 pct_1 pct_0 odds   woe   miv
1  housing     rent       109        70  0.23 0.156 0.67 -0.40 0.031
2  housing      own       527       186  0.62 0.753 1.21  0.19 0.026
3  housing for free        64        44  0.15 0.091 0.62 -0.47 0.026
[[2]]
  variable       class outcome_0 outcome_1 pct_1 pct_0 odds    woe     miv
1 duration     (;11.5)       153        27  0.09  0.22 2.43  0.887 1.1e-01
2 duration <11.5;15.5)       189        62  0.21  0.27 1.31  0.267 1.7e-02
3 duration   <15.5;19)        72        43  0.14  0.10 0.72 -0.332 1.3e-02
4 duration   <19;34.5)       198        86  0.29  0.28 0.99 -0.013 5.1e-05
5 duration     <34.5;)        88        82  0.27  0.13 0.46 -0.777 1.1e-01

Column descriptions:

variable - variable name
class - name of bin (interval from rpart tree for numeric variables, variable value otherwise)
outcome_0 - number of good observations
outcome_1 - number of bad observations
pct_1 - bad observations in bin / total bad observations
pct_0 - good observations in bin / total good observations
odds - pct_0/pct_1
woe - Weight of Evidence - calculated as ln(odds)
miv - Marginal Information Value - calculated as ln(odds) * (pct_0 - pct_1)

You can also plot WoE patterns with the iv.plot.woe() function:

iv.plot.woe(iv.mult(german_data,"gb",vars=c("housing","duration"),summary=FALSE))

Control rpart parameters

For numeric variables you can pass your own rpart.control(). I will illustrate this for the variable duration and the complexity parameter cp:

iv.num(german_data,"duration","gb",rcontrol=rpart.control(cp=.02))
iv.num(german_data,"duration","gb",rcontrol=rpart.control(cp=.005))
iv.num(german_data,"duration","gb",rcontrol=rpart.control(cp=.001))

These are the results of the previous commands. Note how the number of leaves increases as cp decreases:

  variable   class outcome_0 outcome_1 pct_1 pct_0 odds   woe   miv
1 duration (;34.5)       612       218  0.73  0.87 1.20  0.18 0.027
2 duration <34.5;)        88        82  0.27  0.13 0.46 -0.78 0.115

  variable       class outcome_0 outcome_1 pct_1 pct_0 odds    woe     miv
1 duration     (;11.5)       153        27  0.09  0.22 2.43  0.887 0.11408
2 duration <11.5;34.5)       459       191  0.64  0.66 1.03  0.029 0.00056
3 duration     <34.5;)        88        82  0.27  0.13 0.46 -0.777 0.11465

   variable       class outcome_0 outcome_1  pct_1 pct_0 odds    woe     miv
1  duration      (;8.5)        84        10 0.0333 0.120 3.60  1.281 1.1e-01
2  duration   <8.5;9.5)        35        14 0.0467 0.050 1.07  0.069 2.3e-04
3  duration  <9.5;11.5)        34         3 0.0100 0.049 4.86  1.580 6.1e-02
4  duration <11.5;12.5)       130        49 0.1633 0.186 1.14  0.128 2.9e-03
5  duration <12.5;15.5)        59        13 0.0433 0.084 1.95  0.665 2.7e-02
6  duration   <15.5;19)        72        43 0.1433 0.103 0.72 -0.332 1.3e-02
7  duration   <19;20.5)         7         1 0.0033 0.010 3.00  1.099 7.3e-03
8  duration <20.5;34.5)       191        85 0.2833 0.273 0.96 -0.038 3.9e-04
9  duration <34.5;37.5)        46        37 0.1233 0.066 0.53 -0.630 3.6e-02
10 duration <37.5;43.5)        12         5 0.0167 0.017 1.03  0.028 1.3e-05
11 duration     <43.5;)        30        40 0.1333 0.043 0.32 -1.135 1.0e-01
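
One common reading of a "logical trend" is that WoE moves in a single direction across ordered bins. The sketch below checks this for the three duration binnings printed above. The helper is_monotonic() is hypothetical and not part of the package, and strict monotonicity is only one possible criterion.

```r
# TRUE if the WoE values only rise or only fall from the first bin to the last
is_monotonic <- function(woe) {
  d <- diff(woe)
  all(d >= 0) || all(d <= 0)
}

# WoE values for duration, copied from the outputs above
woe_cp_020 <- c(0.18, -0.78)
woe_cp_005 <- c(0.887, 0.029, -0.777)
woe_cp_001 <- c(1.281, 0.069, 1.580, 0.128, 0.665, -0.332,
                1.099, -0.038, -0.630, 0.028, -1.135)

sapply(list(cp_0.02 = woe_cp_020, cp_0.005 = woe_cp_005, cp_0.001 = woe_cp_001),
       is_monotonic)
```

The two coarser binnings fall steadily with duration, while the eleven-bin version jumps up and down, so a finer tree buys more bins at the cost of a pattern that is harder to explain.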

Recoding Variables

Before running a logistic regression model, we would like to recode variables to WoE. For this task, we use the function iv.replace.woe(). I will use a smaller dataset to illustrate this:

> german_data_small <- german_data[c("duration","ca_status","credit_amount","gb")]
> str(german_data_small)
'data.frame':	1000 obs. of  4 variables:
 $ duration     : int  6 48 12 42 24 36 24 36 12 30 ...
 $ ca_status    : Factor w/ 4 levels "(;0DM)","<0DM;200DM)",..: 1 2 4 1 1 4 4 2 4 2 ...
 $ credit_amount: int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ gb           : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...
> german_data_small_woe <- iv.replace.woe(german_data_small,iv=iv.mult(german_data_small,"gb"))
> str(german_data_small_woe)
'data.frame':	1000 obs. of  7 variables:
 $ duration         : int  6 6 6 6 6 6 6 6 6 6 ...
 $ ca_status        : Factor w/ 4 levels "(;0DM)","<0DM;200DM)",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ credit_amount    : int  338 343 428 448 609 662 666 860 1169 1198 ...
 $ gb               : Factor w/ 2 levels "bad","good": 2 2 2 1 2 2 2 2 2 1 ...
 $ duration_woe     : num  0.887 0.887 0.887 0.887 0.887 ...
 $ ca_status_woe    : num  -0.818 -0.818 -0.818 -0.818 -0.818 ...
 $ credit_amount_woe: num  -0.076 -0.076 -0.076 -0.076 -0.076 ...

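You can see that the function iv.replace.woe() added three columns: duration_woe, ca_status_woe and credit_amount_woe. The str() output also shows that the rows of the result are in a different order than in the input.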
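
To make the recoding step concrete, here is a small base R sketch that does the same kind of replacement by hand for one variable and then fits a logistic regression on the recoded column. It rebuilds the housing variable from the good/bad counts printed earlier; the object names are illustrative and the lookup approach is an assumption, not the package's implementation.

```r
# Good/bad counts per housing level, copied from the iv.mult() output above
counts <- data.frame(
  housing = c("rent", "own", "for free"),
  good    = c(109, 527, 64),
  bad     = c(70, 186, 44)
)

# Expand the counts into one row per observation (good = 1, bad = 0)
toy <- data.frame(
  housing = rep(rep(counts$housing, 2), times = c(counts$good, counts$bad)),
  good    = rep(c(1, 0), times = c(sum(counts$good), sum(counts$bad)))
)

# WoE per level: ln(share of goods / share of bads)
counts$woe <- log((counts$good / sum(counts$good)) / (counts$bad / sum(counts$bad)))

# Recode by lookup, adding a new column next to the original one
toy$housing_woe <- counts$woe[match(toy$housing, counts$housing)]

# Logistic regression for P(good) on the WoE-coded variable
fit <- glm(good ~ housing_woe, data = toy, family = binomial())
coef(fit)
```

With a single WoE-coded predictor the fitted slope is 1 and the intercept equals the log of the overall good/bad odds, ln(700/300), because WoE is already on the log-odds scale.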

Using help

Because woe is a standard R package, there is documentation for every function. This is the complete list of available functions:

iv.num - calculate WoE/IV for numeric variables
iv.str - calculate WoE/IV for character/factor variables
iv.mult - calculate WoE/IV, summary IV for one or more variables
iv.plot.summary - plot IV summary
iv.plot.woe - plot WoE patterns for one or more variables
iv.replace.woe - recode original variables to WoE (adds new columns)

Final thoughts

I created this package mainly for learning purposes. It was fun learning how to use GitHub, devtools and RStudio to create a package. In another post there is a short tutorial on how to start your own package. You can also fork riv on GitHub and improve the package on your own, or contribute changes to my repository. I appreciate any feedback or comments.
