R Credit Scoring – WoE & Information Value in woe Package
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In credit scoring, Information Value (IV) is frequently used to compare the predictive power of variables. When developing new scorecards using logistic regression, variables are often binned and recoded using the Weight of Evidence (WoE) concept. The riv package will help you assess the predictive power of variables, examine WoE patterns and recode raw variables to WoE.
Introduction
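I assume that the reader has some basic experience in credit scoring. One of our goals when binning variables is to maximize Information Value. Weight of Evidence (WoE) for a single bin i is defined as:
$$\mathrm{WoE}_i = \ln\left(\frac{\%\mathrm{Good}_i}{\%\mathrm{Bad}_i}\right)$$
Information Value for a variable is defined as:
$$\mathrm{IV} = \sum_{i=1}^{n} \left(\%\mathrm{Good}_i - \%\mathrm{Bad}_i\right) \cdot \mathrm{WoE}_i$$
where n is the number of bins of the variable, %Good_i is the share of all good observations that fall into bin i, and %Bad_i is the corresponding share of bad observations.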
To illustrate the concept, here is an example for the variable age from the German credit scoring dataset:
Class | Good | Bad | %Good | %Bad | Odds | WoE | MIV |
(;25.5) | 110 | 80 | 15.7% | 26.7% | 0.59 | -0.53 | 0.06 |
<25.5;27.5) | 74 | 27 | 10.6% | 9.0% | 1.17 | 0.16 | 0.00 |
<27.5;34.5) | 172 | 85 | 24.6% | 28.3% | 0.87 | -0.14 | 0.01 |
<34.5;38.5) | 108 | 24 | 15.4% | 8.0% | 1.93 | 0.66 | 0.05 |
<38.5;) | 236 | 84 | 33.7% | 28.0% | 1.20 | 0.19 | 0.01 |
IV: | 0.13 |
The total Information Value of 0.13 indicates medium predictive power.
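The arithmetic behind this table can be reproduced in a few lines of base R. This is a minimal sketch that uses only the good and bad counts from the table; the variable names (good, bad, pct_good and so on) are chosen for illustration and are not part of the package.

```r
# Good/bad counts per age bin, copied from the table above
bins <- c("(;25.5)", "<25.5;27.5)", "<27.5;34.5)", "<34.5;38.5)", "<38.5;)")
good <- c(110, 74, 172, 108, 236)
bad  <- c(80, 27, 85, 24, 84)

pct_good <- good / sum(good)   # share of all goods falling into each bin
pct_bad  <- bad / sum(bad)     # share of all bads falling into each bin

odds <- pct_good / pct_bad
woe  <- log(odds)                    # natural logarithm of the odds
miv  <- (pct_good - pct_bad) * woe   # contribution of each bin to IV
iv   <- sum(miv)                     # Information Value of the variable

print(data.frame(bins, good, bad,
                 odds = round(odds, 2),
                 woe  = round(woe, 2),
                 miv  = round(miv, 2)))
print(round(iv, 2))
```

Every bin contributes a non-negative amount to IV, because the difference of the shares and the logarithm of their ratio always have the same sign.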
riv Package
The riv package will help you analyze WoE patterns and Information Value for a whole modeling dataset. The main features are:
- calculate Information Value for variable(s)
- recode original variables to WoE
- plot WoE patterns for variable(s)
- plot Information Value for variable(s)
One of the best features of the riv package is automated binning of numeric variables. It uses the rpart package and allows the user to pass specific rpart.control() values. The “German Credit Data” dataset is used for testing; it is also included in the package.
Install package
riv is hosted on GitHub, and I prefer to use devtools for installation:
library(devtools)
install_github("riv","tomasgreif")
library(woe)
Calculate Information Value
We can use the function iv.mult() to calculate Information Value for all variables in a data frame:
iv.mult(german_data,"gb",TRUE)
This will print the following table:
   Variable                  InformationValue Bins ZeroBins Strength
1  ca_status                 0.666011503      4    0        Very strong
2  credit_history            0.293233547      5    0        Strong
3  duration                  0.259146834      5    0        Strong
4  credit_amount             0.207970035      5    0        Strong
5  savings                   0.196009557      5    0        Average
6  purpose                   0.169195066      10   0        Average
7  age                       0.125210683      5    0        Average
8  property                  0.112638262      4    0        Average
9  present_employment_since  0.086433631      5    0        Weak
10 housing                   0.083293434      3    0        Weak
11 other_installment         0.057614542      3    0        Weak
12 status_sex                0.044670678      5    1        Weak
13 foreign_worker            0.043877412      2    0        Weak
14 other_debtors             0.032019322      3    0        Weak
15 installment_rate_income   0.023858552      2    0        Weak
16 existing_credits          0.010083557      2    0        Wery weak
17 job                       0.008762766      4    0        Wery weak
18 telephone                 0.006377605      2    0        Wery weak
19 liable_maintenance_people 0.000000000      1    0        Wery weak
20 present_residence_since   0.000000000      1    0        Wery weak
The output has five columns: variable name, Information Value, number of bins, number of bins where the count of either goods or bads is zero, and an overall assessment of predictive strength. The variables duration, credit_amount and age are numeric, so riv fitted an rpart model to each of them to find the best possible binning.
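The post does not say which cut-offs the package uses for the Strength column. The sketch below assumes a common rule of thumb (0.02, 0.1, 0.2 and 0.5), which happens to agree with every row of the table above; the helper iv_strength() is hypothetical and not a function of the package.

```r
# Hypothetical helper: map Information Value to a strength label.
# The cut-offs are an assumed rule of thumb, not taken from the package source.
iv_strength <- function(iv) {
  cut(iv,
      breaks = c(-Inf, 0.02, 0.1, 0.2, 0.5, Inf),
      labels = c("Very weak", "Weak", "Average", "Strong", "Very strong"),
      right  = FALSE)   # intervals are closed on the left: [0.02, 0.1), ...
}

# A few Information Values copied from the iv.mult() output above
iv_values <- c(ca_status     = 0.666011503,
               credit_amount = 0.207970035,
               savings       = 0.196009557,
               age           = 0.125210683,
               housing       = 0.083293434,
               job           = 0.008762766)

print(data.frame(InformationValue = iv_values,
                 Strength = iv_strength(iv_values)))
```

Labels like these are only a first screen: a variable with a weak IV can still add value in combination with others, and a very high IV is often a reason to double-check the data.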
Plot Results
There is a simple function, iv.plot.summary(), that we will use to plot the results of iv.mult():
iv.plot.summary(iv.mult(german_data,"gb",TRUE))
This will result in:
Analyze individual variables
In scorecard development it is important for WoE to follow a logical trend across bins. With riv you can analyze WoE patterns for one or more variables. If you need only specific variables, you can use the vars parameter:
options(digits=2)
iv.mult(german_data,"gb",vars=c("housing","duration"))
This results in:
[[1]]
  variable class    outcome_0 outcome_1 pct_1 pct_0 odds woe   miv
1 housing  rent     109       70        0.23  0.156 0.67 -0.40 0.031
2 housing  own      527       186       0.62  0.753 1.21  0.19 0.026
3 housing  for free 64        44        0.15  0.091 0.62 -0.47 0.026

[[2]]
  variable class       outcome_0 outcome_1 pct_1 pct_0 odds woe    miv
1 duration (;11.5)     153       27        0.09  0.22  2.43  0.887 1.1e-01
2 duration <11.5;15.5) 189       62        0.21  0.27  1.31  0.267 1.7e-02
3 duration <15.5;19)   72        43        0.14  0.10  0.72 -0.332 1.3e-02
4 duration <19;34.5)   198       86        0.29  0.28  0.99 -0.013 5.1e-05
5 duration <34.5;)     88        82        0.27  0.13  0.46 -0.777 1.1e-01
Column descriptions:
- variable - variable name
- class - name of bin (interval from rpart tree for numeric variables, variable value otherwise)
- outcome_0 - number of good observations
- outcome_1 - number of bad observations
- pct_1 - bad observations in bin / total bad observations
- pct_0 - good observations in bin / total good observations
- odds - pct_0/pct_1
- woe - Weight of Evidence - calculated as ln(odds)
- miv - Marginal Information Value - calculated as ln(odds) * (pct_0 - pct_1)
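To see how these columns fit together, here is a minimal base R sketch that rebuilds the housing table from raw vectors. The helper woe_table() is hypothetical and is not the package's implementation. It follows the printed output, where outcome_0 counts goods and outcome_1 counts bads, so pct_0 is the share of goods and pct_1 the share of bads. The raw vectors are reconstructed from the counts shown above.

```r
# Hypothetical helper, not the package's implementation:
# build a WoE table for a categorical variable x and a 0/1 outcome y,
# where 0 = good and 1 = bad (as in the output above).
woe_table <- function(x, y) {
  counts <- table(x, y)                  # rows: classes, columns: "0" and "1"
  outcome_0 <- as.vector(counts[, "0"])
  outcome_1 <- as.vector(counts[, "1"])
  pct_0 <- outcome_0 / sum(outcome_0)    # share of goods in each class
  pct_1 <- outcome_1 / sum(outcome_1)    # share of bads in each class
  odds  <- pct_0 / pct_1                 # infinite or zero if a count is zero
  woe   <- log(odds)
  miv   <- woe * (pct_0 - pct_1)
  data.frame(class = rownames(counts), outcome_0, outcome_1,
             pct_1, pct_0, odds, woe, miv)
}

# Rebuild raw vectors from the housing counts printed above
housing <- factor(rep(c("rent", "own", "for free"), times = c(179, 713, 108)),
                  levels = c("rent", "own", "for free"))
outcome <- c(rep(0:1, times = c(109, 70)),
             rep(0:1, times = c(527, 186)),
             rep(0:1, times = c(64, 44)))

res <- woe_table(housing, outcome)
print(res, digits = 2)
print(sum(res$miv))   # Information Value of housing
```

The sketch makes no attempt to handle classes with a zero count of goods or bads, which is exactly the situation the ZeroBins column of iv.mult() flags.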
You can also plot WoE patterns with the iv.plot.woe() function:
iv.plot.woe(iv.mult(german_data,"gb",vars=c("housing","duration"),summary=FALSE))
Control rpart parameters
For numeric variables you can pass your own rpart.control(). I will illustrate this for the variable duration and the complexity parameter cp:
iv.num(german_data,"duration","gb",rcontrol=rpart.control(cp=.02))
iv.num(german_data,"duration","gb",rcontrol=rpart.control(cp=.005))
iv.num(german_data,"duration","gb",rcontrol=rpart.control(cp=.001))
This is the result of the previous commands. Note how the number of leaves increases as cp decreases:
  variable class    outcome_0 outcome_1 pct_1 pct_0 odds woe   miv
1 duration (;34.5)  612       218       0.73  0.87  1.20  0.18 0.027
2 duration <34.5;)  88        82        0.27  0.13  0.46 -0.78 0.115

  variable class       outcome_0 outcome_1 pct_1 pct_0 odds woe    miv
1 duration (;11.5)     153       27        0.09  0.22  2.43  0.887 0.11408
2 duration <11.5;34.5) 459       191       0.64  0.66  1.03  0.029 0.00056
3 duration <34.5;)     88        82        0.27  0.13  0.46 -0.777 0.11465

   variable class       outcome_0 outcome_1 pct_1  pct_0 odds woe    miv
1  duration (;8.5)      84        10        0.0333 0.120 3.60  1.281 1.1e-01
2  duration <8.5;9.5)   35        14        0.0467 0.050 1.07  0.069 2.3e-04
3  duration <9.5;11.5)  34        3         0.0100 0.049 4.86  1.580 6.1e-02
4  duration <11.5;12.5) 130       49        0.1633 0.186 1.14  0.128 2.9e-03
5  duration <12.5;15.5) 59        13        0.0433 0.084 1.95  0.665 2.7e-02
6  duration <15.5;19)   72        43        0.1433 0.103 0.72 -0.332 1.3e-02
7  duration <19;20.5)   7         1         0.0033 0.010 3.00  1.099 7.3e-03
8  duration <20.5;34.5) 191       85        0.2833 0.273 0.96 -0.038 3.9e-04
9  duration <34.5;37.5) 46        37        0.1233 0.066 0.53 -0.630 3.6e-02
10 duration <37.5;43.5) 12        5         0.0167 0.017 1.03  0.028 1.3e-05
11 duration <43.5;)     30        40        0.1333 0.043 0.32 -1.135 1.0e-01
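The idea of tree-based binning can be sketched with rpart directly. The helper tree_bins() below is hypothetical and only mimics the approach described in this post, not the package's actual code: it fits a classification tree on a single numeric variable and uses the tree's split points as bin boundaries. The data are simulated for illustration and are not the German credit data, so the bins will differ from those shown above.

```r
library(rpart)

# Hypothetical helper: bin numeric x by the split points of a
# classification tree fitted on x alone.
tree_bins <- function(x, y, control = rpart.control()) {
  d <- data.frame(x = x, y = factor(y))
  fit <- rpart(y ~ x, data = d, method = "class", control = control)
  # fit$splits is NULL when the tree consists of the root node only
  cuts <- if (is.null(fit$splits)) {
    numeric(0)
  } else {
    sort(unique(fit$splits[, "index"]))
  }
  # Left-closed intervals, similar in spirit to the <a;b) labels above
  cut(x, breaks = c(-Inf, cuts, Inf), right = FALSE)
}

# Simulated data: the probability of a bad outcome rises with x
set.seed(1)
x <- sample(4:72, 1000, replace = TRUE)
y <- rbinom(1000, 1, plogis(-2 + 0.04 * x))

# Smaller cp values allow more splits and therefore more bins
for (cp in c(0.02, 0.005, 0.001)) {
  bins <- tree_bins(x, y, rpart.control(cp = cp))
  cat("cp =", cp, "->", nlevels(bins), "bins\n")
}
```

A very small cp tends to produce narrow bins with few observations and an erratic WoE pattern, as in the eleven-bin output above, so the finest binning is rarely the one you want in a scorecard.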
Recoding Variables
Before fitting a logistic regression model we would like to recode the variables to WoE. For this task, we use the function iv.replace.woe(). I will use a smaller dataset to illustrate this:
> german_data_small <- german_data[c("duration","ca_status","credit_amount","gb")]
> str(german_data_small)
'data.frame': 1000 obs. of 4 variables:
 $ duration     : int 6 48 12 42 24 36 24 36 12 30 ...
 $ ca_status    : Factor w/ 4 levels "(;0DM)","<0DM;200DM)",..: 1 2 4 1 1 4 4 2 4 2 ...
 $ credit_amount: int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ gb           : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...
> german_data_small_woe <- iv.replace.woe(german_data_small,iv=iv.mult(german_data_small,"gb"))
> str(german_data_small_woe)
'data.frame': 1000 obs. of 7 variables:
 $ duration         : int 6 6 6 6 6 6 6 6 6 6 ...
 $ ca_status        : Factor w/ 4 levels "(;0DM)","<0DM;200DM)",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ credit_amount    : int 338 343 428 448 609 662 666 860 1169 1198 ...
 $ gb               : Factor w/ 2 levels "bad","good": 2 2 2 1 2 2 2 2 2 1 ...
 $ duration_woe     : num 0.887 0.887 0.887 0.887 0.887 ...
 $ ca_status_woe    : num -0.818 -0.818 -0.818 -0.818 -0.818 ...
 $ credit_amount_woe: num -0.076 -0.076 -0.076 -0.076 -0.076 ...
You can see that iv.replace.woe() added three columns: duration_woe, ca_status_woe and credit_amount_woe. Note that, judging by the str() output, the rows come back in a different order than in the input.
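Conceptually, recoding to WoE is a lookup: each observation receives the WoE of the bin it falls into. The following base R sketch shows the idea for the categorical variable housing, using the WoE values from the iv.mult() output shown earlier. The small data frame is made up for illustration, and this is not how iv.replace.woe() is implemented internally.

```r
# WoE per housing class, copied from the iv.mult() output shown earlier
woe_lookup <- c("rent" = -0.40, "own" = 0.19, "for free" = -0.47)

# A few example rows (hypothetical, for illustration only)
df <- data.frame(housing = factor(c("own", "rent", "for free", "own")))

# Look up the WoE of each row's class and store it in a new *_woe column
df$housing_woe <- unname(woe_lookup[as.character(df$housing)])

print(df)
```

For a numeric variable the same lookup would apply after the raw values have been assigned to their bins, for example with cut().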
Using help
Because woe is a standard R package, every function is documented. This is the complete list of available functions:
- iv.num - calculate WoE/IV for numeric variables
- iv.str - calculate WoE/IV for character/factor variables
- iv.mult - calculate WoE/IV, summary IV for one or more variables
- iv.plot.summary - plot IV summary
- iv.plot.woe - plot WoE patterns for one or more variables
- iv.replace.woe - recode original variables to WoE (adds new columns)
Final thoughts
I created this package mainly for learning purposes. It was fun learning how to use GitHub, devtools and RStudio to create a package. In another post there is a short tutorial on how to start your own package. You can also fork riv on GitHub and improve the package on your own, or contribute changes to my repository. I appreciate any feedback or comments.