A short note on information-theoretic variable screening in R with {Information}.
Variable screening is an important step in contemporary EDA for predictive modeling: what can we tell about the nature of the relationships between a set of predictors and the dependent variable before entering the modeling phase? Can we infer something about the predictive power of the independent variables before we start rolling them into a predictive model? In this blog post I will discuss two information-theoretic measures that are common in variable screening for binary classification and regression models in the credit risk arena (which does not mean that they could not do you some good in any other application of predictive modeling as well). I will first introduce the Weight of Evidence (WoE) and Information Value (IV) of a variable with respect to a binary outcome. Then I will illustrate their computation (it’s fairly easy, in fact) with the {Information} package in R.
Weight of Evidence
Take the common Bayesian hypothesis test (or a Bayes factor, if you prefer):
$$K = \frac{P(D \mid M_1)}{P(D \mid M_2)}$$
and assume your models M1, M2 of the world* are simply two discrete possible states of a binary variable Y, while the data are given by discrete distributions of some predictor X (or X stands for a binned continuous distribution); for every category j in X, j = 1, 2, ..., n, take the log:
$$WoE_j = \log \frac{P(x_j \mid Y = 1)}{P(x_j \mid Y = 0)}$$
and you will get a simple measure of evidence in favor of M1 against M2 that Good has described as Weight of Evidence (WoE). In theory, any monotonic transformation of the odds would do, but the logarithm brings the intuitive advantage of yielding a negative WoE when the odds are less than one and a positive WoE when they are greater than one. In the simple setting where the analysis encompasses a binary dependent Y and a discrete (or binned continuous) predictor X, we are simply inspecting the conditional distribution of X given Y:
$$WoE_j = \log \frac{f(x_j \mid Y = 1) \,/\, \sum_{k=1}^{n} f(x_k \mid Y = 1)}{f(x_j \mid Y = 0) \,/\, \sum_{k=1}^{n} f(x_k \mid Y = 0)}$$
where f denotes counts.
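A toy illustration may help before turning to real data. The sketch below uses made-up counts for a three-bin predictor (the counts data frame and its column names are hypothetical, not part of the original analysis) and takes the log ratio of the two conditional distributions of X given Y:

```r
# Hypothetical counts of a binned predictor X by outcome Y (illustration only)
counts <- data.frame(
  bin = c("low", "mid", "high"),
  y0  = c(50, 30, 20),  # f(x_j | Y = 0)
  y1  = c(10, 20, 20)   # f(x_j | Y = 1)
)

# Conditional distributions of X given Y
p1 <- counts$y1 / sum(counts$y1)
p0 <- counts$y0 / sum(counts$y0)

# Weight of Evidence per bin: log of P(x_j | Y = 1) / P(x_j | Y = 0)
counts$WOE <- log(p1 / p0)
print(counts)
```

Bins where Y = 1 is over-represented relative to Y = 0 get a positive WoE, and bins where it is under-represented get a negative one.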
Let’s illustrate the computation of WoE in this setting for a variable from a well-known dataset**. We have one categorical, binary dependent:
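# recode() below is assumed to be dplyr::recode()
library(dplyr)

dataSet <- read.table('bank-additional-full.csv',
                      header = TRUE,
                      strip.white = FALSE,
                      sep = ";")
str(dataSet)
table(dataSet$y)
# recode the binary outcome: yes -> 1, no -> 0
dataSet$y <- recode(dataSet$y,
                    'yes' = 1,
                    'no' = 0)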
and we want to compute the WoE for, say, the age variable. Here it goes:
# - compute WOE for: dataSet$age
bins <- 10
# decile cut points of age
q <- quantile(dataSet$age,
              probs = c(1:(bins - 1)/bins),
              na.rm = TRUE,
              type = 3)
cuts <- unique(q)
# cross-tabulate the age bins against the outcome
aggAge <- table(findInterval(dataSet$age,
                             vec = cuts,
                             rightmost.closed = FALSE),
                dataSet$y)
aggAge <- as.data.frame.matrix(aggAge)
aggAge$N <- rowSums(aggAge)
# WoE per bin: log of P(bin | y = 1) / P(bin | y = 0)
aggAge$WOE <- log((aggAge$`1`*sum(aggAge$`0`))/(aggAge$`0`*sum(aggAge$`1`)))
In the previous example I have binned X (age, in this case) in exactly the way that is used in the R package {Information}, whose application I want to illustrate later. The table() call provides the conditional distributions like the ones shown in the table above. The computation of WoE is then straightforward, as exemplified in the last line. However, you want to spare yourself from computing the WoE in this way for many variables in the dataset, and that’s where {Information} in R comes in handy; for the same dataset:
library(Information)

# - Information value: all variables
infoTables <- create_infotables(data = dataSet,
                                y = "y",
                                bins = 10,
                                parallel = TRUE)
# - WOE table:
infoTables$Tables$age$WOE
with the respective data frames in infoTables$Tables standing for the variables in the dataset.
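To cross-check the package output by hand for more than one variable, the manual steps shown earlier for age can be wrapped in a small helper. This is only a sketch: the function name woe_by_quantile and the simulated data are hypothetical, and the function mirrors the quantile binning used above rather than calling anything from {Information}.

```r
# Hypothetical helper: WoE table for a numeric predictor x and a 0/1 outcome y
woe_by_quantile <- function(x, y, bins = 10) {
  # quantile cut points, as in the manual age example
  q    <- quantile(x, probs = (1:(bins - 1)) / bins, na.rm = TRUE, type = 3)
  cuts <- unique(q)
  # assign each observation to a bin and cross-tabulate against y
  bin  <- findInterval(x, vec = cuts, rightmost.closed = FALSE)
  agg  <- as.data.frame.matrix(table(bin, y))
  agg$N   <- rowSums(agg)
  # WoE per bin: log of P(bin | y = 1) / P(bin | y = 0)
  agg$WOE <- log((agg$`1` * sum(agg$`0`)) / (agg$`0` * sum(agg$`1`)))
  agg
}

# Simulated data (assumption: not the bank marketing dataset)
set.seed(1)
x <- round(runif(2000, 18, 90))
y <- rbinom(2000, 1, plogis(-2 + 0.02 * x))

res <- woe_by_quantile(x, y, bins = 5)
print(res)
```

Note that a bin with no observations in one of the two classes would produce an infinite WoE; the sketch does not guard against that case.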
Information Value
A straightforward definition of the Information Value (IV) of a variable is provided in the {Information} package vignette:
$$IV = \sum_{j=1}^{n} \left( P(x_j \mid Y = 1) - P(x_j \mid Y = 0) \right) \cdot WoE_j$$
In effect, this means that we are summing across the individual WoE values (i.e. for each bin j of X) and weighting them by the respective differences between P(xj|Y=1) and P(xj|Y=0). The IV of a variable measures its predictive power, and variables with IV < .05 are generally considered to have a low predictive power.
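The weighted sum described here can be traced on a small example. The sketch below uses made-up counts for a three-bin predictor (the vectors y0 and y1 are hypothetical) and computes the IV as the sum of the per-bin WoE values weighted by the differences between the two conditional probabilities:

```r
# Hypothetical counts per bin of X (illustration only)
y0 <- c(50, 30, 20)  # counts where Y = 0
y1 <- c(10, 20, 20)  # counts where Y = 1

# Conditional distributions of X given Y
p1 <- y1 / sum(y1)
p0 <- y0 / sum(y0)

# Per-bin Weight of Evidence
woe <- log(p1 / p0)

# Per-bin contributions and the Information Value
iv_terms <- (p1 - p0) * woe
iv <- sum(iv_terms)
print(iv_terms)
print(iv)
```

Each contribution is non-negative, because the difference p1 - p0 and the log ratio always share the same sign, so the IV can only grow as bins are added to the sum.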
Using {Information} in R, for the dataset under consideration:
library(ggplot2)

# - Information value: all variables
infoTables <- create_infotables(data = dataSet,
                                y = "y",
                                bins = 10,
                                parallel = TRUE)
# - Plot IV
# order variables by decreasing IV so the bars appear sorted
plotFrame <- infoTables$Summary[order(-infoTables$Summary$IV), ]
plotFrame$Variable <- factor(plotFrame$Variable,
                             levels = plotFrame$Variable[order(-plotFrame$IV)])
ggplot(plotFrame, aes(x = Variable, y = IV)) +
  geom_bar(width = .35, stat = "identity", color = "darkblue", fill = "white") +
  ggtitle("Information Value") +
  theme_bw() +
  theme(plot.title = element_text(size = 10)) +
  theme(axis.text.x = element_text(angle = 90))
You may have noted the usage of parallel = T in the create_infotables() call; the {Information} package will try to use all available cores to speed up the computations by default. Besides the basic package functionality that I have illustrated, the package provides a natural way of dealing with uplift models, where the computation of the IVs for the control vs. treatment designs is nicely automated. Cross-validation procedures are also built-in.
However, now that we know that we have a nice, working package for WoE and IV estimation in R, let’s restrain ourselves from using it to perform automatic feature selection for models like binary logistic regression. While the information-theoretic measures discussed here truly assess the predictive power of a single predictor in binary classification, building a model that encompasses multiple terms is another story. Do not get disappointed if you find that the AICs for the full models are still lower than those for the nested models obtained by feature selection based on the IV values; although they can provide useful guidelines, WoE and IV are not even meant to be used that way (I’ve tried: once with the dataset used in the previous examples, and then with the two {Information} built-in datasets; not too much of a success, as you may have guessed).
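One way to run the kind of check described above is to fit the full model and a nested model and compare their AICs. The sketch below does so on simulated data; the variable names, the coefficients, and the choice of which predictor to keep are all hypothetical stand-ins for a selection based on IV values, not a reproduction of the experiments mentioned in the text.

```r
# Simulated data (assumption: three predictors, one of them dominant)
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.0 * x1 + 0.3 * x2 + 0.2 * x3))
d  <- data.frame(y, x1, x2, x3)

# Full binary logistic regression
full <- glm(y ~ x1 + x2 + x3, family = binomial, data = d)

# Nested model: pretend an IV-based screen kept only x1 (hypothetical choice)
nested <- glm(y ~ x1, family = binomial, data = d)

# Lower AIC indicates the better trade-off between fit and complexity
print(AIC(full, nested))
```

If the full model comes out with the lower AIC, the predictors dropped by the screen were still carrying information that the model could use jointly with the others.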
References
* For parametric models, you would need to integrate over the full parameter space, of course; taking the MLEs would result in obtaining the standard LR test.
** The dataset is considered in S. Moro, P. Cortez and P. Rita (2014). A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014. I have obtained it from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing (N.B. https://archive.ics.uci.edu/ml/machine-learning-databases/00222/, file: bank-additional.zip); a nice description of the dataset is found at http://www2.1010data.com/documentationcenter/beta/Tutorials/MachineLearningExamples/BankMarketingDataSet.html
Goran S. Milovanović, PhD
Data Science Consultant, SmartCat