Refining the credit model(s)
To continue with the creditworthiness case, I want to explore it a bit further by adding techniques such as boosting, winnowing and cross-validation. Additionally, I'll use `randomForest` as a classifier.

I'm still using the same German credit data as in the previous post, as well as the same train/test split. Each model is stored in a single list object, `models`.
```r
# object that will store all the models in a list
models <- list()
```
I start with three different models, all generated with the `C5.0` algorithm. The first model is a default model with no extra features. The second model is amplified with the boosting feature: instead of generating just one classifier it generates several, and after each iteration it focuses more on the misclassified examples to reduce bias. The third model has the `winnow` parameter set to `TRUE`. Basically, it will search over the 20 attributes of the dataset and pre-select a subset of attributes that will be used to construct the decision tree or ruleset. Read more in the C5.0 tutorial.
```r
# C5.0 package
library(C50)

# train the default model
set.seed(2)
baseMod <- C5.0(training[, -1], training$Creditability)
# store baseMod into the models object
models$baseMod <- baseMod

# using boosting with the C5.0 model
set.seed(2)
BoostMod <- C5.0(training[, -1], training$Creditability, trials = 100)
# store BoostMod into the models object
models$BoostMod <- BoostMod

# using winnowing and boosting
set.seed(2)
WinnowMod <- C5.0(training[, -1], training$Creditability,
                  control = C5.0Control(winnow = TRUE), trials = 100)
# store WinnowMod into the models object
models$WinnowMod <- WinnowMod
```
So the models created thus far:
```r
names(models)
## [1] "baseMod"   "BoostMod"  "WinnowMod"
```
Performance measures
After training, let's gather the performance of these models on new examples. With the `ROCR` package we can compute lots of performance measures, such as the area under the curve, sensitivity, specificity and accuracy. I made a few functions, `accuracyTester`, `getPerformance`, `getSensSpec` and `getVarImportance`, so I can run them for each model in the `models` object.
```r
library(caret)  # provides postResample()
library(ROCR)   # provides prediction() and performance()

# function for returning accuracy on the test dataset
accuracyTester <- function(predictModel) {
  temp <- predict(predictModel, testing)
  postResample(temp, testing$Creditability)
}

# function for calculating performance,
# used as input for the ROC curve
getPerformance <- function(modelName) {
  score <- predict(modelName, type = "prob", testing)
  pred <- prediction(score[, 1], testing$Creditability)
  perf <- performance(pred, "tpr", "fpr")
  return(perf)
}

# function for calculating sensitivity and specificity
getSensSpec <- function(modelName) {
  score <- predict(modelName, type = "prob", testing)
  pred <- prediction(score[, 1], testing$Creditability)
  perf <- performance(pred, "sens", "spec")
  return(perf)
}
```
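The `getVarImportance` function mentioned above is not shown in the snippet. A minimal sketch of what it could look like for the C5.0 models, based on `C50::C5imp` (my own guess at the implementation, not the original code):

```r
# sketch: usage-based variable importance for a C5.0 model (hypothetical helper)
getVarImportance <- function(modelName) {
  imp <- C50::C5imp(modelName, metric = "usage")
  imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE]
}
```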
ROC and Accuracy plot
One method to evaluate the models is by calculating the overall accuracy of each model. The `BoostMod` has the highest accuracy.
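As an aside, here is a small sketch of how those accuracies could be gathered from the `models` list with the `accuracyTester` helper (the plot itself is my own quick version, not the original figure):

```r
# collect the test-set accuracy of every model in the list
accuracies <- sapply(models, function(m) accuracyTester(m)[["Accuracy"]])
# simple bar plot of the overall accuracy per model
barplot(sort(accuracies), las = 2, ylab = "Accuracy on test set")
```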
A more reliable method to evaluate a model's performance is the Receiver Operating Characteristic (ROC) curve. It's a widely used visualization technique for evaluating binary classifiers, and predicting good or bad creditworthiness is indeed a binary classification. A ROC curve is created by plotting the true positive rate (TPR, or sensitivity) against the false positive rate (FPR), thus it shows the TPR as a function of the FPR.

For each FPR the plot shows that `BoostMod` has the highest TPR.
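A minimal sketch of how such a ROC plot could be drawn with the `getPerformance` helper defined above (colours and legend placement are my own choices):

```r
# compute the ROC performance object for every model in the list
rocs <- lapply(models, getPerformance)

# plot the first curve and add the others on top
plot(rocs[[1]], col = 1, main = "ROC curves on the test set")
for (i in seq_along(rocs)[-1]) plot(rocs[[i]], col = i, add = TRUE)
abline(a = 0, b = 1, lty = 2)  # diagonal = random guessing
legend("bottomright", legend = names(rocs), col = seq_along(rocs), lty = 1)
```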
Caret package
Another great package I found is the `caret` package. It has a uniform interface to a lot of predictive algorithms. It also provides a generic approach to visualization, pre-processing, data splitting, variable importance, model performance and parallel processing. This comes in handy, since different modeling functions have different syntax for model training, prediction and parameter tuning.
`Caret` has bindings to the `C5.0` algorithm, so it will also tune the boosting and winnowing parameters. Another way to get a more reliable estimate of accuracy is k-fold cross-validation. Just for illustration I will use a 10-fold cross-validation, but only on the training set, so the test set remains available for the other performance measures.
This image illustrates the mechanics of cross-validation:
```r
# a list of values that define how the train function acts
ctrl <- trainControl(method = "repeatedcv",  # repeated k-fold cross-validation
                     number = 10,            # 10 folds
                     repeats = 5)            # 5 repeats

# train model
set.seed(2)
cvMod <- train(form = Creditability ~ .,
               data = training,
               method = "C5.0",
               trControl = ctrl,
               tuneGrid = expand.grid(trials = 15,
                                      model = c("tree", "rules"),
                                      winnow = c(TRUE, FALSE)))

# store cvMod into the models object
models$cvMod <- cvMod
```
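To see which parameter combination the resampling selected, the fitted `train` object can be inspected directly (a quick usage note, not part of the original post):

```r
# resampled accuracy per parameter combination and the winning settings
print(cvMod)
cvMod$bestTune
```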
So far I have explored the construction of a single classification tree with the `C5.0` package and tried to improve its performance by adding an ensemble learner (boosting). Looking at the ROC and accuracy plots, this seems to be the best performing model so far. Another ensemble learner can be built, for example, with the `randomForest` package. Instead of using boosting or cross-validation it uses a technique called bagging (**b**ootstrap **agg**regating).
Here, I’m only learning a forest on the training set, so I can evaluate its performance just like the other models.
```r
library(randomForest)

# RANDOM FOREST
set.seed(2)
rfModel <- randomForest(formula = Creditability ~ .,
                        data = training,
                        ntree = 500,
                        importance = TRUE,
                        proximity = TRUE,
                        keep.forest = TRUE)

# store rfModel into the models object
models$rfModel <- rfModel
```
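As a side note (not part of the original analysis), the `importance = TRUE` setting makes it possible to look at which attributes the forest relies on:

```r
# mean decrease in accuracy / Gini per attribute
varImpPlot(rfModel)
```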
A slight improvement can be seen on the ROC curve: the cross-validated model and the random forest sit slightly higher on the curve.

Another way to evaluate ROC performance is to calculate the area under the curve (AUC). Again, both of these models share the same, highest AUC (a sketch of the calculation follows the table below).
| baseMod | BoostMod | WinnowMod | cvMod | rfModel |
|---------|----------|-----------|-------|---------|
| 0.66    | 0.77     | 0.73      | 0.79  | 0.79    |
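A sketch of how these AUC values could be computed with `ROCR`, assuming the `models` list and `testing` set from above (the exact code is not in the original post):

```r
# area under the ROC curve for every model in the list
aucs <- sapply(models, function(m) {
  score <- predict(m, type = "prob", testing)
  pred  <- prediction(score[, 1], testing$Creditability)
  performance(pred, "auc")@y.values[[1]]
})
round(aucs, 2)
```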
The `rfModel` now has the highest accuracy, but at what cost?
I guess a bank will choose a more conservative approach and follow a strategy with a more precise prediction of bad creditworthiness. Thus, a bank will prefer to avoid false positives (predicted good, actual bad) rather than false negatives (predicted bad, actual good). So I would assume banks will likely choose a model with good specificity.

Given this notion, a bank can evaluate its options by looking at both:
- sensitivity = \(\frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false negatives}}\)
- specificity = \(\frac{\text{number of true negatives}}{\text{number of true negatives} + \text{number of false positives}}\)
For each given model:
| Model     | cutoff | sens | spec |
|-----------|--------|------|------|
| baseMod   | 0.78   | 0.81 | 0.50 |
| BoostMod  | 0.66   | 0.66 | 0.78 |
| WinnowMod | 0.62   | 0.78 | 0.60 |
| cvMod     | 0.64   | 0.70 | 0.76 |
| rfModel   | 0.68   | 0.66 | 0.80 |
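One way such numbers could be derived from the `getSensSpec` helper is to pick, for each model, the cutoff where sensitivity and specificity are closest together. This is my own selection rule for illustration; the original post does not show how its cutoffs were chosen:

```r
# sensitivity and specificity at the cutoff where the two are closest
sensSpecTable <- t(sapply(models, function(m) {
  perf <- getSensSpec(m)
  cuts <- perf@alpha.values[[1]]   # probability cutoffs
  sens <- perf@y.values[[1]]       # sensitivity at each cutoff
  spec <- perf@x.values[[1]]       # specificity at each cutoff
  i <- which.min(abs(sens - spec))
  c(cut = cuts[i], sens = sens[i], spec = spec[i])
}))
round(sensSpecTable, 2)
```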