Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
While looking for a interesting Machine Learning exercise I decided to go along with credit scoring exercise. I want to know what kind of information influences the decision for giving someone credit (or not). Typically, a bank would ask you to fill in some kind of assesment form with question about demographics, purpose of the loan, your status of employment and salary. Today, this is not a standard proces.
The problem here is we want to predict the creditability of new clients based on clients from the past. Client whom we already know if they are credit-worthy or not. One way to solve this problem is with a machine learning such as decision trees. Other algorithms can be used aswell, but decision trees are more comprehensive at the end as I want to know what kind of information has the most influence in this determination proces.
In this example we already know which client are credit-worthy and which are not (= qualitative characteristic), based on 21 quantitative characteristics. All we have to do is to find (alot of) tests/decisions that partitions those clients as well as possible. Hereby, we want decisions that at the end group the good or bad credit-worthy clients as pure as possible. For each possible decisions (or split in the decision tree) we can define a measure of expected purity. During these path of decisions we follow those decisisons with the most Information Gain, thus reducing the missing information the most.
Get the data
The dataset I’m using has already been preprocessed. If you would like the original dataset take a look at the UCI Machine Learning Repository. The dataset holds both numeric and categoric variables, but has been mapped to only numeric variables. See the appendix here for the final categorical classification.
# Used packages require(dplyr) # for most data manipulation require(caret) # acces to alot of predictors require(ggplot2) # nice plots require(gridExtra) # multiplots # Loading data credit <- read.csv("datasets/german_credit.csv", stringsAsFactors = FALSE) # Remove . dots in names names(credit) <- gsub(pattern = "\\.", x = names(credit), replacement = "") # Make target class factor credit$Creditability <- factor(credit$Creditability, levels = c(1, 0), labels = c("Good", "Bad"))
Exploring the data
Quick overview of few characteristics to get an idea of the dataset:
# dimension: 20 quant. characteristics and 1 target class (creditability) dim(credit) ## [1] 1000 21 # show few records credit[18:24, 1:6] ## Creditability AccountBalance DurationofCreditmonth ## 18 Good 2 18 ## 19 Good 2 36 ## 20 Good 4 11 ## 21 Good 1 6 ## 22 Good 2 12 ## 23 Bad 2 36 ## 24 Good 2 12 ## PaymentStatusofPreviousCredit Purpose CreditAmount ## 18 2 3 3213 ## 19 4 3 2337 ## 20 4 0 7228 ## 21 4 0 3676 ## 22 4 0 3124 ## 23 2 5 2384 ## 24 4 4 1424
A bivariate visualization with summary statistics can provide usefull insights of the dataset, as well at the target variable. Most people asking for a credit is the group between 20-40 years but it not so obvious to decide which age is better to get a credit or not.
This multivariate visualizations of only a few variables shows that it’s not that simple to group the classes of creditability (good/bad). All points lie upon each other for the characteristics credit amount, duration of credit and age.
## Loading required package: AppliedPredictiveModeling
Let’s look at the summary of the qualitative characteristic
summary(credit$Creditability) ## Good Bad ## 700 300
The dataset is slightly unbalanced to learn from.
Preparing the data
Before training machine learning algorithms we need to split up the dataset into a training- and testset. This can be done with base R
, dplyr
or with caret
’s function createDataPartition
, which will take a stratified sample. Thus, respecting the distribution of our target label in both sets.
# Split full dataset into a training and test set set.seed(2) # dplyr credit = sample_n(credit, size = nrow(credit)) training = sample_frac(credit, size =0.7) testing = setdiff(credit, training) # caret TrainPart <- createDataPartition(y = credit$Creditability, times = 1, p = 0.7, list = FALSE) training = credit[TrainPart,] testing = credit[-TrainPart,]
Training a model
Very popular classification algorithms are C4.5
and C5.0
. For this problem I choose the C5.0
because it incorporates a misclassification cost and C4.5
doesn’t have this feature. This mean that I don’t want all errors treated as equal but more like this: If the model predicts “good credithworthiness” but it is actual a “bad” one, these errors should get a higher cost in order to avoid them while learning our decision tree.
Train
Our model has learned some ruleset that can be visualized as a tree. Note that not all leaves (terminal nodes) are pure!
We can also take a look at the overal predictor importance, since I’m mostly interested in this.
# ?C5imp C5varImp <- C5imp(credit_model, metric = "usage") C5varImp ## Overall ## AccountBalance 100.00 ## Typeofapartment 59.29 ## DurationofCreditmonth 58.14 ## PaymentStatusofPreviousCredit 53.57 ## Guarantors 51.14 ## Mostvaluableavailableasset 48.86 ## ConcurrentCredits 43.86 ## CreditAmount 39.14 ## Lengthofcurrentemployment 38.86 ## Purpose 32.86 ## Occupation 26.57 ## SexMaritalStatus 23.00 ## Instalmentpercent 15.57 ## ValueSavingsStocks 12.00 ## DurationinCurrentaddress 11.29 ## Ageyears 8.57 ## Telephone 8.43 ## Noofdependents 5.86 ## NoofCreditsatthisBank 3.71 ## ForeignWorker 3.71
More details about the model are shown with summary(credit_model)
. The model has an error rate of 6.9% (48/700 examples misclassified). A total of 34 actual bad creditability were incorrectly classified as good credithworthiness (false positives), while 14 good creditability were misclassified as bad ones (false negatives). The error rate on the training data is probably very optimistic, so lets evaluate this on the testset.
Evaluating model performance on new examples
# run predict model on new examples predClass <- predict.C5.0(credit_model, testing) # Confusion matrix require(gmodels) ## Loading required package: gmodels CrossTable(testing$Creditability, predClass, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual default', 'predicted default')) ## ## ## Cell Contents ## |-------------------------| ## | N | ## | N / Table Total | ## |-------------------------| ## ## ## Total Observations in Table: 300 ## ## ## | predicted default ## actual default | Good | Bad | Row Total | ## ---------------|-----------|-----------|-----------| ## Good | 173 | 37 | 210 | ## | 0.577 | 0.123 | | ## ---------------|-----------|-----------|-----------| ## Bad | 55 | 35 | 90 | ## | 0.183 | 0.117 | | ## ---------------|-----------|-----------|-----------| ## Column Total | 228 | 72 | 300 | ## ---------------|-----------|-----------|-----------| ## ##
On the testset a total of 92/300 cases are misclassified. Thus, this gives us an error rate of 30%. This is a lot higher than the 6.9% error rate from the traininset. The model is about 70% accurate… However, this metric is not ideal to use with unbalanced data, but gives a sense of the overall prediction.
Methods to improve the model: Boosting
Boosting algorithms consist of iteratively learning weak classifiers and adding them together to end with a final strong classifier. During each iteration, weights are given to those examples that are misclassified whereas examples lose weights when classified correctly. Thus, the next iteration focusses more on the examples that previously were misclassified.
The C5.0
algorithm has a boost functionality that can be controlled with the trails
parameter.
credit_boost <- C5.0(training[-1], training$Creditability, trials = 15) # an integer specifying the number of boosting iterations # summary(credit_boost) boostPred <- predict.C5.0(credit_boost, testing, trials = credit_boost$trials["Actual"]) # Confusio matrix CrossTable(testing$Creditability, boostPred, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual default', 'predicted default')) ## ## ## Cell Contents ## |-------------------------| ## | N | ## | N / Table Total | ## |-------------------------| ## ## ## Total Observations in Table: 300 ## ## ## | predicted default ## actual default | Good | Bad | Row Total | ## ---------------|-----------|-----------|-----------| ## Good | 180 | 30 | 210 | ## | 0.600 | 0.100 | | ## ---------------|-----------|-----------|-----------| ## Bad | 47 | 43 | 90 | ## | 0.157 | 0.143 | | ## ---------------|-----------|-----------|-----------| ## Column Total | 227 | 73 | 300 | ## ---------------|-----------|-----------|-----------| ## ##
Error rate (on test set) in now down to 25.6%. Only a minor improvement is achieved with boosting.
Costmatrix
Let’s implement a cost matrix so that not all errors are treated as equal but implements asymmetric costs for specific errors. First, we need to define a cost matrix and specify the dimension. We need a 2×2 cost matrix since the predicted and actual values form a 2×2 matrix.
# Make a cost matrix matrix_dimensions <- list(c("Bad", "Good"), c("Bad", "Good")) names(matrix_dimensions) <- c("predicted", "actual") # Assing a costmatrix to error_cost error_cost <- matrix(c(0, 4, 1, 0), nrow = 2, dimnames = matrix_dimensions) # Use the costmatrix and train the model again credit_cost <- C5.0(training[-1], training$Creditability, costs = error_cost, trails = 10)
Take a look at the confusion matrix:
credit_cost_pred <- predict.C5.0(credit_cost, testing, type = "class") CrossTable(testing$Creditability, credit_cost_pred, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual default', 'predicted default')) ## ## ## Cell Contents ## |-------------------------| ## | N | ## | N / Table Total | ## |-------------------------| ## ## ## Total Observations in Table: 300 ## ## ## | predicted default ## actual default | Good | Bad | Row Total | ## ---------------|-----------|-----------|-----------| ## Good | 121 | 89 | 210 | ## | 0.403 | 0.297 | | ## ---------------|-----------|-----------|-----------| ## Bad | 24 | 66 | 90 | ## | 0.080 | 0.220 | | ## ---------------|-----------|-----------|-----------| ## Column Total | 145 | 155 | 300 | ## ---------------|-----------|-----------|-----------| ## ## postResample(credit_cost_pred, testing$Creditability) ## Accuracy Kappa ## 0.6233333 0.2565789
This model has lower accuracy (62.3%) with more mistakes. However, the type of errors are very different. This time it will try to reduce the costly predictions (actual bad creditability but predicted as good).
To-do
- try other methods like
randomForest
, explore thecaret
function - try other meta-algorithms like bagging, cross-validation
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.