Classification on the German Credit Database

arthur charpentier

6 years ago

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In our data science course, this morning, we’ve use random forrest to improve prediction on the German Credit Dataset. The dataset is

> url="http://freakonometrics.free.fr/german_credit.csv"
> credit=read.csv(url, header = TRUE, sep = ",")

Almost all variables are treated a numeric, but actually, most of them are factors,

> str(credit)
'data.frame':	1000 obs. of  21 variables:
 $ Creditability   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Account.Balance : int  1 1 2 1 1 1 1 1 4 2 ...
 $ Duration        : int  18 9 12 12 12 10 8  ...
 $ Purpose         : int  2 0 9 0 0 0 0 0 3 3 ...

(etc). Let us convert categorical variables as factors,

> F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20)
> for(i in F) credit[,i]=as.factor(credit[,i])

Let us now create our training/calibration and validation/testing datasets, with proportion 1/3-2/3

> i_test=sample(1:nrow(credit),size=333)
> i_calibration=(1:nrow(credit))[-i_test]

The first model we can fit is a logistic regression, on selected covariates

> LogisticModel <- glm(Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit + Purpose + 
Length.of.current.employment + 
Sex...Marital.Status, family=binomial, 
data = credit[i_calibration,])

Based on that model, it is possible to draw the ROC curve, and to compute the AUC (on ne validation dataset)

> fitLog <- predict(LogisticModel,type="response",
+                   newdata=credit[i_test,])
> library(ROCR)
> pred = prediction( fitLog, credit$Creditability[i_test])
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
> AUCLog1=performance(pred, measure = "auc")@y.values[[1]]
> cat("AUC: ",AUCLog1,"n")
AUC:  0.7340997

An alternative is to consider a logistic regression on all explanatory variables

> LogisticModel <- glm(Creditability ~ ., 
+  family=binomial, 
+  data = credit[i_calibration,])

We might overfit, here, and we should observe that on the ROC curve

> fitLog <- predict(LogisticModel,type="response",
+                   newdata=credit[i_test,])
> pred = prediction( fitLog, credit$Creditability[i_test])
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
> AUCLog2=performance(pred, measure = "auc")@y.values[[1]]
> cat("AUC: ",AUCLog2,"n")
AUC:  0.7609792

There is a slight improvement here, compared with the previous model, where only five explanatory variables were considered.

Consider now some regression tree (on all covariates)

> library(rpart)
> ArbreModel <- rpart(Creditability ~ ., 
+  data = credit[i_calibration,])

We can visualize the tree using

> library(rpart.plot)
> prp(ArbreModel,type=2,extra=1)

The ROC curve for that model is

> fitArbre <- predict(ArbreModel,
+                     newdata=credit[i_test,],
+                     type="prob")[,2]
> pred = prediction( fitArbre, credit$Creditability[i_test])
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
> AUCArbre=performance(pred, measure = "auc")@y.values[[1]]
> cat("AUC: ",AUCArbre,"n")
AUC:  0.7100323

As expected, a single has a lower performance, compared with a logistic regression. And a natural idea is to grow several trees using some boostrap procedure, and then to agregate those predictions.

> library(randomForest)
> RF <- randomForest(Creditability ~ .,
+ data = credit[i_calibration,])
> fitForet <- predict(RF,
+                     newdata=credit[i_test,],
+                     type="prob")[,2]
> pred = prediction( fitForet, credit$Creditability[i_test])
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
> AUCRF=performance(pred, measure = "auc")@y.values[[1]]
> cat("AUC: ",AUCRF,"n")
AUC:  0.7682367

Here this model is (slightly) better than the logistic regression. Actually, if we create many training/validation samples, and compare the AUC, we can observe that – on average – random forests perform better than logistic regressions,

> AUC=function(i){
+   set.seed(i)
+   i_test=sample(1:nrow(credit),size=333)
+   i_calibration=(1:nrow(credit))[-i_test]
+   LogisticModel <- glm(Creditability ~ ., 
+    family=binomial, 
+    data = credit[i_calibration,])
+   summary(LogisticModel)
+   fitLog <- predict(LogisticModel,type="response",
+                     newdata=credit[i_test,])
+   library(ROCR)
+   pred = prediction( fitLog, credit$Creditability[i_test])
+   AUCLog2=performance(pred, measure = "auc")@y.values[[1]] 
+   RF <- randomForest(Creditability ~ .,
+   data = credit[i_calibration,])
+   fitForet <- predict(RF,
+                       newdata=credit[i_test,],
+                       type="prob")[,2]
+   pred = prediction( fitForet, credit$Creditability[i_test])
+   AUCRF=performance(pred, measure = "auc")@y.values[[1]]
+   return(c(AUCLog2,AUCRF))
+ }
> A=Vectorize(AUC)(1:200)
> plot(t(A))

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.