Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this post we’ll fit some predicitve models in (well know) data bases, and evalute the performance of each model. Disclaimer1: for simplicity the predictive variables are treating without apply any transformation to get a better performance or stability, etc. We’ll use two datas to evaluate the performances of the models. Both data have categorical and continous variables and we’ll use 50-50 split to have a train and test data. The datas are:
- German Credit: The German Credit data frame has 1000 rows and 8 columns. This are data for clients of a south german bank, 700 good payers and 300 bad payers. They are used to construct a credit scoring method. This data have 20 predictive variables and 1000 observations and have a bad rate of 30%. So, after the select subsets to fit the models the distributions in the data are:
## sample ## response test train ## bad 14.9% 15.1% ## good 38.1% 31.9%
- Bankloan Binning: This is a hypothetical data file containing financial and demographic information on past customers. Here is the source. This data have 8 predictive variables and 5000 observations and have a bad rate of 25.1%.
## sample ## response test train ## bad 12.7% 12.4% ## good 37.3% 37.6%
The models to compare are logistic, conditional inference trees (party package), single-hidden-layer neural network (nnet pakage) and linear discriminant analysis. To evalute the performance there are some indicators like KS statistic, Area under ROC curve among others. If you are not familiar with this terms check this link. Now, let’s go with the results.
German Credit Data
## SCORE SAMPLE BR KS AUCROC Gain10 Gain20 Gain30 Gain40 Gain50 ## 2 SCORE_Logistic train 0.32 0.54 0.83 0.25 0.48 0.64 0.74 0.82 ## 4 SCORE_CTree train 0.32 0.40 0.76 0.53 0.53 0.61 0.87 0.87 ## 6 SCORE_SLNNET train 0.32 0.64 0.90 0.30 0.54 0.73 0.82 0.89 ## 8 SCORE_LDA train 0.32 0.54 0.83 0.24 0.47 0.63 0.76 0.82 ## 1 SCORE_Logistic test 0.28 0.46 0.78 0.23 0.43 0.58 0.72 0.82 ## 3 SCORE_CTree test 0.28 0.37 0.74 0.50 0.50 0.57 0.87 0.87 ## 5 SCORE_SLNNET test 0.28 0.42 0.77 0.21 0.40 0.58 0.68 0.79 ## 7 SCORE_LDA test 0.28 0.47 0.79 0.23 0.46 0.57 0.73 0.82 daux <- subset(data1, SAMPLE == "test") daux_roc <- ldply(str_pattern(names(daux),"SCORE"), function(score){ perf <- performance(prediction(daux[[score]], daux$GB), "tpr","fpr") df <- data.frame(x = unlist(perf@"x.values") , y = unlist(perf@"y.values")) df$score <- score df }) ggplot(daux_roc) + geom_line(aes(x,y, color = score), size = 1.2) + scale_color_manual('',values=brewer.pal(length(unique(daux_roc$score)), "RdBu")) + geom_path(data=data.frame(x = c(0,1), y = c(0,1)), aes(x,y), colour = "gray", size = 1) + scale_x_continuous("False Positive Rate (1 - Specificity)", labels = percent_format(), limits = c(0, 1)) + scale_y_continuous("True Positive Rate (Sensivity or Recall)", labels = percent_format(), limits = c(0, 1)) + theme(legend.position = "top") + ggtitle("ROC Curves for German Credit Data (validation)")
Now we can plot the distributions of good/bads in each model. We’ll transform the data whith melt function and then plot faceting by score.
daux <- subset(data1, SAMPLE == "test", select = c("GB", "SCORE_Logistic", "SCORE_CTree","SCORE_SLNNET","SCORE_LDA")) daux <- melt(daux, id = "GB") ggplot(daux, aes(x=value, fill = factor(GB))) + geom_density(alpha = 0.6, size = .75) + facet_wrap(~variable, ncol=2) + scale_fill_manual(values = brewer.pal(3, "Dark2")) + theme(legend.position = "none", axis.ticks = element_blank(), axis.text = element_blank(), axis.title = element_blank(), plot.margin = unit(rep(0.5, 4), "lines"), title = element_text(size = 9))
Bankloan Binning data
## SCORE SAMPLE BR KS AUCROC Gain10 Gain20 Gain30 Gain40 Gain50 ## 2 SCORE_Logistic train 0.25 0.54 0.84 0.31 0.51 0.67 0.79 0.89 ## 4 SCORE_CTree train 0.25 0.52 0.84 0.34 0.50 0.82 0.82 0.90 ## 6 SCORE_SLNNET train 0.25 0.56 0.86 0.32 0.52 0.68 0.82 0.89 ## 8 SCORE_LDA train 0.25 0.52 0.84 0.31 0.51 0.66 0.78 0.88 ## 1 SCORE_Logistic test 0.25 0.53 0.84 0.31 0.51 0.68 0.78 0.87 ## 3 SCORE_CTree test 0.25 0.45 0.79 0.30 0.51 0.76 0.76 0.84 ## 5 SCORE_SLNNET test 0.25 0.51 0.83 0.30 0.51 0.66 0.77 0.85 ## 7 SCORE_LDA test 0.25 0.50 0.83 0.31 0.51 0.67 0.77 0.85
Do you want to comment about the results? If you are interesting in this topic reproduce this example. And if you have questions and/or improvements or want to know more details for the code please comment.
References
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.