[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers].
> require('RWeka')
> require('pROC')
>
> # SEPARATE DATA INTO TRAINING AND TESTING SETS
> df1 <- read.csv('credit_count.csv')
> df2 <- df1[df1$CARDHLDR == 1, 2:12]
> set.seed(2013)
> rows <- sample(1:nrow(df2), nrow(df2) - 1000)
> set1 <- df2[rows, ]
> set2 <- df2[-rows, ]
>
> # BUILD A PART RULE MODEL
> mdl1 <- PART(factor(BAD) ~., data = set1)
> print(mdl1)
PART decision list
------------------

EXP_INC > 0.000774 AND AGE > 21.833334 AND INCOME > 2100 AND MAJORDRG <= 0 AND OWNRENT > 0 AND MINORDRG <= 1: 0 (2564.0/103.0)
AGE > 21.25 AND EXP_INC > 0.000774 AND INCPER > 17010 AND INCOME > 1774.583333 AND MINORDRG <= 0: 0 (2278.0/129.0)
AGE > 20.75 AND EXP_INC > 0.016071 AND OWNRENT > 0 AND SELFEMPL > 0 AND EXP_INC <= 0.233759 AND MINORDRG <= 1: 0 (56.0)
AGE > 20.75 AND EXP_INC > 0.016071 AND SELFEMPL <= 0 AND OWNRENT > 0: 0 (1123.0/130.0)
OWNRENT <= 0 AND AGE > 20.75 AND ACADMOS <= 20 AND ADEPCNT <= 2 AND MINORDRG > 0 AND ACADMOS <= 14: 0 (175.0/10.0)
OWNRENT <= 0 AND AGE > 20.75 AND ADEPCNT <= 0: 0 (1323.0/164.0)
INCOME > 1423 AND OWNRENT <= 0 AND MINORDRG <= 1 AND ADEPCNT > 0 AND SELFEMPL <= 0 AND MINORDRG <= 0: 0 (943.0/124.0)
SELFEMPL > 0 AND MAJORDRG <= 0 AND ACADMOS > 85: 0 (24.0)
SELFEMPL > 0 AND MAJORDRG <= 1 AND MAJORDRG <= 0 AND MINORDRG <= 0 AND INCOME > 2708.333333: 0 (17.0)
SELFEMPL > 0 AND MAJORDRG <= 1 AND OWNRENT <= 0 AND MINORDRG <= 0 AND INCPER <= 8400: 0 (13.0)
SELFEMPL <= 0 AND OWNRENT > 0 AND ADEPCNT <= 0 AND MINORDRG <= 0 AND MAJORDRG <= 0: 0 (107.0/15.0)
OWNRENT <= 0 AND MINORDRG > 0 AND MINORDRG <= 1 AND MAJORDRG <= 1 AND MAJORDRG <= 0 AND SELFEMPL <= 0: 0 (87.0/13.0)
OWNRENT <= 0 AND SELFEMPL <= 0 AND MAJORDRG <= 0 AND MINORDRG <= 1: 0 (373.0/100.0)
MAJORDRG > 0 AND MINORDRG > 0 AND MAJORDRG <= 1 AND MINORDRG <= 1: 0 (29.0)
SELFEMPL <= 0 AND OWNRENT > 0 AND MAJORDRG <= 0: 0 (199.0/57.0)
OWNRENT <= 0 AND SELFEMPL <= 0: 0 (84.0/24.0)
MAJORDRG > 1: 0 (17.0/3.0)
ACADMOS <= 34 AND MAJORDRG > 0: 0 (10.0)
MAJORDRG <= 0 AND ADEPCNT <= 2 AND OWNRENT <= 0: 0 (29.0/7.0)
OWNRENT > 0 AND SELFEMPL > 0 AND EXP_INC <= 0.218654 AND MINORDRG <= 2 AND MINORDRG <= 1: 0 (8.0/1.0)
OWNRENT > 0 AND INCOME <= 2041.666667 AND MAJORDRG > 0 AND ADEPCNT > 0: 1 (5.0)
OWNRENT > 0 AND AGE > 33.416668 AND ACADMOS <= 174 AND SELFEMPL > 0: 0 (10.0/1.0)
OWNRENT > 0 AND SELFEMPL <= 0 AND MINORDRG <= 1 AND AGE > 33.5 AND EXP_INC > 0.006737: 0 (6.0)
EXP_INC > 0.001179: 1 (16.0/1.0)
: 0 (3.0)

Number of Rules : 25

> pred1 <- data.frame(prob = predict(mdl1, newdata = set2, type = 'probability')[, 2])
> # ROC FOR TESTING SET
> print(roc1 <- roc(set2$BAD, pred1$prob))

Call:
roc.default(response = set2$BAD, predictor = pred1$prob)

Data: pred1$prob in 905 controls (set2$BAD 0) < 95 cases (set2$BAD 1).
Area under the curve: 0.6794
>
> # BUILD A LOGISTIC REGRESSION
> mdl2 <- Logistic(factor(BAD) ~., data = set1)
> print(mdl2)
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
               Class
Variable           0
====================
AGE           0.0112
ACADMOS      -0.0005
ADEPCNT      -0.0747
MAJORDRG     -0.2312
MINORDRG     -0.1991
OWNRENT       0.2244
INCOME        0.0004
SELFEMPL     -0.1206
INCPER             0
EXP_INC       0.4472
Intercept     0.7965

Odds Ratios...
               Class
Variable           0
====================
AGE           1.0113
ACADMOS       0.9995
ADEPCNT        0.928
MAJORDRG      0.7936
MINORDRG      0.8195
OWNRENT       1.2516
INCOME        1.0004
SELFEMPL      0.8864
INCPER             1
EXP_INC       1.5639

> pred2 <- data.frame(prob = predict(mdl2, newdata = set2, type = 'probability')[, 2])
> # ROC FOR TESTING SET
> print(roc2 <- roc(set2$BAD, pred2$prob))

Call:
roc.default(response = set2$BAD, predictor = pred2$prob)

Data: pred2$prob in 905 controls (set2$BAD 0) < 95 cases (set2$BAD 1).
Area under the curve: 0.6529
>
> # COMPARE TWO ROCS
> roc.test(roc1, roc2)

        DeLong's test for two correlated ROC curves

data:  roc1 and roc2
Z = 1.0344, p-value = 0.301
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
  0.6793894   0.6528875
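A few of the printed figures can be sanity-checked by hand. The sketch below assumes nothing beyond base R and the rounded numbers copied from the output above. It recomputes some of the odds ratios as exp(coefficient), derives the two-sided p-value implied by the DeLong Z statistic, and takes the difference between the two test-set AUCs. The variable names (`coefs`, `odds_ratios`, `z`, `p_value`, `auc_diff`) are illustrative and not part of the original session.

```r
# Coefficients copied (already rounded) from the Logistic() printout.
# They are reported for class 0, i.e. BAD = 0.
coefs <- c(AGE = 0.0112, MAJORDRG = -0.2312, OWNRENT = 0.2244, EXP_INC = 0.4472)

# An odds ratio is the exponential of the corresponding coefficient.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))

# Two-sided p-value implied by a standard normal Z statistic.
z <- 1.0344
p_value <- 2 * pnorm(-abs(z))
print(round(p_value, 3))

# Difference between the two AUCs reported by roc.test().
auc_diff <- 0.6793894 - 0.6528875
print(auc_diff)
```

Because the inputs are rounded, the recomputed values should agree with the printed output only up to rounding. With a p-value of about 0.3, the AUC gap of roughly 0.027 between the PART model and the logistic regression is not statistically significant on this 1,000-row test set.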