Ensemble, Part 2 (Bootstrap Aggregation)

In Part 1 I built a classification tree with the “party” package.  I will now use “ipred” to model the same data with a bagging (bootstrap aggregation) algorithm.
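(If you are joining here without Part 1, a minimal sketch of the setup might look like the following. The choice of the Wisconsin biopsy data from MASS, the seed, and the 70/30 split are my assumptions for illustration, not the original code.)

> # hypothetical reconstruction of the Part 1 setup; the dataset,
> # seed, and 70/30 split are assumptions, not the original code
> library(MASS)
> data(biopsy)
> biopsy = biopsy[complete.cases(biopsy), -1]  # drop incomplete rows and the ID column
> set.seed(123)                                # arbitrary seed for reproducibility
> sub = sample(nrow(biopsy), floor(0.7 * nrow(biopsy)))
> train = biopsy[sub, ]
> test = biopsy[-sub, ]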

> library(ipred)
> # fit a bagged ensemble of classification trees; coob=T requests
> # the out-of-bag estimate of the misclassification error
> train_bag = bagging(class ~ ., data=train, coob=T)
> train_bag

Bagging classification trees with 25 bootstrap replications

Call: bagging.data.frame(formula = class ~ ., data = train, coob = T)

Out-of-bag estimate of misclassification error:  0.0424

> table(predict(train_bag), train$class)
           
            benign malignant
  benign       290         9
  malignant     11       162
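By default bagging() grows 25 bootstrap replications, which is what the model summary above reports. The nbagg argument controls that number; as a quick sketch (100 is an arbitrary choice here, not a tuned value), more replications generally stabilize the out-of-bag estimate:

> # nbagg sets the number of bootstrap replications (default 25);
> # 100 is an arbitrary illustration, not a tuned value
> train_bag100 = bagging(class ~ ., data=train, nbagg=100, coob=T)
> train_bag100$err   # out-of-bag misclassification error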

> testbag = predict(train_bag, newdata=test)
> table(testbag, test$class)
         
testbag     benign malignant
  benign       137         1
  malignant      6        67

If you compare these confusion matrices with the ones from the prior post, what do you think?
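To make that comparison concrete, the overall test-set error rate can be read straight off the table above; the same arithmetic applies to the matrix from the prior post:

> # overall misclassification rate: off-diagonal counts over the total;
> # from the table above, (6 + 1) / 211, roughly 0.033
> cm = table(testbag, test$class)
> 1 - sum(diag(cm)) / sum(cm)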

Let’s recall the prior ROC curve and combine it with the bagged model.

# prepare the bagged model for the curve
> library(ROCR)   # prediction() and performance() are from ROCR
> test.bagprob = predict(train_bag, type = "prob", newdata = test)
> bagpred = prediction(test.bagprob[,2], test$class)
> bagperf = performance(bagpred, "tpr", "fpr")

> plot(perf, main="ROC", colorize=T)   # "perf" is the ctree curve from Part 1
> plot(bagperf, col=2, add=TRUE)       # overlay the bagged model in red
> plot(perf, col=1, add=TRUE)          # redraw the ctree curve in black
> legend(0.6, 0.6, c("ctree", "bagging"), col=1:2, lty=1)

[Figure: ROC curves for the ctree model (black) and the bagged model (red)]
As we saw from glancing at the confusion matrices, the bagged model outperforms the standard tree model.  Finally, let’s have a look at the AUC: .992 with bagging versus .985 last time around.

> auc.curve = performance(bagpred, "auc")
> auc.curve
An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.9918244


Slot "alpha.values":
list()
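All we really want from that object is the number in the y.values slot, which can be pulled out directly:

> # extract just the AUC value from the S4 performance object
> unlist(performance(bagpred, "auc")@y.values)
[1] 0.9918244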

OK, more iterations to come: boosting, random forest, and, since no self-respecting data scientist would leave it out, logistic regression.

Cheers.
