Ensemble, Part2 (Bootstrap Aggregation)
Part 1 consisted of building a classification tree with the “party” package. I will now use “ipred” to examine the same data with a bagging (bootstrap aggregation) algorithm.
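The train and test objects below are the ones built in Part 1. For anyone jumping in at this post, here is a minimal sketch of that setup, assuming the Wisconsin breast cancer biopsy data from MASS and a roughly 70/30 split (the seed and exact proportion are my assumptions, not necessarily what Part 1 used):
> library(MASS)
> data(biopsy)                               # Wisconsin breast cancer biopsy data
> bio = na.omit(biopsy[, -1])                # drop the ID column and rows with missing values
> set.seed(123)                              # assumed seed, just for reproducibility
> idx = sample(nrow(bio), round(0.7 * nrow(bio)))
> train = bio[idx, ]
> test = bio[-idx, ]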
> library(ipred)
> train_bag = bagging(class ~ ., data=train, coob=T)
> train_bag
Bagging classification trees with 25 bootstrap replications
Call: bagging.data.frame(formula = class ~ ., data = train, coob = T)
Out-of-bag estimate of misclassification error: 0.0424
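The 25 bootstrap replications reported above are ipred's default; the nbagg argument controls how many trees go into the ensemble if you want more (the value 50 below is just an illustration, not something from the original fit):
> train_bag50 = bagging(class ~ ., data=train, coob=T, nbagg=50)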
> table(predict(train_bag), train$class)
            benign malignant
  benign       290         9
  malignant     11       162
> testbag = predict(train_bag, newdata=test)
> table(testbag, test$class)
testbag     benign malignant
  benign       137         1
  malignant      6        67
If you compare the confusion matrices from this week to the prior post, what do you think?
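One quick way to put a number on it is to turn the test-set confusion matrix into an accuracy and misclassification rate (the counts above work out to roughly 96.7% accuracy):
> conf = table(testbag, test$class)
> sum(diag(conf)) / sum(conf)       # test-set accuracy, (137 + 67) / 211
> 1 - sum(diag(conf)) / sum(conf)   # test-set misclassification rate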
Let’s recall the prior ROC curve and combine it with the bagged model.
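The perf object below is the ctree ROC curve carried over from Part 1, and prediction() and performance() come from the ROCR package. A minimal sketch of how it could be rebuilt, assuming the Part 1 model is a party::ctree fit named train_ctree (that name is my assumption):
> library(ROCR)
#party's predict(..., type = "prob") returns a list of class-probability vectors
> tree.prob = sapply(predict(train_ctree, newdata = test, type = "prob"), "[", 2)
> pred = prediction(tree.prob, test$class)
> perf = performance(pred, "tpr", "fpr")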
#prepare bagged model for curve
> test.bagprob = predict(train_bag, type = "prob", newdata = test)
> bagpred = prediction(test.bagprob[,2], test$class)
> bagperf = performance(bagpred, "tpr", "fpr")
> plot(perf, main="ROC", colorize=T)
> plot(bagperf, col=2, add=TRUE)
> plot(perf, col=1, add=TRUE)
> legend(0.6, 0.6, c('ctree', 'bagging'), 1:2)
As we could see from glancing at the confusion matrices, the bagged model outperforms the standard tree model. Finally, let's have a look at the AUC (.992 with bagging versus .985 last time around).
> auc.curve = performance(bagpred, "auc")
> auc.curve
An object of class "performance"
Slot "x.name":
[1] "None"
Slot "y.name":
[1] "Area under the ROC curve"
Slot "alpha.name":
[1] "none"
Slot "x.values":
list()
Slot "y.values":
[[1]]
[1] 0.9918244
Slot "alpha.values":
list()
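If all you want is the scalar AUC rather than the whole S4 printout, you can pull it straight out of the y.values slot (a small convenience, not part of the original output):
> unlist(auc.curve@y.values)
[1] 0.9918244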
OK, more iterations to come: boosting, random forest, and of course no self-respecting data scientist would leave out logistic regression.
Cheers.