Stability of classification trees
[This article was first published on R snippets, and kindly contributed to R-bloggers.]
Classification trees are known to be unstable with respect to the training data. I recently read an article on the stability of classification trees by Briand et al. (2009), which proposes a quantitative similarity measure between two trees. The method is interesting, and it inspired me to prepare a simple test-data-based example showing the instability of classification trees.
I compare the stability of logistic regression and a classification tree on the Participation data set from the Ecdat package. The method works as follows:
1. Divide the data into a training and a test set;
2. Draw a random subset of the training data and fit a logistic regression and a classification tree on it;
3. Apply both models to the test data to obtain predicted probabilities;
4. Repeat steps 2 and 3 many times;
5. For each observation in the test set, calculate the standard deviation of the obtained predictions for each class of models;
6. For both models, plot a kernel density estimate of the distribution of standard deviations over the test set.
library(party)  # ctree(), treeresponse()
library(Ecdat)  # Participation data set

data(Participation)
set.seed(1)

# Split the data into a test set (300 observations) and a training set
shuffle <- Participation[sample(nrow(Participation)), ]
test <- shuffle[1:300, ]
train <- shuffle[301:nrow(Participation), ]

reps <- 1000
p.tree <- p.log <- vector("list", reps)
for (i in 1:reps) {
    # Fit both models on a random subset of the training data
    train.sub <- train[sample(nrow(train))[1:300], ]
    mtree <- ctree(lfp ~ ., data = train.sub)
    mlog <- glm(lfp ~ ., data = train.sub, family = binomial)
    # Store predicted probabilities for the test set
    p.tree[[i]] <- sapply(treeresponse(mtree, newdata = test),
                          function(x) { x[2] })
    p.log[[i]] <- predict(mlog, newdata = test, type = "response")
}

# Per-observation standard deviation of predictions across replications
plot(density(apply(do.call(rbind, p.log), 2, sd)),
     main = "", xlab = "sd")
lines(density(apply(do.call(rbind, p.tree), 2, sd)), col = "red")
legend("topright", legend = c("logistic", "tree"),
       col = c("black", "red"), lty = 1)
And here is the generated comparison. As the plot clearly shows, logistic regression gives much more stable predictions than the classification tree.
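To complement the visual comparison with numbers, the two standard-deviation distributions can also be summarised directly. This is a small sketch, assuming the p.log and p.tree lists produced by the loop above are still in the workspace:

```r
# Per-observation sd of predictions across the 1000 replications
sd.log <- apply(do.call(rbind, p.log), 2, sd)
sd.tree <- apply(do.call(rbind, p.tree), 2, sd)

# Five-number summaries; the tree's sds should sit well above the
# logistic regression's if the plot is to be believed
summary(sd.log)
summary(sd.tree)
```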