[This article was first published on R snippets, and kindly contributed to R-bloggers].
Recently I had a discussion with a student about the variability of results obtained from the cross-validation procedure. While the subject is well known, there are not many examples on the web showing it, so I have written a simple presentation of it.
Cross-validation results are reported by default by the rpart procedure (via printcp and plotcp) and are used to select the optimal cp for tree pruning. Many people I have talked to think that, because the same tree is obtained each time rpart is run on the same data set, the printcp and plotcp results do not change either. However, it should be remembered that the x-val relative error they return is based on random sampling and is not constant. Therefore two runs of rpart might indicate different optimal values of cp.
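As a quick check of this claim, here is a minimal sketch that fits the same model twice under different seeds. It assumes the kyphosis data bundled with rpart as a stand-in (it is not the data used in this post), so the numbers are only illustrative.

```r
library(rpart)

# kyphosis ships with rpart; it is an assumed stand-in data set
fit_kyphosis <- function(seed) {
  set.seed(seed)
  rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
}

fit1 <- fit_kyphosis(1)
fit2 <- fit_kyphosis(2)

# The fitted tree itself does not depend on the seed
same_tree <- isTRUE(all.equal(fit1$frame, fit2$frame))

# CP, nsplit and rel error come from the full-data fit, so they agree too
fixed_cols <- c("CP", "nsplit", "rel error")
same_fixed <- isTRUE(all.equal(fit1$cptable[, fixed_cols],
                               fit2$cptable[, fixed_cols]))

# xerror is computed on randomly drawn folds, so it may differ between runs
print(cbind(seed1 = fit1$cptable[, "xerror"],
            seed2 = fit2$cptable[, "xerror"]))

# Re-using a seed reproduces the cross-validation columns exactly
reproducible <- identical(fit1$cptable, fit_kyphosis(1)$cptable)
print(c(same_tree = same_tree, same_fixed = same_fixed,
        reproducible = reproducible))
```

Fixing the seed before calling rpart is therefore enough to make printcp and plotcp output repeatable, although it does not remove the underlying sampling variability.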
Here is the code that illustrates this variability using the Participation data from the Ecdat package:
library(Ecdat)
library(rpart)
data(Participation)
set.seed(1)
# x-val relative error (column 4 of cptable) from 8192 repeated fits
xerror <- t(replicate(8192,
    rpart(lfp ~ ., data = Participation)$cptable[, 4]))
# tree size = number of splits (column 2 of cptable) + 1
tree.size <- factor(
    rpart(lfp ~ ., data = Participation)$cptable[, 2] + 1)
colnames(xerror) <- tree.size
par(mfrow = c(1, 2))
boxplot(xerror, xlab = "size of tree",
    ylab = "X-val Relative Error")
# how often each tree size attains the minimal x-val relative error
plot(tree.size[apply(xerror, 1, which.min)], xlab = "size of tree",
    ylab = "# minimal")
The resulting plot is the following:
We can see that, using the x-val criterion, a tree of size 5 is selected in around 2/3 of cases, and size 6 is found best otherwise.
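The share of runs in which each tree size wins can also be tabulated directly rather than read off the plot. The following is a small sketch under stated assumptions: it uses the kyphosis data bundled with rpart instead of Participation, and only 200 runs instead of 8192, so its output will not match the figures quoted above.

```r
library(rpart)

set.seed(1)
n_runs <- 200  # assumed small number of runs to keep the sketch fast

# kyphosis is an assumed stand-in data set shipped with rpart
fit_once <- function() rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# One row of x-val relative errors per run, one column per row of cptable
xerror <- t(replicate(n_runs, fit_once()$cptable[, "xerror"]))
tree_size <- unname(fit_once()$cptable[, "nsplit"]) + 1

# which.min returns the first minimum, so ties go to the smaller tree
best <- factor(tree_size[apply(xerror, 1, which.min)], levels = tree_size)

# Proportion of runs in which each tree size has the minimal xerror
share <- prop.table(table(best))
print(round(share, 3))
```

Keeping all tree sizes as factor levels means sizes that never win still appear in the table with a share of zero.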
The other issue is why there is no variability of x-val for tree size 1 and almost no variability at size 2. A tree of size 1 is the root node alone, with no split, so as long as the majority class is the same in every fold its x-val relative error does not change. For size 2, the single split in every cross-validation fold is made on a nominal variable (foreign) at the same cut-point, so all resulting trees are identical (the one outlier for tree size 2 is due to a single different split). For tree sizes 5 and 6, continuous variables enter the tree (age and lnnlinc) and cut-points start moving, so the resulting trees differ between cross-validation runs.
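The contrast between a split on a nominal variable and a split on a continuous one can be checked by hand. The sketch below is only an illustration on hypothetical simulated data (the names sim, g, x and y are invented here and are not part of Participation), with folds assigned manually rather than taken from rpart's internal cross-validation.

```r
library(rpart)

set.seed(1)
n <- 400
# Hypothetical data: g is a strong nominal predictor, x a weaker continuous one
g <- factor(sample(c("a", "b"), n, replace = TRUE))
x <- runif(n)
p_yes <- ifelse(g == "a", 0.75, 0.05) + 0.2 * x
y <- factor(ifelse(runif(n) < p_yes, "yes", "no"))
sim <- data.frame(y, g, x)

# Manual 10-fold assignment, in the spirit of rpart's default xval = 10
fold <- sample(rep(1:10, length.out = n))

# Root split of a one-split tree fitted with fold k held out
root_split <- function(formula, k) {
  fit <- rpart(formula, data = sim[fold != k, ], cp = 0, maxdepth = 1)
  cut <- if (is.null(fit$splits)) NA_real_ else unname(fit$splits[1, "index"])
  list(var = as.character(fit$frame$var[1]), cut = cut)
}

# With both predictors available, check which variable is chosen in each fold
root_var <- sapply(1:10, function(k) root_split(y ~ g + x, k)$var)
print(table(root_var))

# With only the continuous predictor, the cut-point can move between folds
x_cut <- sapply(1:10, function(k) root_split(y ~ x, k)$cut)
print(round(x_cut, 3))
```

A two-level factor admits only one possible partition, so once it is chosen the fold trees agree, whereas a continuous variable offers many candidate cut-points that depend on which observations are held out.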