A couple of days ago, in our lab session, we discussed random forests and, since the session was based on the example in ISLR, we had a quick discussion about the random selection of features at each split, and the “\(m=\sqrt{p}\)” rule.
Interestingly, we can play with that rule a bit: try all possible values of \(m\), and repeat the experiment on different train/test splits,
library(randomForest)
library(ISLR2)
set.seed(123)
# for one 70/30 train/test split, compute the test MSE of a random forest
# for each value of mtry (the number of features considered at each split)
sim = function(t){
  train = sample(nrow(Boston), size = nrow(Boston)*.7)
  subsim = function(i){
    rf.boston <- randomForest(medv ~ ., data = Boston,
                              subset = train, mtry = i)
    yhat.rf <- predict(rf.boston, newdata = Boston[-train, ])
    mean((yhat.rf - Boston[-train, "medv"])^2)
  }
  Vectorize(subsim)(2:12)
}
# repeat the experiment over 499 random train/test splits
M = Vectorize(sim)(1:499)
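Each call to sim draws a fresh 70/30 split, so M should be an 11×499 matrix of test MSEs, with row \(i\) corresponding to \(m=i+1\) and each column to one split (a quick sanity check, nothing more),
dim(M)
# [1]  11 499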
and now we can plot the MSE on the test dataset as a function of \(m\), the number of features considered at each node,
boxplot(t(M), names = 2:12)
or more clearly
vm = apply(M, 1, mean)   # average test MSE over the 499 splits, for each value of m
plot(2:12, vm, type = "b", pch = 19, ylim = c(10.5, 15))
abline(v = sqrt(12), col = "red")   # the m = sqrt(p) rule, with p = 12 predictors
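To read the numbers off directly (using the vector vm computed above, where entry \(i\) corresponds to \(m=i+1\)), we can look at which value of \(m\) minimizes the average test MSE, and compare it with bagging (\(m=p=12\)) and with the \(m=\sqrt{p}\) rule, for instance
m_grid = 2:12
m_grid[which.min(vm)]           # value of m with the smallest average test MSE
vm[m_grid == 12]                # average test MSE for bagging (m = p = 12)
vm[m_grid == round(sqrt(12))]   # average test MSE close to the m = sqrt(p) rule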
Even if, here, the “\(m=\sqrt{p}\)” rule might not be optimal, we can see that using a random forest instead of a bagging strategy, i.e. taking \(m<p\) rather than \(m=p\), could improve predictions (and not only make the code run faster).
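For completeness, recall that bagging is just the special case where all \(p=12\) predictors are tried at each split; a minimal sketch of a single bagging fit, on one split and with the same kind of code as above, would be
# bagging: mtry set to the number of predictors (p = 12 in the ISLR2 Boston data)
train = sample(nrow(Boston), size = nrow(Boston) * .7)
bag.boston = randomForest(medv ~ ., data = Boston,
                          subset = train, mtry = ncol(Boston) - 1)
yhat.bag = predict(bag.boston, newdata = Boston[-train, ])
mean((yhat.bag - Boston[-train, "medv"])^2)   # test MSE for bagging on this split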