
The m=√p rule for random forests


A couple of days ago, in our lab session, we discussed random forests and, since the session was based on the example in ISLR, we had a quick discussion about the random choice of features at each split, and the “\(m=\sqrt{p}\)” rule.
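As a side note, the \(\sqrt{p}\) heuristic is usually stated for classification; for regression, the randomForest package actually defaults to \(m=p/3\). With the \(p=12\) predictors of the Boston dataset, the two heuristics give (a quick sanity check, not part of the lab session itself)

p = 12                 # number of predictors in Boston (medv excluded)
floor(sqrt(p))         # 3, the classification heuristic
max(floor(p/3), 1)     # 4, randomForest's default mtry for regression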

Interestingly, on that one, we can play a bit: try every possible value of \(m\), and repeat the whole experiment on many different train/test splits,

library(randomForest)
library(ISLR2)
set.seed(123)

sim = function(t){
  ## draw a new 70/30 train/test split for each replication
  train = sample(nrow(Boston), size = nrow(Boston)*.7)
  subsim = function(i){
    ## grow a forest trying mtry = i features at each split
    rf.boston <- randomForest(medv ~ ., data = Boston,
                              subset = train, mtry = i)
    yhat.rf <- predict(rf.boston, newdata = Boston[-train, ])
    ## MSE on the test set
    mean((yhat.rf - Boston[-train, "medv"])^2)
  }
  Vectorize(subsim)(2:12)
}
## 499 replications of the experiment
M = Vectorize(sim)(1:499)
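Since subsim returns one test MSE for each candidate value of \(m\), M is an 11×499 matrix, with one row per value of mtry and one column per replication. A small bookkeeping step (not in the original code) makes the rows self-describing,

rownames(M) = paste0("m=", 2:12)   # label rows by the value of mtry
dim(M)                             # 11 rows (values of m), 499 columns (splits)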

and now we can plot the MSE on the test dataset as a function of \(m\), the number of features randomly selected at each split,

## label the boxes with the actual values of mtry (2 to 12, not 1 to 11)
boxplot(t(M), names = 2:12, xlab = "m", ylab = "test MSE")

or, more clearly, looking at the average test MSE for each value of \(m\),

vm = apply(M, 1, mean)   # average test MSE over the 499 replications
plot(2:12, vm, type = "b", pch = 19, ylim = c(10.5, 15),
     xlab = "m", ylab = "average test MSE")
abline(v = sqrt(12), col = "red")   # the m = sqrt(p) rule, with p = 12
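And if we want to read off the empirically best \(m\) from this particular run (the winner will of course vary with the seed and the number of replications),

m.grid = 2:12
m.grid[which.min(vm)]   # value of m with the smallest average test MSE
sqrt(12)                # the sqrt(p) heuristic, about 3.46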

Even if the “\(m=\sqrt{p}\)” rule might not be optimal here, we can see that using a random forest, i.e. “\(m<p\)”, instead of a bagging strategy, i.e. “\(m=p\)”, can improve predictions (and not only make the code run faster).
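As a last sanity check, we can compare bagging, i.e. mtry = 12, with the package’s default forest on a single split (a minimal sketch; the exact MSEs will differ from run to run),

train = sample(nrow(Boston), size = nrow(Boston)*.7)
bag = randomForest(medv ~ ., data = Boston, subset = train, mtry = 12)  # bagging: m = p
rf  = randomForest(medv ~ ., data = Boston, subset = train)             # default: m = floor(p/3) = 4
mean((predict(bag, Boston[-train, ]) - Boston[-train, "medv"])^2)
mean((predict(rf,  Boston[-train, ]) - Boston[-train, "medv"])^2)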
