[This article was first published on Modern Tool Making, and kindly contributed to R-bloggers].
There’s been some discussion on the Kaggle forums and on a few blogs about various ways to parallelize random forests, so I thought I’d add my thoughts on the issue.
Here’s my version of the ‘parRF’ function, which is based on the elegant version in the foreach vignette:
This function works very simply: you pass it a vector of mtry values, and it fits one random forest for each of those values and returns the combined result. You can also pass any additional parameters you want (like ntree) through to the randomForest function.
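As a rough sketch of a function matching that description, here is one way it could be written, modeled on the random forest example in the foreach vignette. This is a reconstruction from the description, not the original listing: the argument names `x` and `y`, the default for `mtry`, and the small iris example are all assumptions.

```r
library(foreach)
library(randomForest)

# Sketch only: fit one forest per element of `mtry` and merge the results.
# `x`, `y`, and the default `mtry` are assumed names/values for illustration.
parRF <- function(x, y, mtry = rep(floor(sqrt(ncol(x))), 10), ...) {
  foreach(m = mtry,
          .combine = randomForest::combine,  # merge forests pairwise
          .packages = "randomForest",        # load randomForest on each worker
          .inorder = FALSE) %dopar% {        # combine forests as they finish
    randomForest(x, y, mtry = m, ...)
  }
}

# Small usage example: 4 forests of 25 trees each, merged into one.
# With no backend registered, %dopar% falls back to sequential execution
# and emits a warning.
rf <- parRF(iris[, -5], iris$Species, mtry = rep(2, 4), ntree = 25)
print(rf$ntree)
```

Anything passed through `...` reaches every individual randomForest call, so `ntree` here is the size of each sub-forest rather than of the combined forest.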
I think this function provides two improvements over previous implementations. First, you can use any parallel backend you want. doRedis is my current favorite, as it’s cross-platform and fault-tolerant and lets me commandeer idle laptops around the house/office when a random forest is taking too long to fit. Second, the argument .inorder=FALSE in the foreach function provides a small performance improvement, as it lets R combine the random forests as they finish rather than forcing R to combine them in the order they were started.
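To illustrate the backend point, here is a minimal sketch of registering a backend before a foreach loop. It uses doParallel with a two-worker local cluster as a stand-in, since doRedis needs a running Redis server; the choice of doParallel, the worker count, and the toy loop body are assumptions for illustration only.

```r
library(foreach)
library(parallel)
library(doParallel)

# Stand-in backend: a local cluster with 2 workers (an assumption; any
# registered foreach backend, such as doRedis, would be used the same way).
cl <- makeCluster(2)
registerDoParallel(cl)

# With .inorder = FALSE, results are combined as workers finish,
# so the order of `res` is not guaranteed.
res <- foreach(i = 1:4, .combine = c, .inorder = FALSE) %dopar% {
  i^2
}

stopCluster(cl)

# Sort before inspecting, because completion order may vary.
print(sort(res))
```

The loop itself never mentions the backend, which is why swapping one backend for another only changes the registration lines.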
Let’s say you want a random forest with 5000 trees using mtry=4. The default value for ntree is 500, so ten forests of 500 trees each get us there, and we use rep(4,10) as the mtry argument for the function.
Maybe we’re unsure of the optimal mtry value, and want to combine 2 ensembles of 2500 trees. Then we use the argument c(rep(3,5),rep(4,5)). This gives us 2500 trees with mtry=3 and 2500 with mtry=4. I like to think of this as a sort of meta-ensemble of decision trees, but I’ve yet to see it improve my predictive accuracy.
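The tree counts in these two examples can be checked with a few lines of arithmetic. This is only a sketch of the bookkeeping, assuming every fit uses the default of 500 trees; the variable names are made up for illustration.

```r
ntree <- 500  # randomForest's default number of trees per fit

# Ten fits with mtry = 4: 10 * 500 = 5000 trees in the combined forest
mtry_single <- rep(4, 10)
total_single <- length(mtry_single) * ntree

# Five fits each with mtry = 3 and mtry = 4: 2500 trees per setting
mtry_mixed <- c(rep(3, 5), rep(4, 5))
trees_per_mtry <- table(mtry_mixed) * ntree

print(total_single)
print(trees_per_mtry)
```

Either vector would then be handed to parRF as its mtry argument.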
At the very least, this can help with those damn ‘out of memory’ errors I’ve been getting on my laptop when fitting random forests to large datasets.