Parallel execution of randomForestSRC
I guess I’m the resident expert on resampling methods at work. I’ve been using bagged predictors and random forests for a while, and have recently been using the randomForestSRC (RF-SRC) package in R (http://cran.r-project.org/web/packages/randomForestSRC). This package merges two random forest implementations: the randomForest package for regression and classification forests, and the randomSurvivalForest package for survival forests.
By default the package is installed to run on one processor. However, because random forests are embarrassingly parallelizable, a major advantage of RF-SRC is that it can easily be compiled to run on multicore machines. It does take a little tweaking to get it to work, though, and this post is intended to document that process. I assume you have R installed and have a compiler for package installation (possibly the R-dev libraries).
As Larry Wall put it, “There’s More Than One Way to Do It”, and there may well be a smoother path to get this to work. I’ll just note what I did, and am open to modifications.
First, we need to compile from source, so download the source package from CRAN at http://cran.r-project.org/web/packages/randomForestSRC and unpack it in your favorite dev directory.
For serial execution, you can either install as is
R CMD INSTALL randomForestSRC
or from within R, just use
install.packages("randomForestSRC")
For parallel code, open your terminal and run the following commands (autoconf must be installed):
cd randomForestSRC
autoconf
autoconf will create a configure file for compilation of the source code.
cd ..
R CMD INSTALL randomForestSRC
This will compile and install the code in your library. If you also want to install an alternate binary (x86_64 and i386 on Mac OS X), you will also need one of the following:
R32 CMD INSTALL --clean --libs-only randomForestSRC
or
R64 CMD INSTALL --clean --libs-only randomForestSRC
depending on which architecture your machine uses by default.
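As a rough sketch, the parallel build steps above (autoconf, then R CMD INSTALL) can be collected into one small shell script. The names SRC_DIR, DRY_RUN and build_parallel are placeholders made up for this example and are not part of the package. The script assumes the source has already been unpacked into a randomForestSRC directory, and it defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: SRC_DIR, DRY_RUN and build_parallel are hypothetical names
# for this example, not something randomForestSRC provides.
set -eu

SRC_DIR="${SRC_DIR:-randomForestSRC}"   # unpacked source directory (assumed)
DRY_RUN="${DRY_RUN:-1}"                 # 1 = print commands, 0 = run them

build_parallel() {
    if [ "$DRY_RUN" = "1" ]; then
        # Dry run: show what would be executed and stop.
        echo "(cd $SRC_DIR && autoconf)"
        echo "R CMD INSTALL $SRC_DIR"
        return 0
    fi

    # Fail early if the prerequisites are missing.
    command -v autoconf >/dev/null 2>&1 || { echo "autoconf not found" >&2; return 1; }
    [ -d "$SRC_DIR" ] || { echo "source directory $SRC_DIR not found" >&2; return 1; }

    # Generate the configure script inside the source tree (subshell keeps cwd).
    (cd "$SRC_DIR" && autoconf)

    # Compile and install from the parent directory.
    R CMD INSTALL "$SRC_DIR"
}

build_parallel
```

Run it from the directory that contains the unpacked source. Setting DRY_RUN=0 in the environment makes it actually run the build.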
At this point, you can run either architecture (R32 or R64), or simply the default R, and load the package.
library(randomForestSRC)
Then run an example like:
### Survival analysis
### Veteran data
### Randomized trial of two treatment regimens for lung cancer
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 100)
# print and plot the grow object
print(v.obj)
plot(v.obj)
And watch all the processors light up in htop.
You can also control how many processors are used, either by setting the RF_CORES environment variable or by adding
options(rf.cores = x)
to your ~/.Rprofile file, where x is the number of cores you want to use.
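For the environment variable route, here is a minimal sketch of setting RF_CORES from the shell before starting R. The pick_cores helper and the "leave one core free" policy are assumptions made up for this example. The getconf _NPROCESSORS_ONLN call is assumed to be available (it is on typical Linux and Mac OS X systems), and the script falls back to 1 if it is not.

```shell
#!/bin/sh
# Sketch only: pick_cores is a hypothetical helper, not part of randomForestSRC.
set -eu

# Leave one core free for the rest of the system, but never go below 1.
pick_cores() {
    n="$1"
    if [ "$n" -gt 1 ]; then
        echo $((n - 1))
    else
        echo 1
    fi
}

# Number of online processors; fall back to 1 if getconf cannot report it.
total="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"

# RF_CORES is the environment variable named in the post.
RF_CORES="$(pick_cores "$total")"
export RF_CORES
echo "RF_CORES=$RF_CORES"
```

The export only affects R sessions started from the same shell, so either source the script or add the export line to your shell profile.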
Happy burying your processors!