rstanmulticore: A cross-platform R package to automatically run RStan MCMC chains in parallel
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
*** This work has been supported by a grant from the Spencer Foundation (#201400002). The views expressed are those of the author and do not necessarily reflect those of the Spencer Foundation. ***
It seems that the heir to WinBUGS is Stan. With Stan, reasonably complex Bayesian models can be expressed in a compact way and easily estimated. It is good software, and it is under active development to further improve it.
I have a small quibble about RStan, the R interface to Stan. RStan would be much improved if its default behavior was to run one MCMC chain per core. For software that prides itself on speed – Stan goes to the trouble translating the Stan modeling specification into a stand-alone C++ program for execution – it seems a little odd that the extra cores present on nearly all modern machines would not be put to use by default.
Currently, running chains in parallel is possible, but only with platform dependent boilerplate code. For example, the RStan Quick Start Guide gives an mclapply
example for Mac and Linux users and a parLapply
example for Windows users. The boilerplate nature of the code makes it cumbersome to fit models several times, and the platform dependent nature of the examples makes it difficult to share code between platforms.
To address this issue, I have implemented the boilerplate code from the Quick Start Guide in a cross-platform R package: rstanmulticore
. The syntax is easy. Simply replace a call to stan
fit.serial <- stan( model_code = schools_code, data = schools_dat, iter = 1000, chains = 4)
with a call to pstan
fit.parallel <- pstan(model_code = schools_code, data = schools_dat, iter = 1000, chains = 4)
The pstan
version will compile the model, distribute the compiled models to separate cores for a parallel run, and then recombine the results as if the code had execute serially. Since I used parLapply
to distribute the work to multiple cores, pstan
will run on Windows, Linux, and Mac.
At least as far as I have pushed pstan
, it works well for me. Your needs may differ. I would appreciate feedback and suggestions on how to improve it. You can access it via GitHub here. Installation instructions and a brief usage example are below.
Installation from GitHub
Step 0.A : If you do not already have rstan
installed, install it using the instructions here.
Step 0.B: If you do not already have devtools
installed, install it using the instructions here.
Step 1: Install rstanmulticore
directly from my GitHub repository using install_github('nathanvan/rstanmulticore')
.
> library(devtools) > install_github('nathanvan/rstanmulticore') Downloading github repo nathanvan/rstanmulticore@master Installing rstanmulticore "C:/PROGRA~1/R/R-31~1.3/bin/x64/R" --vanilla CMD INSTALL "C:/Users/vanhoudnos-nathan/AppData/Local/Temp/RtmpQBcRKa/devtools924351029d0/nathanvan-rstanmulticore-c7f9d4e" --library="C:/Users/vanhoudnos-nathan/Documents/R/win-library/3.1" --install-tests * installing *source* package 'rstanmulticore' ... ** R ** tests ** preparing package for lazy loading ** help *** installing help indices ** building package indices ** testing if installed package can be loaded *** arch - i386 *** arch - x64 * DONE (rstanmulticore)
A usage example
We begin with the default "Eight Schools" example from the Quick Start Guide using the default stan
function:
library(stan) ## Loading required package: Rcpp ## Loading required package: inline ## ## Attaching package: 'inline' ## ## The following object is masked from 'package:Rcpp': ## ## registerPlugin ## ## rstan (Version 2.6.0, packaged: 2015-02-06 21:02:34 UTC, GitRev: 198082f07a60) ## The data to analyze (Yes, it is very little!) schools_dat <- list( J = 8, y = c(28, 8, -3, 7, -1, 1, 18, 12), sigma = c(15, 10, 16, 11, 9, 11, 10, 18)) ## The Stan model for the data, stored as a string schools_code <- 'data { int J; // number of schools real y[J]; // estimated treatment effects real sigma[J]; // s.e. of effect estimates } parameters { real mu; real tau; real eta[J]; } transformed parameters { real theta[J]; for (j in 1:J) theta[j] <- mu + tau * eta[j]; } model { eta ~ normal(0, 1); y ~ normal(theta, sigma); }' ## The data to analyze (Yes, it is very little!) schools_dat <- list( J = 8, y = c(28, 8, -3, 7, -1, 1, 18, 12), sigma = c(15, 10, 16, 11, 9, 11, 10, 18)) ## The Stan model for the data, stored as a string schools_code <- 'data { int J; // number of schools real y[J]; // estimated treatment effects real sigma[J]; // s.e. of effect estimates } parameters { real mu; real tau; real eta[J]; } transformed parameters { real theta[J]; for (j in 1:J) theta[j] <- mu + tau * eta[j]; } model { eta ~ normal(0, 1); y ~ normal(theta, sigma); }' ## Estimating the model fit.serial <- stan( model_code = schools_code, data = schools_dat, iter = 1000, chains = 4, seed = 1) ## ## TRANSLATING MODEL 'schools_code' FROM Stan CODE TO C++ CODE NOW. ## COMPILING THE C++ CODE FOR MODEL 'schools_code' NOW. ## cygwin warning: ## MS-DOS style path detected: C:/PROGRA~1/R/R-31~1.3/etc/x64/Makeconf ## Preferred POSIX equivalent is: /cygdrive/c/PROGRA~1/R/R-31~1.3/etc/x64/Makeconf ## CYGWIN environment variable option "nodosfilewarning" turns off this warning. ## Consult the user's guide for more details about POSIX paths: ## http://cygwin.com/cygwin-ug-net/using.html#using-pathnames ## ## SAMPLING FOR MODEL 'schools_code' NOW (CHAIN 1). ## ## ... < snip > ... ## ## SAMPLING FOR MODEL 'schools_code' NOW (CHAIN 2). ## ## ... < snip > ... ## ## SAMPLING FOR MODEL 'schools_code' NOW (CHAIN 3). ## ## ... < snip > ... ## ## SAMPLING FOR MODEL 'schools_code' NOW (CHAIN 4). ## ## ... < snip > ...
Note that stan
is pretty verbose.
I chose to make pstan
less verbose. By default, pstan
reports sparse progress information to the R console and the more detailed information is redirected to a file, stan-debug-*
, that is created in the current working directory. (If you wish to see the detailed info in real time, use Use tail -f
in your shell.)
Usage is as follows:
library(rstanmulticore) ## Loading required package: parallel fit.parallel <- pstan( model_code = schools_code, data = schools_dat, iter = 1000, chains = 4, seed = 1) ## *** Parallel Stan run *** ## Working directory: ## C:/Users/vanhoudnos-nathan/workspace/norc/spencer-5866.01.62/software/tmp ## + Compiling the Stan model. ## + Attempting 4 chains on 4 cores. ## ... Creating the cluster. ## ... Log file: stan-debug.2015-05-01-12.38.21.txt ## ... Loading rstan on all workers. ## ... Exporting the fitted model and data to all workers. ## ... Running parallel chains. ## ... Finished!
If, in the unlikely case, you want no console output and no file redirection, you can pass pdebug = FALSE
to pstan
. See help(pstan)
for details.
Note that, as promised, the output -- the actually samples drawn from the posterior -- of pstan
is identical to that of stan
all.equal( fit.serial@sim$samples, fit.parallel@sim$samples ) ## [1] TRUE
A request for help
As mentioned, rstanmulticore
works well for my needs, but it may not work for you. If it does not work for you, please let me know and I'll do my best to accommodate you. Pull requests and additional test cases are most welcome!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.