A very short and unoriginal introduction to snow
As Jian-Feng rightly pointed out in a comment on my guide to setting up snow on the OSC cluster, it was probably somewhat cavalier of me to say:
Getting snow to run properly on single machines, or even with a cluster of machines via ssh connections, is fairly trivial.
In an effort to redeem myself, I provide this very short and unoriginal introduction to using snow. But first, a caveat: to make the most of parallel processing in R, or in any other environment, the problem you are trying to solve must be amenable to being broken up into smaller, (mostly) independent pieces. In other words, the results from one piece should not depend on the results from another. In statistics, this may or may not apply, depending on the problem at hand. Bootstrapping, a simple example of which I provide below, is one place where parallel processing can provide excellent returns. On the other hand, a typical maximum likelihood estimate using, for instance, a BFGS optimization routine would gain little from parallel processing, since step \(n+1\) depends on the results of step \(n\). (Unsurprisingly, things are a bit more complicated than this, and if you are really interested in learning about parallel processing, you may want to start by reading the Wikipedia entry.)
This simple example demonstrates how to calculate bootstrapped sample means of a given vector in parallel across a cluster. First, load the snow and rlecuyer packages. Of course, snow is what provides the parallel processing, but rlecuyer is equally important, as it ensures that the random numbers generated in each process are independent (snow also supports the rsprng package).
> library(snow)
> library(rlecuyer)
Now set up some sample data. Here I take 100 random draws, with replacement, from the integers in \([0,5]\).
> x <- sample(0:5, 100, replace = TRUE)
> mean(x)
[1] 2.64
Define a simple function to calculate a single bootstrapped mean from a given vector:
> bs.mean <- function(v) {
+     s <- sample(v, length(v), replace = TRUE)
+     mean(s)
+ }
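Before bringing a cluster into it, it can help to see what the serial version of the job looks like. The following is only a minimal sketch: the seed, the replicate count of 1000, and the name serial.means are my own assumptions for illustration, and it uses nothing beyond base R.

```r
# Same function as defined above
bs.mean <- function(v) {
  s <- sample(v, length(v), replace = TRUE)
  mean(s)
}

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
x <- sample(0:5, 100, replace = TRUE)

# 1000 bootstrapped means, computed one after another in a single R process
# (1000 is an arbitrary choice)
serial.means <- replicate(1000, bs.mean(x))

# Spread of the bootstrapped means: a bootstrap estimate of the
# standard error of mean(x)
sd(serial.means)
```

Each call to bs.mean is independent of the others, which is exactly what makes this job a good candidate for splitting across several R instances.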
Now it’s time to set up the cluster. Here I set up a SOCK-type connection, which can be used to start multiple R instances on the local machine and/or on remote machines through ssh connections. snow offers other connection options that may be more convenient or necessary depending on your environment (for instance, MPI was needed on the OSC cluster).
> cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")
Here, c("localhost", "localhost") tells snow where to set up the R instances, while type = "SOCK" is the connection type. If I also wanted to run a single instance on a remote machine named chuck, I could specify c("localhost", "localhost", "chuck"). In this case, I would be prompted for my ssh password for chuck, though snow would take care of the rest once the connection was authenticated.
Once the connections are set up, you will want to provide unique random seeds on each of the instances.
> clusterSetupRNG(cl)
[1] "RNGstream"
The return value, RNGstream, just tells you what type of RNG was set up. Finally, it’s time to do some work.
> clusterCall(cl, bs.mean, x)
[[1]]
[1] 2.81

[[2]]
[1] 2.61
clusterCall instructs all instances in cl to execute the function bs.mean on the vector x, both of which we defined above. The results are returned in a list with a length equal to the number of instances; e.g., had we included chuck in our call to makeCluster, clusterCall would have returned a list of three bootstrapped means. Because bs.mean doesn’t depend on anything calculated by the other processes, these bootstrapped means are calculated in parallel.
When you are done with the cluster, you should always stop it. Otherwise, you may have to kill R instances by hand.
> stopCluster(cl)
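Since clusterCall returns only one bootstrapped mean per instance, a real bootstrap needs some way of spreading many replicates over the cluster. The following is a minimal sketch of one way to do that, assuming snow's parSapply and clusterExport functions; the replicate count of 1000, the name bs.means, and the tryCatch wrapper are my own choices for illustration, not part of the walkthrough above.

```r
library(snow)
library(rlecuyer)

# Same function and sample data as in the walkthrough above
bs.mean <- function(v) {
  s <- sample(v, length(v), replace = TRUE)
  mean(s)
}
x <- sample(0:5, 100, replace = TRUE)

# Two R instances on the local machine
cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")

bs.means <- tryCatch({
  clusterSetupRNG(cl)                    # independent RNG streams per instance
  clusterExport(cl, c("bs.mean", "x"))   # copy the function and data to each instance
  # Split 1000 replicates across the instances (1000 is an arbitrary choice);
  # the index i is ignored and only sets how many draws are made
  parSapply(cl, 1:1000, function(i) bs.mean(x))
}, finally = stopCluster(cl))            # stop the cluster even if a step fails

# Simple percentile interval for the mean of x
quantile(bs.means, c(0.025, 0.975))
```

Putting stopCluster in the finally clause means the R instances are shut down even when one of the earlier steps throws an error, so there is nothing left to kill by hand.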
Like I said at the outset, this was just a very short and unoriginal introduction to parallel processing with snow. There are many other examples available online, a couple of which I provide links to below.
- A detailed guide by Luke Tierney (the author of snow) can be found here: http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html
- snow Simplified: http://www.sfu.ca/~sblay/R/snow.html
- Some REvolution Computing alternatives to snow are introduced here.
- Tal Galili provides a guide for parallel processing on Windows over at R-statistics.