A very short and unoriginal introduction to snow

Posted on April 2, 2011 by Jason in R bloggers | 0 Comments

[This article was first published on Left Censored » R, and kindly contributed to R-bloggers.]

As Jian-Feng rightly pointed out in a comment on my guide to setting up snow on the OSC cluster, it was probably somewhat cavalier of me to say:

Getting snow to run properly on single machines, or even with a cluster of machines via ssh connections is fairly trivial.

In an effort to redeem myself, I provide this very short and unoriginal introduction to using snow. But first a caveat: to make the most of parallel processing in R, or in any other environment, the problem you are trying to solve must be amenable to being broken up into smaller, (mostly) independent pieces. In other words, the results from one piece should not depend on the results from another. In statistics, this may or may not apply, depending on the problem at hand. Bootstrapping, a simple example of which I provide below, is one place where parallelization can provide excellent returns. On the other hand, a typical maximum likelihood estimate using, for instance, a BFGS optimization routine would gain little from parallel processing, since step \(n+1\) depends on the results of step \(n\). (Unsurprisingly, things are a bit more complicated than this, and if you are really interested in learning about parallel processing, you may want to start by reading the Wikipedia entry.)
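To make the caveat concrete, here is a minimal base R sketch contrasting the two situations. It does not use snow at all. The data vector v and the helper newton_sqrt are made up for illustration, and Newton's method stands in for BFGS only because it shows the same step-by-step dependence in a few lines.

```r
# Independent pieces: each bootstrap mean uses only the original data,
# so the replicates could be computed in any order, or on different workers.
set.seed(1)
v <- c(2, 4, 4, 5, 1, 0, 3)  # illustrative data, not from the article
boot_means <- sapply(1:200, function(i) {
  mean(sample(v, length(v), replace = TRUE))
})

# Dependent pieces: each Newton iterate needs the previous one,
# so step n + 1 cannot start before step n has finished.
newton_sqrt <- function(a, steps = 20) {  # hypothetical helper
  est <- a
  for (n in seq_len(steps)) {
    est <- 0.5 * (est + a / est)
  }
  est
}
root <- newton_sqrt(2)
```

The sapply call is the kind of loop that can be handed to a cluster. The for loop inside newton_sqrt is not, because every pass reads the value written by the pass before it.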

This simple example demonstrates how to calculate bootstrapped sample means of a given vector in parallel across a cluster. First, load the snow and rlecuyer packages. Of course, snow is what provides the parallel processing, but rlecuyer is equally important, as it ensures that the random numbers generated in each process come from independent streams (snow also supports the rsprng package).

> library(snow)
> library(rlecuyer)

Now set up some sample data. Here I take 100 random draws, with replacement, from the integers in \([0,5]\).

> x <- sample(0:5, 100, replace = TRUE)
> mean(x)
[1] 2.64

Define a simple function to calculate a single bootstrapped mean from a given vector:

> bs.mean <- function(v) {
+   s <- sample(v, length(v), replace = TRUE)
+   mean(s)
+ }

Now it’s time to set up the cluster. Here I create a SOCK-type cluster, which can start multiple R instances on the local machine and/or R instances on remote machines through ssh connections. snow offers other connection options that may be more convenient or necessary depending on your environment (for instance, MPI was needed on the OSC cluster).

> cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")

Here, c("localhost", "localhost") tells snow where to set up the R instances, while type = "SOCK" is obviously the connection type. If I also wanted to run a single instance on a remote machine named chuck, I could specify c("localhost", "localhost", "chuck"). In this case, I would be prompted for my ssh password for chuck, though snow would take care of the rest once the connection was authenticated.
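If you want more than two local instances, typing out the host names gets tedious. A small sketch, assuming you want four local workers (the count n_workers is an arbitrary choice for illustration), builds the same kind of character vector with rep:

```r
library(snow)

n_workers <- 4  # illustrative; pick what suits your machine
hosts <- rep("localhost", n_workers)

cl4 <- makeCluster(hosts, type = "SOCK")
n_started <- length(cl4)  # one entry per R instance in the cluster
stopCluster(cl4)
```

The cluster is named cl4 here only to keep it separate from the two-instance cl used in the rest of this post.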

Once the connections are set up, you will want to give each instance its own independent random number stream.

> clusterSetupRNG(cl)
[1] "RNGstream"

The return value, RNGstream, just tells you what type of RNG was set up. Finally, it’s time to do some work.

> clusterCall(cl, bs.mean, x)
[[1]]
[1] 2.81

[[2]]
[1] 2.61

clusterCall instructs all instances in cl to execute the function bs.mean on the vector x, both of which we defined above. The results are returned in a list with a length equal to the number of instances; e.g., had we included chuck in our call to makeCluster, clusterCall would have returned a list of three bootstrapped means. Because bs.mean doesn’t depend on anything calculated by the other processes, these bootstrapped means are calculated in parallel.
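clusterCall gives one bootstrapped mean per instance, which is rarely enough. One way to get many replicates is snow's parSapply, which splits a vector of indices across the instances. The following is a self-contained sketch rather than the only way to do it: the replicate count of 1000, the names workers and boot, and the wrapper function are my own choices for illustration. The data and bs.mean are passed positionally, since named arguments such as x or f could be captured by snow's own argument names along the way.

```r
library(snow)
library(rlecuyer)

bs.mean <- function(v) {
  s <- sample(v, length(v), replace = TRUE)
  mean(s)
}
x <- sample(0:5, 100, replace = TRUE)

workers <- makeCluster(c("localhost", "localhost"), type = "SOCK")
clusterSetupRNG(workers)

# The indices 1:1000 are split across the instances. The index i is ignored;
# it only determines how many replicates are drawn.
boot <- parSapply(workers, 1:1000,
                  function(i, v, stat) stat(v),
                  x, bs.mean)

stopCluster(workers)

# A rough 95% percentile interval for the mean
quantile(boot, c(0.025, 0.975))
```

Passing bs.mean as an argument means it is shipped to the instances along with the call, so nothing has to be exported to them beforehand.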

When you are done with the cluster, you should always stop it. Otherwise, you may have to kill R instances by hand.

> stopCluster(cl)
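If the work in between fails with an error, stopCluster never gets called. One hedged way to guard against that, sketched here with base R's tryCatch, is to put the call in a finally clause so it runs whether or not the work succeeds. The names tmp_cl and pids and the use of Sys.getpid as the "work" are placeholders for illustration.

```r
library(snow)

tmp_cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")

pids <- tryCatch({
  # Placeholder work: ask each instance for its process ID
  clusterCall(tmp_cl, function() Sys.getpid())
}, finally = stopCluster(tmp_cl))  # runs on success and on error
```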

As I said at the outset, this was just a very short and unoriginal introduction to parallel processing with snow. There are many other examples available online, a couple of which I provide links to below.

Luke Tierney’s (the author of snow) detailed guide can be found here: http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html
snow Simplified: http://www.sfu.ca/~sblay/R/snow.html
Some REvolution Computing alternatives to snow are introduced here.
Tal Galili provides a guide for parallel processing on Windows over at R-statistics.
