doFuture: A universal foreach adaptor ready to be used by 1,000+ packages

[This article was first published on jottR, and kindly contributed to R-bloggers.]

doFuture 0.4.0 is available on CRAN. The doFuture package provides a universal foreach adaptor enabling any future backend to be used with the foreach() %dopar% { ... } construct. As shown below, this allows foreach() to parallelize not only on multiple cores, multiple background R sessions, and ad-hoc clusters, but also on cloud-based clusters and high-performance compute (HPC) environments.

1,300+ R packages on CRAN and Bioconductor depend, directly or indirectly, on foreach for their parallel processing. By using doFuture, a user has the option to parallelize those computations on more compute environments than previously supported, especially HPC clusters. Notably, all plyr code with .parallel = TRUE will be able to take advantage of this without need for modifications – this is possible because internally plyr makes use of foreach for its parallelization.
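To make the plyr case concrete, here is a minimal sketch, assuming the plyr and doFuture packages are installed. The toy input and the squaring function are made up for illustration; only the .parallel = TRUE argument matters.

```r
## Register doFuture as the foreach adaptor; plyr picks it up via foreach
library("doFuture")
registerDoFuture()

## Any future backend could be chosen here; sequential keeps the sketch simple
plan(sequential)

## Unmodified plyr code: .parallel = TRUE hands the iterations to foreach
y <- plyr::llply(1:3, function(x) x^2, .parallel = TRUE)
```

Switching to another plan(), for instance one of the backends in the quick example further down, should not require touching the llply() call itself.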

With doFuture, foreach can process your code in more places than ever before. Alright, it may not be able to process this programmer’s 62,500 punched cards.

What is new in doFuture 0.4.0?

For full details on updates, please see the NEWS file. The doFuture package installs out-of-the-box on all operating systems.

A quick example

Here is a bootstrap example using foreach adapted from help("clusterApply", package = "parallel"). I use this example to illustrate how to perform foreach() iterations in parallel on a variety of backends.

library("boot")

run <- function(...) {
  cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
  boot(cd4, corr, R = 10000, sim = "parametric", ran.gen = cd4.rg, mle = cd4.mle)
}

## Attach doFuture (and foreach), and tell foreach to use futures
library("doFuture")
registerDoFuture()

## Sequentially on the local machine
plan(sequential)
system.time(boot <- foreach(i = 1:100, .packages = "boot") %dopar% { run() })
##    user  system elapsed 
## 298.728   0.601 304.242

## In parallel on the local machine (with 8 cores)
plan(multiprocess)
system.time(boot <- foreach(i = 1:100, .packages = "boot") %dopar% { run() })
##    user  system elapsed 
## 452.241   1.635  68.740

## In parallel on an ad-hoc cluster (5 machines with 4 workers each)
nodes <- rep(c("n1", "n2", "n3", "n4", "n5"), each = 4L)
plan(cluster, workers = nodes)
system.time(boot <- foreach(i = 1:100, .packages = "boot") %dopar% { run() })
##    user  system elapsed
##   2.046   0.188  22.227

## In parallel on Google Compute Engine (10 r-base Docker containers)
## Attach googleComputeEngineR so that gce_ssh_setup() and as.cluster() are found
library("googleComputeEngineR")
vms <- lapply(paste0("node", 1:10), FUN = gce_vm, template = "r-base")
vms <- lapply(vms, FUN = gce_ssh_setup)
cl <- as.cluster(vms, docker_image = "henrikbengtsson/r-base-future")
plan(cluster, workers = cl)
system.time(boot <- foreach(i = 1:100, .packages = "boot") %dopar% { run() })
##    user  system elapsed
##   0.952   0.040  26.269

## In parallel on an HPC cluster with a TORQUE / PBS scheduler
## (Note that the timing below includes waiting time in the job queue)
plan(future.BatchJobs::batchjobs_torque, workers = 10)
system.time(boot <- foreach(i = 1:100, .packages = "boot") %dopar% { run() })
##    user  system elapsed
##  15.568   6.778  52.024

About .export and .packages

When using doFuture::registerDoFuture(), there is no need to manually specify which global variables (argument .export) to export. By default, the doFuture backend automatically identifies and exports all globals needed. This is done using recursive static-code inspection. The same is true for packages that need to be attached; those will also be handled automatically and there is no need to specify them manually via argument .packages. This is in line with how it works for regular future constructs, e.g. y %<-% { a * sum(x) }.
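The following is a minimal sketch of that automatic handling, assuming a local machine where two background R sessions can be launched. The variables a and x are illustrative; note that neither .export nor .packages is given.

```r
library("doFuture")
registerDoFuture()

## Two background R sessions, so globals really have to be shipped to workers
plan(multisession, workers = 2)

## Globals defined in the calling environment
a <- 2
x <- 1:5

## No .export argument: doFuture identifies 'a' and 'x' by code inspection
y <- foreach(i = 1:3, .combine = c) %dopar% {
  a * sum(x) + i
}

## Shut down the background sessions again
plan(sequential)
```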

Having said this, you may still want to specify arguments .export and .packages, because otherwise there is a risk that your foreach() statement will not work with other foreach adaptors, e.g. doParallel and doSNOW. Exactly when and where a failure may occur depends on the nestedness of your code and the location of your global variables. Specifying .export and .packages manually skips such automatic identification.
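As a sketch of the explicit style, the example below names its globals and packages itself. The helper scale_sum() and the variable a are hypothetical; the point is that a is only used inside a function, which is the kind of nested global that other adaptors may not pick up on their own.

```r
library("doFuture")
registerDoFuture()
plan(sequential)

## 'a' is a global that is only referenced inside scale_sum()
a <- 2
scale_sum <- function(x) a * sum(x)

## Spell out globals and packages so the statement does not rely on
## doFuture's automatic identification
y <- foreach(i = 1:3, .combine = c,
             .export = c("a", "scale_sum"),
             .packages = "stats") %dopar% {
  scale_sum(median(c(i, i + 2)))
}
```

Written this way, the foreach() statement carries everything it needs, so it should be a reasonable starting point if another adaptor is registered instead.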

Finally, I recommend that you as a developer always try to write your code in such a way that users can choose their own futures: The developer decides what should be parallelized – the user chooses how.
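One way to read this division of labor is sketched below, assuming a hypothetical developer-written function slow_sqrt(). The function contains no plan() call; the choice of backend is left entirely to the user's session.

```r
library("doFuture")

## Developer side (hypothetical function): states *what* runs in parallel,
## but does not set any plan()
slow_sqrt <- function(xs) {
  foreach(x = xs, .combine = c) %dopar% {
    sqrt(x)
  }
}

## User side: register the adaptor and decide *how* to run
registerDoFuture()

plan(sequential)                   # run in the current R session
y_seq <- slow_sqrt(c(1, 4, 9))

plan(multisession, workers = 2)    # run in two background R sessions
y_par <- slow_sqrt(c(1, 4, 9))

plan(sequential)                   # shut down the background sessions
```

The same slow_sqrt() call gives the same result under both plans; only where the work is carried out changes.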

Happy futuring!
