Site icon R-bloggers

future: Reproducible RNGs, future_lapply() and more

[This article was first published on jottR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

future 1.3.0 is available on CRAN. With futures, it is easy to write R code once, which the user can choose to evaluate in parallel using whatever resources s/he has available, e.g. a local machine, a set of local machines, a set of remote machines, a high-end compute cluster (via future.BatchJobs and soon also future.batchtools), or in the cloud (e.g. via googleComputeEngineR).

Futures makes it easy to harness any resources at hand.
Thanks to great feedback from the community, this new version provides:
For full details on updates, please see the NEWS file. The future package installs out-of-the-box on all operating systems.

A quick example

The bootstrap example of help("clusterApply", package = "parallel") adapted to make use of futures.
library("future")
library("boot")

run <- function(...) {
  cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
  boot(cd4, corr, R = 5000, sim = "parametric", ran.gen = cd4.rg, mle = cd4.mle)
}

# base::lapply()
system.time(boot <- lapply(1:100, FUN = run))
###    user  system elapsed 
### 133.637   0.000 133.744

# Sequentially on the local machine
plan(sequential)
system.time(boot0 <- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
###    user  system elapsed 
### 134.916   0.003 135.039 

# In parallel on the local machine (with 8 cores)
plan(multisession)
system.time(boot1 <- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
###    user  system elapsed
###   0.960   0.041  29.527 
stopifnot(all.equal(boot1, boot0))

What’s next?

The future.BatchJobs package, which builds on top of BatchJobs, provides future strategies for various HPC schedulers, e.g. SGE, Slurm, and TORQUE / PBS. For example, by using plan(batchjobs_torque) instead of plan(multiprocess) your futures will be resolved distributed on a compute cluster instead of parallel on your local machine. That’s it! However, since last year, the BatchJobs package has been decommissioned and the authors recommend everyone to use their new batchtools package instead. Just like BatchJobs, it is a very well written package, but at the same time it is more robust against cluster problems and it also supports more types of HPC schedulers. Because of this, I’ve been working on future.batchtools which I hope to be able to release soon.

Finally, I’m really keen on looking into how futures can be used with Shaun Jackman’s lambdar, which is a proof-of-concept that allows you to execute R code on Amazon’s “serverless” AWS Lambda framework. My hope is that, in a not too far future (pun not intended*), we’ll be able to resolve our futures on AWS Lambda using plan(aws_lambda).

Happy futuring!

(*) Alright, I admit, it was intended.

Links

See also

To leave a comment for the author, please follow the link and comment on their blog: jottR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.