parallelsugar: An implementation of mclapply for Windows
An easy way to run R code in parallel on a multicore system is with the mclapply() function. Unfortunately, mclapply() does not run in parallel on Windows machines because its implementation relies on forking, which Windows does not support.
Previously, I published a hackish solution that implemented a fake mclapply() for Windows users with one of the Windows-compatible parallel R strategies. You can find further details here.
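To make "Windows-compatible parallel R strategies" concrete, here is a minimal sketch of one such strategy: a socket cluster from the base parallel package. It is only an illustration of the general technique, not the code inside the package, and the choice of two workers is an arbitrary assumption.

```r
library(parallel)

# Start a socket (PSOCK) cluster. Unlike forking, this also works on Windows.
# Two workers is an arbitrary choice for this sketch.
cl <- makeCluster(2)

# parLapply() plays the role of lapply(), spreading the work over the workers.
squares <- parLapply(cl, 1:4, function(xx) xx^2)

# Always shut the workers down when finished.
stopCluster(cl)

str(squares)
```

The extra bookkeeping (creating and stopping the cluster by hand) is what a drop-in mclapply() replacement is meant to hide.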
Due to positive user feedback, I have wrapped that script into a simple R package: parallelsugar.
Installation
Step 0: If you do not already have devtools installed, install it using the instructions here. Note that for the purposes of this package, installing Rtools is not necessary.
Step 1: Install parallelsugar directly from my GitHub repository using install_github('nathanvan/parallelsugar'). For the purposes of this package, you may ignore the warning about Rtools (if you already have Rtools installed, the warning will not appear).
> library(devtools)
WARNING: Rtools is required to build R packages, but is not currently installed.
... snip ...
> install_github('nathanvan/parallelsugar')
Downloading github repo nathanvan/parallelsugar@master
Installing parallelsugar
... snip ...
* DONE (parallelsugar)
Usage examples
Basic Usage
On Windows, the following line will take about 40 seconds to run because, by default, mclapply from the parallel package is implemented as a serial function on Windows systems.
library(parallel)
system.time( mclapply(1:4, function(xx){ Sys.sleep(10) }) )
##    user  system elapsed
##    0.00    0.00   40.06
If we load parallelsugar, the default implementation of parallel::mclapply, which uses fork-based clusters, will be masked by parallelsugar::mclapply, which is implemented with socket clusters. The above line of code will then take closer to 10 seconds.
library(parallelsugar)
##
## Attaching package: ‘parallelsugar’
##
## The following object is masked from ‘package:parallel’:
##
##     mclapply
system.time( mclapply(1:4, function(xx){ Sys.sleep(10) }) )
##    user  system elapsed
##    0.04    0.08   12.98
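Because the masking only matters on Windows, a script shared across platforms could attach parallelsugar conditionally. The following is a small sketch of that idea, not something the package requires; the variable name on.windows is my own, and the requireNamespace() guard is an assumption added so the script still runs where the package is not installed.

```r
library(parallel)

# Attach parallelsugar only on Windows, and only if it is installed.
on.windows <- .Platform$OS.type == "windows"
if (on.windows && requireNamespace("parallelsugar", quietly = TRUE)) {
  library(parallelsugar)  # masks parallel::mclapply
}

# The call itself is identical on every platform.
res <- mclapply(1:4, function(xx) xx^2)
str(res)
```

On Linux and macOS this leaves the fork-based parallel::mclapply in place, so nothing is copied unnecessarily there.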
Use of global variables and packages
By design, parallelsugar approximates a fork-based cluster: every object that is in scope in the master R process is copied over to the worker processes on the other sockets. This implies that
- you can quickly run out of memory, and
- you can waste a lot of time copying over unnecessary objects hanging around in your R session.
Be warned!
## Load a package
library(Matrix)

## Define a global variable
a.global.variable <- Matrix::Diagonal(3)

## Define a global function
wait.then.square <- function(xx){
  ## Wait for 5 seconds
  Sys.sleep(5);
  ## Square the argument
  xx^2
}

## Check that it works with plain lapply
serial.output <- lapply( 1:4, function(xx) {
  return( wait.then.square(xx) + a.global.variable )
})

## Test with the modified mclapply
par.output <- mclapply( 1:4, function(xx) {
  return( wait.then.square(xx) + a.global.variable )
})

## Are they equal?
all.equal( serial.output, par.output )
## [1] TRUE
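If the memory and copying costs warned about above become a problem, one possible workaround is to skip the copy-everything convenience and export only what the workers need, using the base parallel package directly. The sketch below is an illustration under assumptions: the objects big.unused.object, offset, and add.offset are hypothetical stand-ins, and this is not how parallelsugar itself is implemented.

```r
library(parallel)

## Hypothetical session contents, for illustration only
big.unused.object <- rnorm(1e6)          # clutter the workers do not need
offset <- 10                             # a global the workers do need
add.offset <- function(xx) xx + offset   # the function the workers will run

## List object sizes (in bytes) in the current environment, largest first,
## to see what a copy-everything approach would have to ship.
env <- environment()
sizes <- vapply(ls(env), function(nm) {
  as.numeric(object.size(get(nm, envir = env)))
}, numeric(1))
print(sort(sizes, decreasing = TRUE))

## Export only the two objects that are actually needed.
cl <- makeCluster(2)
clusterExport(cl, c("offset", "add.offset"), envir = env)
lean.output <- parLapply(cl, 1:4, function(xx) add.offset(xx))
stopCluster(cl)

str(lean.output)
```

The trade-off is more typing per call in exchange for control over what gets copied. Removing large leftovers with rm() before calling mclapply is a simpler alternative when you want to keep the drop-in interface.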
Request for feedback and help
I put this together because it helped to solve a specific problem that I was having. If it solves your problem, please let me know. If it needs to be modified to solve your problem, please either
- open an issue on GitHub, or
- even better, fork, fix, and issue a pull request.