Parallelism via “parSapply”
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In an earlier post, I used mclapply to kick off parallel R processes and to demonstrate inter-process synchronization via the flock package. Although I have been using this approach to parallelism for a few years now, I admit, it has certain important disadvantages. It works only on a single machine, and also, it doesn’t work on Windows.
Hence, to test the flock package on Windows, I had to resort to the alternative implementation. I liked it quite a bit and learned a few things on the way, so here it is.
require(RSQLite) require(parallel) require(flock) db.path = "C:/ttt.sqlite" lock.name = "C:/file.lock" con <- dbConnect(RSQLite::SQLite(), dbname=db.path) df <- data.frame(value=0) dbWriteTable(con, "test", df, overwrite=T) dbDisconnect(con) write.one.value <- function(val) { # Take an exclusive lock ll = lock(lock.name) # The critical section code con <- dbConnect(RSQLite::SQLite(), dbname=db.path) dbWriteTable(con, "test", data.frame(value=val), append=TRUE) dbDisconnect(con) # Release the lock unlock(ll) } write.values = function(cores, db.path, lock.name) { if(cores > 1) { cl = makeCluster(cores) # Load the packages into all slave processes clusterEvalQ(cl=cl, library(RSQLite)) clusterEvalQ(cl=cl, library(flock)) # Make variables visible in the work-horse function. clusterExport(cl, c("db.path", "lock.name")) tt = parSapply(cl=cl, 1:1000, write.one.value) stopCluster(cl) } else { environment(write.one.value) = environment() tt = lapply(1:1000, write.one.value) } } write.values(1, db.path, lock.name)
The interesting piece is write.values. First, it shows a neat R feature, namely, how to make variables visible in the callee without passing everything as arguments. It also shows how to branch between single core and single machine execution (the lapply branch) and a parallel execution (yes, via parSapply one can spawn process even across multiple machines, depends on the setup of the cluster). Why is that important? Often the errors returned from the single core (lapply) version are more clear and meaningful than the errors returned from the parallel version. In other words, until things are stable and bugs are out – the single core version is essential.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.