Load Balanced Parallelization with snowfall
For some reason, a few months ago I overlooked the best way to run a parallelized version of lapply with the snowfall package.
While developing our pipeline prototype for Exome Variant Analysis ( https://launchpad.net/eva ), we implemented the parallel version of lapply with the function sfLapply.
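However, I’ve just read the nice tutorial from Knaus & Porzelius (2009), which includes a diagram clarifying why sfClusterApplyLB can be the better choice when you want a load-balanced version of your own code. sfLapply splits the work into one fixed chunk per worker up front, whereas sfClusterApplyLB hands the next task to whichever worker becomes free first.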
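As a rough stand-in for that diagram, the toy simulation below compares the two scheduling strategies in plain R, without starting a cluster. The run times and the two helper functions (static_makespan, balanced_makespan) are made up for illustration, and the up-front split via parallel::splitIndices is only an approximation of how sfLapply chunks its input.

```r
# Hypothetical per-sample run times in minutes (made-up numbers):
# two big samples followed by six small ones.
durations <- c(9, 8, 1, 1, 1, 1, 1, 1)
n_workers <- 2

# Static scheduling: split the task indices into one contiguous chunk
# per worker before any work starts; the slowest chunk sets the total time.
static_makespan <- function(durations, n_workers) {
  chunks <- parallel::splitIndices(length(durations), n_workers)
  max(sapply(chunks, function(idx) sum(durations[idx])))
}

# Load-balanced scheduling: hand each task, in order, to whichever
# worker becomes free first.
balanced_makespan <- function(durations, n_workers) {
  free_at <- numeric(n_workers)      # time at which each worker is idle again
  for (d in durations) {
    w <- which.min(free_at)          # first worker to become free
    free_at[w] <- free_at[w] + d
  }
  max(free_at)
}

static_makespan(durations, n_workers)    # 19
balanced_makespan(durations, n_workers)  # 12
```

With these made-up numbers, the up-front split leaves one worker with both big samples (19 minutes in total), while handing out tasks one at a time finishes in 12.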
Therefore, we changed the critical line, easily, from:
# ...
start3 <- Sys.time()
result2 <- sfLapply(1:length(params$file_list), wrapper2.parallelizable.per.sample)
duration <- Sys.time() - start3
# ...
to:
# ...
start3 <- Sys.time()
result2 <- sfClusterApplyLB(1:length(params$file_list), wrapper2.parallelizable.per.sample)
duration <- Sys.time() - start3
# ...
(As you can see, we are parallelizing per sample here, not per process within each sample. One thing at a time: we only have a few spare CPUs in our servers, and we are not running the pipeline on a real cluster yet.)
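For readers who want to try the swap outside our pipeline, here is a minimal, self-contained sketch. It assumes the snowfall package is installed and two local CPUs are available; process_sample and sample_ids are hypothetical stand-ins for our wrapper function and file list, not the actual pipeline code.

```r
library(snowfall)

# Hypothetical stand-in for the per-sample wrapper:
# sample 1 is "big" (sleeps longer), the others are small.
process_sample <- function(i) {
  Sys.sleep(if (i == 1) 0.5 else 0.1)
  i * 10
}

sample_ids <- 1:6   # made-up stand-in for the indices of the sample files

sfInit(parallel = TRUE, cpus = 2)   # start two local worker processes

# Same arguments, different scheduling strategy
plain    <- sfLapply(sample_ids, process_sample)
balanced <- sfClusterApplyLB(sample_ids, process_sample)

sfStop()                            # shut the workers down again

# Both calls return one result per sample, in input order
identical(plain, balanced)
```

Since the two functions take the same arguments here, switching between them for timing comparisons is a one-word change.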
With our test datasets (a couple of small files for debugging purposes), we cannot see any real difference, but we’ll be glad to check the potential improvement (let’s hope so) shortly with real-case scenarios, in which some samples are way bigger than others…
My to-do list has a new entry for another interesting function, sfClusterApplySR, which is also explained in the standard snowfall vignettes. It saves intermediate results so that an interrupted run can be restored.
We also hope to find some time in the coming months to test a similar parallelization process with the “parallel” package (even if I have no clue yet whether it offers an equivalent approach for load-balanced parallelization).
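For what it’s worth, the parallel package does ship a function named clusterApplyLB, which looks like the natural counterpart to try. The sketch below is untested on our pipeline; process_sample is a hypothetical stand-in for the per-sample wrapper, and the cluster is just two local worker processes.

```r
library(parallel)

# Hypothetical stand-in for the per-sample wrapper:
# sample 1 is "big" (sleeps longer), the others are small.
process_sample <- function(i) {
  Sys.sleep(if (i == 1) 0.5 else 0.1)
  i^2
}

cl <- makeCluster(2)    # two local worker processes

# Each index is handed to whichever worker is free next;
# results still come back in input order.
result <- clusterApplyLB(cl, 1:6, process_sample)

stopCluster(cl)

unlist(result)
```

The same package also provides parLapplyLB, and mclapply accepts mc.preschedule = FALSE, which might be worth comparing on fork-capable systems.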
Some day…