Load Balanced Parallelization with snowfall

[This article was first published on UEB Blog. Musings on R, and kindly contributed to R-bloggers.]

For some reason, a few months ago I overlooked the best way to run a parallelized version of lapply with the snowfall package.

While developing our pipeline prototype for Exome Variant Analysis (https://launchpad.net/eva), we replaced lapply with its parallel counterpart in snowfall, sfLapply.

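However, I’ve just read the nice tutorial by Knaus & Porzelius (2009), in which they include a diagram that clarifies why sfClusterApplyLB can be the better choice when you want a load-balanced version of your own code. sfLapply splits the input into one chunk per CPU up front, so a CPU that happens to receive the slow items keeps working while the others sit idle; sfClusterApplyLB instead hands the next item to whichever CPU becomes free.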

So we changed the critical call, which was easy, from:

  # ...
  start3 <- Sys.time()
  # seq_along() also copes with an empty file list, unlike 1:length()
  result2 <- sfLapply(seq_along(params$file_list), wrapper2.parallelizable.per.sample)
  duration <- Sys.time() - start3
  # ...

to:

  # ...
  start3 <- Sys.time()
  result2 <- sfClusterApplyLB(seq_along(params$file_list), wrapper2.parallelizable.per.sample)
  duration <- Sys.time() - start3
  # ...

(As you can see, we are parallelizing per sample here, not per process within each sample. One thing at a time: we only have a few spare CPUs in our servers, and we are not yet running the process on a real cluster.)

With our test datasets (a couple of small files for debugging purposes) we cannot see any great difference, but we look forward to measuring the potential improvement (let’s hope there is one) soon on real-case scenarios, in which some samples are far bigger than others.
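To get a feel for the effect without real data, here is a minimal sketch that fakes uneven samples with Sys.sleep. The function fake.sample and its timings are made up for illustration (they stand in for wrapper2.parallelizable.per.sample), and the sketch assumes snowfall is installed and two local CPUs are free.

```r
library(snowfall)

# Hypothetical stand-in for a per-sample wrapper: the first four
# "samples" are slow (0.5 s each), the last four are fast (0.05 s each).
fake.sample <- function(i) {
  Sys.sleep(if (i <= 4) 0.5 else 0.05)
  i
}

sfInit(parallel = TRUE, cpus = 2)

# sfLapply: the 8 items are split into 2 chunks up front,
# so one CPU ends up with all the slow samples.
t.chunked <- system.time(
  res.chunked <- sfLapply(1:8, fake.sample)
)["elapsed"]

# sfClusterApplyLB: each CPU fetches the next item as soon as it is free.
t.balanced <- system.time(
  res.balanced <- sfClusterApplyLB(1:8, fake.sample)
)["elapsed"]

sfStop()

cat("sfLapply:         ", t.chunked, "s\n")
cat("sfClusterApplyLB: ", t.balanced, "s\n")
```

Both calls return the same list in the same order; only the scheduling differs. On an otherwise idle machine the load-balanced run should finish noticeably sooner here, because the slow items get spread over both CPUs, but the exact timings depend on the machine.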

My to-do list has a new entry for another interesting function, “sfClusterApplySR”, which is also explained in the standard snowfall vignette:

Quote:
Another helpful function for long running clusters is sfClusterApplySR, which saves intermediate results after processing n-indices (where n is the amount of CPUs). If it is likely you have to interrupt your program (probably because of server maintenance) you can start using sfClusterApplySR and restart your program without the results produced up to the shutdown time.


We also hope to find some time in the coming months to test a similar parallelization process with the “parallel” package (although I have no clue yet whether it offers an equivalent approach for load-balanced parallelization).

Some day…
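As a starting point for that test: the parallel package that ships with R does export clusterApplyLB and parLapplyLB alongside parLapply. The following is only a minimal sketch of how the calls line up, with a made-up task (square.it) instead of our real per-sample wrapper, and assuming two local worker processes can be started.

```r
library(parallel)

# Hypothetical toy task; a real per-sample wrapper would go here.
square.it <- function(i) i^2

cl <- makeCluster(2)

# Chunked up front, similar in spirit to sfLapply.
res.chunked <- parLapply(cl, 1:8, square.it)

# Load-balanced variants: items are handed out as workers become free.
res.balanced  <- clusterApplyLB(cl, 1:8, square.it)
res.balanced2 <- parLapplyLB(cl, 1:8, square.it)

stopCluster(cl)

print(identical(res.chunked, res.balanced))
```

All three calls return a list in input order, so swapping one for another should not require changes downstream. Whether the load-balanced variants pay off for our pipeline is something we still have to measure.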
