Make pleasingly parallel R code with rxExecBy

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

Some things are easy to convert from a long-running sequential process to a system where each part runs at the same time, thus reducing the required time overall. We often call these "embarrassingly parallel" problems, but given how easy it is to reduce the time it takes to execute them by converting them into a parallel process, "pleasingly parallel" may well be a more appropriate name.

Using the foreach package (available on CRAN) is one simple way of speeding up pleasingly parallel problems using R. A foreach loop is much like a regular for loop in R, and by default will run each iteration in sequence (again, just like a for loop). But by registering a parallel "backend" for foreach, you can run many (or maybe even all) iterations at the same time, using multiple processors on the same machine, or even multiple machines in the cloud.
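As a rough sketch of that difference, the snippet below runs the same loop twice: once sequentially with `%do%`, and once in parallel with `%dopar%` after registering a backend. It assumes the doParallel package (also on CRAN) as the backend and a two-worker local cluster; the `Sys.sleep()` call is only a stand-in for real work.

```r
library(foreach)
library(doParallel)  # one possible backend; others exist

# Sequential: %do% runs each iteration in turn, like a regular for loop.
seq_result <- foreach(i = 1:8, .combine = c) %do% {
  Sys.sleep(0.1)  # placeholder for an expensive computation
  i^2
}

# Parallel: register a backend, then switch %do% to %dopar%.
cl <- parallel::makeCluster(2)  # two local worker processes (an assumption)
registerDoParallel(cl)

par_result <- foreach(i = 1:8, .combine = c) %dopar% {
  Sys.sleep(0.1)
  i^2
}

parallel::stopCluster(cl)  # always release the workers when done

# Both loops should produce the same values, in the same order.
identical(seq_result, par_result)
```

The only change between the two loops is the operator, which is what makes this kind of conversion so painless.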

For many applications, though, you need to provide a different chunk of data to each iteration to process. (For example, you may need to fit a statistical model within each country — each iteration will then only need the subset for one country.) You could just pass the entire data set into each iteration and subset it there, but that's inefficient and may even be impractical when dealing with very large datasets sitting in a remote repository. A better idea would be to leave the data where it is, and run R within the data repository, in parallel.
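To make the "one chunk per iteration" idea concrete, here is a minimal sketch using the built-in `mtcars` data, with the `cyl` column standing in for the country variable (an illustrative substitution, not data from the original post). The data is split locally first, so this shows the pattern rather than the in-repository approach.

```r
library(foreach)

# Split the data into one data frame per group ("cyl" plays the role of country).
chunks <- split(mtcars, mtcars$cyl)

# Fit a small model within each group; each iteration sees only its own chunk.
# With a parallel backend registered, %do% could be swapped for %dopar%.
fits <- foreach(chunk = chunks) %do% {
  coef(lm(mpg ~ wt, data = chunk))
}
names(fits) <- names(chunks)

fits
```

This works well when the whole data set fits in memory on the machine running R; the cost is that the data has to be pulled to that machine before it can be split.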

Microsoft R 9.1 introduces a new function, rxExecBy, for exactly this purpose. When your data is sitting in SQL Server or Spark, you can specify a set of keys to partition the data by, and an R function (any R function, built-in or user-defined) to apply to the partitions. The data doesn't actually move: R runs directly on the data platform. You can also run it on local data in various formats.
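A call on local data might look roughly like the sketch below. It assumes the RevoScaleR package that ships with Microsoft R Client and Microsoft R Server, the default local compute context, and a user function that accepts the partition's keys and data; `fit_partition` and the choice of `mtcars` partitioned by `cyl` are hypothetical names picked for illustration. The call is guarded so the script still runs where RevoScaleR is not installed.

```r
# Hypothetical user function: rxExecBy is assumed to call it once per partition,
# passing the partition's key values and a data source for that partition.
fit_partition <- function(keys, data, ...) {
  df <- RevoScaleR::rxImport(inData = data)  # read the partition into a data frame
  coef(lm(mpg ~ wt, data = df))
}

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  # Partition mtcars by "cyl" and apply fit_partition to each partition.
  results <- RevoScaleR::rxExecBy(inData = mtcars,
                                  keys   = c("cyl"),
                                  func   = fit_partition)
  str(results, max.level = 2)  # inspect what came back for each partition
} else {
  message("RevoScaleR is not installed; skipping the rxExecBy call.")
}
```

For SQL Server or Spark, the same pattern would presumably use a matching data source object and compute context in place of the in-memory data frame; the linked Microsoft R Blog post covers those cases.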

The rxExecBy function is included in Microsoft R Client (available free) and Microsoft R Server. For some examples of using rxExecBy, take a look at the Microsoft R Blog post linked below.

Microsoft R Blog: Running Pleasingly Parallel workloads using rxExecBy on Spark, SQL, Local and Localpar compute contexts
