The Evolution of Distributed Programming in R
By Paulin Shek
Both R and distributed programming rank highly on my list of “good things”, so imagine my delight when two new packages for distributed programming in R, ddR (https://github.com/vertica/ddR) and multidplyr (https://github.com/hadley/multidplyr), were released in November last year.
Distributed programming is normally taken up for a variety of reasons:
- To speed up a process or piece of code
- To scale up an interface or application for multiple users
There has been a huge appetite for this in the R community for a long time, so my first thought was “Why now? Why not before?”.
From a quick look at CRAN’s High Performance Computing page, we can see the mass of packages already available for related problems. None of them have quite the same focus as ddR and multidplyr, though. Let me explain. R has many features that make it unique and great. It is high-level, interactive and, most importantly, it has a huge number of packages. It would be a huge shame to lose these features, or to be unable to use those packages, when writing R code to be run on a cluster.
Traditionally, distributed programming has contrasted with these principles, with much more focus on low-level infrastructure, such as communication between nodes on a cluster. Popular R packages that dealt with this in the past are the now-deprecated snow and multicore (released on CRAN in 2003 and 2009 respectively). However, working with the low-level functionality of a cluster can detract from analysis work because it requires a slightly different skill set.
In addition, the needs of R users are changing, in part because of big data. Data scientists now need to run experiments on, analyse and explore much larger data sets, where computations can be time-consuming. Given the fluid nature of exploratory analysis, this can be a huge hindrance. For the same reason, there is a need to write parallelised code without having to think too hard about low-level considerations, and for it to be fast to write as well as easy to read. My point is that fast parallelised code should not just be for production code. The answer to this is an interactive scripting language that can be run on a cluster.
The package written to replace snow and multicore is the parallel package, which includes modified versions of both. It starts to bridge the gap between R and more low-level work by providing a unified interface to cluster management systems. The big advantage of this is that the R code stays the same regardless of which protocol for communicating with the cluster is being used under the covers.
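For illustration only (the cluster size and type below are arbitrary, not taken from the original post), here is a minimal sketch of that unified interface: the apply-style call is identical whichever backend makeCluster() creates.

# Minimal sketch: the R code stays the same whichever cluster type is used
library(parallel)

cl <- makeCluster(4, type = "PSOCK")   # "FORK" would also work on Unix-alikes,
                                       # with no change to the code below

# parLapply() farms the calls to foo out across the workers
foo <- function(x) x^2
squares <- parLapply(cl, 1:100, foo)

stopCluster(cl)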
Another huge advantage of the parallel package is the “apply” type functions that are provided through this unified interface. This is an obvious but powerful way to extend R with parallelism, because any call to an “apply” function with, say, FUN = foo can be split into multiple calls to foo, executed at the same time. The recently released packages ddR and multidplyr extend the functionality provided by the parallel package. They are similar in many ways; the most significant is that both introduce new data types specifically for parallel computing. New functions on these data types are used to “partition” data, describing how work can be split amongst multiple nodes, and to collect the pieces of work and combine them to produce a final result.
ddR also reimplements a lot of base functions on the distributed data types, for example rbind and tail. It is written by HP’s Vertica Analytics group and is designed to work with HP’s distributedR, which provides a platform for distributed computing with R.
Hadley Wickham’s package, multidplyr, also works with distributedR, in addition to snow and parallel. Where multidplyr differs from ddR is that it is written to be used with the dplyr package. The methods provided by dplyr are overloaded to work with the data types provided by multidplyr, furthering Hadley’s ecosystem of R packages.
After a quick play with the two packages, many more differences emerge.
The package multidplyr seems more suited to data wrangling, much like its single-threaded equivalent, dplyr. The partition() function can be given a series of vectors describing how the data should be partitioned, very much like the group_by() function:
# Extract of code that uses the multidplyr package
library(dplyr)
library(multidplyr)
library(nycflights13)

# Split the rows of planes across the cluster workers, then count planes
# by type (each worker summarises its own shard)
planes %>% partition() %>% group_by(type) %>% summarize(n())
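As a hedged sketch of the partitioning-by-variable behaviour described above (this assumes the multidplyr API at the time of writing, where partition() accepted grouping variables directly and collect() pulled the per-worker results back), partitioning flights by destination keeps all rows for a given destination on the same worker:

# Sketch only: partition by a variable, summarise per group, then collect
library(dplyr)
library(multidplyr)
library(nycflights13)

flights %>%
  partition(dest) %>%                                        # shard by destination
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%  # per-destination mean
  collect()                                                  # bring results back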
However, ddR has a very different “flavour”, with a stronger algorithmic focus, as can be seen from the example packages randomForest.ddR, kmeans.ddR and glm.ddR, all implemented with ddR. As the code snippet below shows, certain algorithms such as random forests can be parallelised very naturally. Unlike multidplyr, the partition() function does not give the user control over how the data is split. However, the collect() function provides an index argument, which gives the user control over which workers to collect results from. The list returned by collect() can then be fed into do.call() to aggregate the results, for example using randomForest::combine().
# Skeleton code for implementing a very primitive version of random forests using ddR
library(ddR)
library(randomForest)

# Grow four forests in parallel, one per element of 1:4
multipleRF <- dlapply(1:4, function(n) {
  randomForest::randomForest(Ozone ~ Wind + Temp + Month,
                             data = airquality,
                             na.action = na.omit)
})

# Collect the forests from the workers and combine them into a single model
listRF <- collect(multipleRF)
res <- do.call(randomForest::combine, listRF)
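As a small follow-up (this is a sketch that assumes the index argument of collect() behaves as described above, rather than code taken from the ddR documentation), results can also be fetched from a single worker:

# Sketch only: fetch the forest grown by the first worker
firstRF <- collect(multipleRF, index = 1)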
To summarise, distributed programming in R has been slowly evolving for a long time, but now, in response to the high demand, many tools are being developed to suit the needs of R users who want to run different types of analysis on a cluster. The prominent themes are as follows:
- Parallel programming in R should be high-level.
- Writing parallelised R code should be fast and easy, and not require too much planning.
- Users should still be able to access the same libraries that they usually use.
Of course, some of the packages mentioned in this post are very young. However, due to the need for such tools, they are rapidly maturing, and I look forward to seeing where they go in the very near future.