Going parallel: understanding load balancing in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
With most computers now having many cores, parallelising your R code is a great way to reduce processing time, and there are lots of packages to help. But if you benchmark your new code, you may find it is not as fast as you expected. For example, you might assume that splitting the work across 4 cores might make the code run 4 times faster, but in fact it could be less than twice as fast. Still an improvement, but not great. This happens when each of the tasks has a different run time, and the tasks have been poorly allocated to the different cores. Thus most of your cores sit idle while one core does most of the work.
Load balancing is the idea of evenly distributing tasks to the cores to minimise overall run time. As we will see, it can make a massive difference to your code performance.
Consider this function, which does nothing for x2 seconds. Obviously not very useful, but the runtime of this function varies dramatically for different values of x.
slow_func <- function(x){ Sys.sleep(x ** 2) }
And suppose we have 20 tasks to do.
vals <- rep(1:10, 2)
On a single core this would take 770 seconds (12.8 minutes). Can we speed this up by splitting the task across 4 cores?
The default way that R allocates these tasks out to 4 cores is to break the tasks into 4 batches in the order they are provided. This is done by the internal function parallel:::splitList()
.
The result is that two of the cores get most of the long running tasks while the other two get most of the quick tasks. This is why most of the R packages offer an option to dynamically load balance, although it is usually turned off by default. When dynamic load balancing is used instead of allocating all the task at the beginning R will give each core one task. Subsequent tasks are allocated to cores as they finish tasks. Thus a core with a short task is quickly reallocated new work, while a core with a slow task may only do one or two tasks for the whole run.
This significantly reduces the imbalance between cores, but is also far from optimal. Using load balancing also provides an overall performance hit as the workers must spend more time “talking” to the main R process and transferring data back and forth.
Can we do any better? Well, in our example, we know exactly how long each task will take, so we should be able to pre-allocate the task in a more efficient manner. But even in the real world, you likely have some idea of which tasks will be fast and slow. For example, if you are processing many data frames, the number of rows in each data frame may be proportional to the processing time.
If we do have a measure of task duration, we always want to do the slowest tasks first. In the previous example, some of the longest tasks were the last to be allocated, which makes it difficult of the load balancer to allocate the tasks efficiently. But, a word of warning, sorting the tasks without load balancing can be disastrously slow.
In this case we have given Core 1 all the slow tasks, and Core 4 all the fast tasks. This is about as bad as you can possibly do at load balancing. But if we endable dynamic load balancing, then slowest first works really well. Note that sorting the data changes the order of results, if this matters to your code, you will need to record the original order so that you can “unsort” your data once the parallel process has been completed.
It is also possible to get a similar distribution of tasks without dynamic load balancing using what I call a zig-zag sort.
zigzag_sort <- function(x, n = 4){ x <- x[order(x, decreasing = TRUE)] sortvec <- rep(c(seq(1,n),seq(n, 1)), length = length(x)) sortvec <- order(sortvec) x <- x[sortvec] return(x) }
This function will sort a vector from highest to lowest but zig zagging across the cores in a 1,2,3,4,4,3,2,1 pattern. The n
argument can be changed to reflect the number of cores you have.
A zig-zag sort is almost a good as dynamic load balancing, but in the real world would not have the overheads caused by dynamic load balancing. Depending on the nature of your task the best option may vary.
That’s enough theory; how does this work in practice? I wrote some short code to test parallelising my task using different packages. Each function also gives the option to use zig-zag sorting.
method_lapply <- function(vals, sort){ if(sort){ vals <- zigzag_sort(vals, 4) } r <- lapply(vals, slow_func) r <- unlist(r) return(r) } method_parlapply <- function(vals, sort){ if(sort){ vals <- zigzag_sort(vals, 4) } cl <- parallel::makeCluster(4) r <- parallel::parLapply(cl = cl, vals, slow_func) parallel::stopCluster(cl) r <- unlist(r) return(r) } method_parlapplyLB <- function(vals, sort){ if(sort){ vals <- zigzag_sort(vals, 4) } cl <- parallel::makeCluster(4) r <- parallel::parLapplyLB(cl = cl, vals, slow_func) parallel::stopCluster(cl) r <- unlist(r) return(r) } method_foreach <- function(vals, sort, preschedule){ if(sort){ vals <- zigzag_sort(vals, 4) } opts <- list(preschedule = preschedule) cl <- parallel::makeCluster(4) doSNOW::registerDoSNOW(cl) boot <- foreach::foreach(i = vals, .export = c("slow_func", "zigzag_sort"), .options.snow = opts) r <- foreach::%dopar%(boot, slow_func(i)) parallel::stopCluster(cl) r <- unlist(r) return(r) } method_future_lapply <- function(vals, sort, scheduling){ if(sort){ vals <- zigzag_sort(vals, 4) } future::plan("future::multisession", workers = 4) r <- future.apply::future_lapply(vals, slow_func, future.scheduling = scheduling, future.chunk.size = NULL) future::plan("sequential") r <- unlist(r) return(r) }
Results
As you can see from the table below, all the parallel options are faster than using a single core, but the choice of method does matter. Using future_lapply
is 54% faster than a simple parlappy
, and future_lapply
with zig-zag sorting and dynamic load balancing is pretty close to the optimal possible runtime of 3.2 minutes.
It is also clear that the base R functions parlappy
and parlappyLB
do not do as well as some of the newer packages. Zig-Zag sorting always improved performance, although the difference is minor when dynamic load balancing is also used.
Function | Zig-Zag Sorted | Dynamic Load Balance | Run Time (min) | Relative Runtime | |
lapply | No | No | 12.99 | 3.884 | Reference Singe Core Run |
parlappy | No | No | 7.21 | 2.156 | |
parlappy | Yes | No | 5.58 | 1.667 | |
parlappyLB | No | Yes | 5.55 | 1.661 | |
future_lapply | No | No | 5.53 | 1.663 | |
parlappyLB | Yes | Yes | 5.32 | 1.589 | |
future_lapply | Yes | No | 4.06 | 1.221 | |
foreach | No | No | 3.51 | 1.048 | |
foreach | Yes | No | 3.50 | 1.047 | |
future_lapply | No | Yes | 3.49 | 1.005 | |
foreach | No | Yes | 3.38 | 1.012 | |
foreach | Yes | Yes | 3.35 | 1.002 | |
future_lapply | Yes | Yes | 3.32 | 1 |
For your code, you will need to benchmark different load balacing options to see which works best for your use case. But hopefully this blog has at least shown how important it is to get load balancing right.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.