Are parallel computations worth it ?

Posted on May 31, 2013 by arthur charpentier in R bloggers | 0 Comments

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Yesterday, Daniel Marcelino published an interesting post on his blog, untitled Parallel Processing: When does it worth ? I was asking myself the same question for a chapter I am currently writing. And I did like his approach, so I tried, on my computer to do the same. I did use three packages to run parallel R codes,

> library(multicore)
> library(snow)
> library(snowfall)

and one to quantify time to run the code

> library(microbenchmark)

I ran the code on my mac, at the office,

> all=detectCores(all.tests=TRUE)
> all
[1] 4

which is a standard computer, with four cores. To run some codes, I had to generate datasets. Here, I consider a data frame, with $http://latex.codecogs.com/gif.latex?n$ rows, and 100 columns. I generate values using a Gaussian distribution,

> gen=function(n) data.frame(matrix(rnorm(n*100),n,100))

The goal, here, will be to compute quantiles (or to be more specific quartiles) per column, and to replicate that 100 times. Here, the standard technique is to use lapply. But two (at least) parallel version of the function can be found. So, let us use it

> base=gen(n=100)
> microbenchmark(
+ mlapp=data.frame(lapply(base, quantile, probs = 1:3/4 )),
+ mclapp=data.frame(mclapply(base, quantile, probs = 1:3/4 , mc.cores = all)),
+ sflapp=data.frame(sfLapply(base, quantile, probs = 1:3/4 )),
+ times=100) -> m

For instance, with 100 rows, we have

> m
Unit: milliseconds
    expr      min       lq   median       uq       max
1 mclapp 50.19290 55.90364 57.99185 64.10619 266.88692
2  mlapp 26.94146 29.49396 31.20571 49.54824  75.60251
3 sflapp 27.54857 30.10224 31.41864 47.10688  59.28925

And with 500,000 rows, we have

> m
Unit: seconds
    expr       min         lq     median        uq      max
1 mclapp 42.999504 103.873919 161.989876 258.66887 660.2953
2  mlapp  3.720542   3.770319   4.070116  11.90181 166.9461
3 sflapp  3.587703   3.770399   4.027876  10.62654 181.0093

So yes, using parallel code would be very interesting ! Especially with very large datasets (I could not run it with 1 million rows). If we consider a loop, to see the evolution of the median time, for each of those three function, we can plot the time it took, as a function of the number of rows,

> i=1;vk=seq(1,6,by=.2)
> col=seq(i,3*2,by=3)
> plot(10^vk,db[2,col],ylim=range(db),col="white",log="x",
+     xlab="Number of rows",ylab="Time")
+ polygon(c(10^vk,rev(10^vk)),c(db[1,col],rev(db[3,col])),col="light blue",border=NA)
+ lines(10^vk,db[2,col],col="blue",lwd=2)

Here, we have the following, with the standard lapply on the left (the line if the median time, with quartiles, 25% and 75%), the multicore function in the middle, and the snowfall function, on the right,

If we zoom in, for small datasets (less than 10,000 rows and 100 columns), we do observe a gain, since the code ran two times faster

So clearly, it might be interesting to write codes to distribute on different cores. But here, I use a simple function (I compute quantiles on columns of a dataset). I should try with a more complex function…

On the other hand, I should mention that, usually, while I have have one (or two) codes running, I can do something else : seeking for recent papers for ongoing research projects, answer to emails that I should have answered a few weeks ago, checking for typos in the book and update the tex file, or type parts of a future posts on my blog, etc. The problem I got yesterday afternoon, when I ran the code, was that suddenly, all the cores on my computer were dedicated to that R code. I could not even finish an email I started before running the code… So finally I left earlier, decided to pick up the kids after school, and went to the park, to enjoy the sunny day we had ! So I have to admit that running parallel codes can have advantages you could not think of !

Arthur Charpentier

Arthur Charpentier, professor in Montréal, in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.