Are parallel computations worth it?
Yesterday, Daniel Marcelino published an interesting post on his blog, entitled Parallel Processing: When does it worth? I was asking myself the same question for a chapter I am currently writing, and I liked his approach, so I tried to do the same on my computer. I used three packages to run parallel R code,
> library(multicore)
> library(snow)
> library(snowfall)
and one to measure how long the code takes to run,
> library(microbenchmark)
I ran the code on my Mac, at the office,
> all=detectCores(all.tests=TRUE)
> all
[1] 4
which is a standard computer, with four cores. To run the code, I had to generate datasets. Here, I consider a data frame with n rows and 100 columns. I generate values using a Gaussian distribution,
> gen=function(n) data.frame(matrix(rnorm(n*100),n,100))
The goal here is to compute quantiles (more specifically, quartiles) per column, and to replicate that 100 times. The standard technique is to use lapply, but at least two parallel versions of that function can be found: mclapply, from multicore, and sfLapply, from snowfall. Let us compare the three.
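Before timing anything, it may help to see what the sequential call returns. The following is only a small sketch: it reuses the gen function defined above, and the seed is an assumption added for reproducibility, not part of the original experiment.

```r
# Small illustrative data frame: 100 rows, 100 Gaussian columns
set.seed(1)  # assumption: seed added only so the sketch is reproducible
gen <- function(n) data.frame(matrix(rnorm(n * 100), n, 100))
base <- gen(n = 100)

# lapply() treats the data frame as a list of columns and returns one
# length-3 vector of quartiles per column; data.frame() binds them together
q <- data.frame(lapply(base, quantile, probs = 1:3 / 4))

dim(q)       # 3 rows (one per quartile) by 100 columns
rownames(q)  # "25%" "50%" "75%"
```

So each benchmarked expression produces a 3 by 100 data frame, with one column of quartiles per original column.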
> base=gen(n=100)
> microbenchmark(
+ mlapp=data.frame(lapply(base, quantile, probs = 1:3/4 )),
+ mclapp=data.frame(mclapply(base, quantile, probs = 1:3/4 , mc.cores = all)),
+ sflapp=data.frame(sfLapply(base, quantile, probs = 1:3/4 )),
+ times=100) -> m
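One thing the benchmark above does not show is how the workers are started: snowfall normally expects a call such as sfInit(parallel = TRUE, cpus = ...) before sfLapply, and that call does not appear in the post. As a hedged illustration of the two mechanisms involved (forking, and an explicit cluster of worker processes), here is a sketch that uses only the parallel package shipped with R, which builds on multicore and snow. It is a stand-in, not the code that was benchmarked; the choice of two workers and of 1,000 rows is arbitrary.

```r
library(parallel)  # ships with R; provides mclapply(), makeCluster(), parLapply()

gen <- function(n) data.frame(matrix(rnorm(n * 100), n, 100))
base <- gen(n = 1000)

# Forking is not available on Windows, where mclapply() only accepts mc.cores = 1
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Fork-based version, in the style of multicore
q_fork <- data.frame(mclapply(base, quantile, probs = 1:3 / 4, mc.cores = n_cores))

# Cluster-based version, in the style of snow: workers are started
# explicitly and must be stopped afterwards
cl <- makeCluster(2)
q_sock <- data.frame(parLapply(cl, base, quantile, probs = 1:3 / 4))
stopCluster(cl)

# Both parallel versions should agree with the sequential result
q_seq <- data.frame(lapply(base, quantile, probs = 1:3 / 4))
all.equal(q_seq, q_fork)
all.equal(q_seq, q_sock)
```

Checking that the three results agree before benchmarking is a cheap way to make sure the timings compare like with like.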
For instance, with 100 rows, we have
> m
Unit: milliseconds
    expr      min       lq   median       uq       max
1 mclapp 50.19290 55.90364 57.99185 64.10619 266.88692
2  mlapp 26.94146 29.49396 31.20571 49.54824  75.60251
3 sflapp 27.54857 30.10224 31.41864 47.10688  59.28925
And with 500,000 rows, we have
> m
Unit: seconds
    expr       min         lq     median        uq      max
1 mclapp 42.999504 103.873919 161.989876 258.66887 660.2953
2  mlapp  3.720542   3.770319   4.070116  11.90181 166.9461
3 sflapp  3.587703   3.770399   4.027876  10.62654 181.0093
So yes, using parallel code could be very interesting! Especially with very large datasets (I could not run it with 1 million rows). If we run this in a loop, to see how the median time evolves for each of those three functions, we can plot the time it took as a function of the number of rows,
> i=1;vk=seq(1,6,by=.2)
> col=seq(i,3*length(vk),by=3)
> plot(10^vk,db[2,col],ylim=range(db),col="white",log="x",
+ xlab="Number of rows",ylab="Time")
> polygon(c(10^vk,rev(10^vk)),c(db[1,col],rev(db[3,col])),col="light blue",border=NA)
> lines(10^vk,db[2,col],col="blue",lwd=2)
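The plotting code relies on a matrix db that the post does not show. From the way it is indexed, db seems to hold the three quartiles of the running time in its rows, with the three functions interleaved across columns for each dataset size. The sketch below is one possible way to build such a matrix, under several assumptions: time_quartiles and funs are hypothetical names, base system.time is swapped in for microbenchmark, the parallel package stands in for multicore and snowfall, and the grid of sizes is much smaller than in the post so that it runs in a few seconds.

```r
library(parallel)  # assumption: stands in for multicore / snowfall

gen <- function(n) data.frame(matrix(rnorm(n * 100), n, 100))
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
cl <- makeCluster(2)

# The three competitors, in the order used for the columns of db
funs <- list(
  lapply    = function(b) data.frame(lapply(b, quantile, probs = 1:3 / 4)),
  mclapply  = function(b) data.frame(mclapply(b, quantile, probs = 1:3 / 4,
                                              mc.cores = n_cores)),
  parLapply = function(b) data.frame(parLapply(cl, b, quantile, probs = 1:3 / 4))
)

# Hypothetical helper: quartiles of the elapsed time over `reps` runs of f(base)
time_quartiles <- function(f, base, reps = 5) {
  elapsed <- replicate(reps, system.time(f(base))[["elapsed"]])
  quantile(elapsed, probs = 1:3 / 4)
}

# Reduced grid of sizes (10, 100 and 1000 rows) to keep the sketch fast
vk <- seq(1, 3, by = 1)
db <- do.call(cbind, lapply(vk, function(k) {
  base <- gen(n = round(10^k))
  sapply(funs, time_quartiles, base = base)  # 3 quartiles x 3 functions
}))
stopCluster(cl)

dim(db)  # 3 rows by 3 * length(vk) columns
```

With this layout, the columns for the i-th function are seq(i, 3*length(vk), by=3), which is the indexing used in the plotting code.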
The resulting plots show the standard lapply on the left (the line is the median time, and the band is delimited by the 25% and 75% quartiles), the multicore function in the middle, and the snowfall function on the right.
If we zoom in on small datasets (fewer than 10,000 rows and 100 columns), we do observe a gain, since the code ran twice as fast.
So clearly, it might be worth writing code that distributes the work over several cores. But here, I use a simple function (I compute quantiles on the columns of a dataset). I should try with a more complex function…
On the other hand, I should mention that, usually, while I have one (or two) pieces of code running, I can do something else: look for recent papers for ongoing research projects, answer emails that I should have answered a few weeks ago, check for typos in the book and update the tex file, type parts of a future post for my blog, etc. The problem I had yesterday afternoon, when I ran the code, was that suddenly all the cores on my computer were dedicated to that R code. I could not even finish an email I had started before running the code… So finally I left early, decided to pick up the kids after school, and went to the park to enjoy the sunny day we had! So I have to admit that running parallel code can have advantages you would not think of!