Site icon R-bloggers

plyr and reshape: better, faster, more productive

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hadley Wickham has just released updates to his data-manipulation packages for Rplyr and reshape (now called reshape2), that are much faster and more memory-efficient than the previous incarnations. The reshape2 package lets you flexibly restructure and aggregate data using just three functions (melt, acast and dcast), whereas the plyr package is kind of like a supercharged SQL "GROUP BY" statement for R data frames.

One of the most interesting aspects of this update is that plyr can now parallelize its operations and make use of multiple processors simultaneously to speed up really big data-munging jobs. It makes use of Revolution's contributed foreach package, so whatever platform you're on (Windows, Linux, or Mac) you can specify a suitable parallel backend and take advantage of significant speedups on multiprocessor machines.

For example, on a 2-core Windows box can use the doSMP package from Revolution R to speed up a plyr call as follows:

require(doSMP)
workers <- startWorkers(2) # My computer has 2 cores
registerDoSMP(workers)

llply(my_data, aggr_function, .parallel=TRUE)

On Unix-like platforms (including Linux and Mac) you can use the doMC package for similar ends. Find more information about plyr at Hadley's website, below.

Had.co.nz: plyr 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.