The performance of dplyr blows plyr out of the water
Together with many other packages written by Hadley Wickham, plyr is a package that I use a lot for data processing. The syntax is clean, and it works great for breaking down larger data.frames into smaller summaries. The greatest disadvantage of plyr is its performance. On StackOverflow, the usual answer is that you want plyr for the syntax, but that for real performance you need data.table.
Recently, Hadley released the successor to plyr: dplyr. dplyr provides the kind of performance you would expect from data.table, but with a syntax that leans closer to plyr. The following example illustrates this performance difference:
    library(plyr)
    library(dplyr)

    size = 10e6
    no_levels = 25
    dat = data.frame(num = runif(size),
                     factor1 = rep(LETTERS[1:no_levels], each = size / no_levels),
                     factor2 = rep(LETTERS[1:no_levels], size / no_levels))

    # plyr solution
    system.time(summary_ddply <- ddply(dat, .(factor1, factor2), summarise, mn = mean(num)))
    #    user  system elapsed
    #   2.829   0.900   3.748

    # dplyr solution
    data_per_factor = group_by(dat, factor1, factor2)
    system.time(summary_dplyr <- summarise(data_per_factor, mn = mean(num)))
    #    user  system elapsed
    #   0.097   0.000   0.098
In this case, dplyr is about 38x faster (3.748 s versus 0.098 s elapsed; note that the group_by call is not included in the dplyr timing). However, some log file processing I did recently was sped up by a factor of 1000. dplyr is an exciting new development that promises to be the single most influential new package since ggplot2.
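Since the dplyr timing in the example covers only the summarise step, it can be useful to time grouping and summarising together and to confirm that the result matches an independent calculation. What follows is a minimal sketch of one way to do that, not a benchmark: the smaller `size`, the use of the `%>%` pipe, and the base R `aggregate` cross-check are my assumptions for illustration, and no timings are claimed.

```r
library(dplyr)

# Assumption: a smaller data set than the 10e6 rows used in the post,
# so that the sketch runs quickly on any machine.
size <- 1e5
no_levels <- 25
dat <- data.frame(num = runif(size),
                  factor1 = rep(LETTERS[1:no_levels], each = size / no_levels),
                  factor2 = rep(LETTERS[1:no_levels], size / no_levels))

# Time group_by and summarise as one unit
timing <- system.time(
  summary_dplyr <- dat %>%
    group_by(factor1, factor2) %>%
    summarise(mn = mean(num)) %>%
    ungroup()
)

# Independent reference: the same group means computed with base R
summary_base <- aggregate(num ~ factor1 + factor2, data = dat, FUN = mean)

# Align both results on the grouping columns and compare the means
merged <- merge(as.data.frame(summary_dplyr), summary_base,
                by = c("factor1", "factor2"))
results_match <- isTRUE(all.equal(merged$mn, merged$num))
```

Inspecting `timing["elapsed"]` gives the combined cost of both steps, and `results_match` indicates whether the two approaches agree on every group mean.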