Timing Grouped Mean Calculation in R


This note is a comment on some of the timings shared in the dplyr-0.8.0 pre-release announcement.

The original published timings were as follows:

With performance metrics, measurements are marketing. So let’s dig into the above a bit.

These timings are of the small-task, large-number-of-repetitions breed that Matt Dowle writes against, so at first they wouldn’t seem that decisive. Except, look at the following:

Let’s try to reproduce these timings on a 2018 Dell XPS 13 (Intel Core i5, 16 GB RAM) running Ubuntu 18.04, and also compare to some other packages: data.table and rqdatatable. A sketch of the sort of benchmark involved is shown below.
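To make the comparison concrete, here is a minimal sketch of the kind of grouped-mean benchmark being reproduced. The column names, data sizes, and use of microbenchmark are my assumptions for illustration, not necessarily the exact code behind the published timings (that code is linked at the end of the article).

```r
# Minimal grouped-mean benchmark sketch (sizes and column names are assumptions).
library(dplyr)
library(data.table)
library(rqdatatable)   # brings in rquery
library(wrapr)         # for the %.>% dot pipe
library(microbenchmark)

n_rows <- 100000
n_groups <- 10000

d <- data.frame(
  g = sample(seq_len(n_groups), size = n_rows, replace = TRUE),
  x = rnorm(n_rows)
)
dt <- as.data.table(d)

timings <- microbenchmark(
  dplyr       = d %>% group_by(g) %>% summarize(x_mean = mean(x)),
  data.table  = dt[, .(x_mean = mean(x)), by = g],
  # one way to express the grouped mean as an rquery/rqdatatable pipeline
  rqdatatable = d %.>% project(., groupby = "g", x_mean := mean(x)),
  base_tapply = tapply(d$x, d$g, mean),
  times = 10L
)

print(timings)
```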

In this reproduction attempt we see:

However, Matt Dowle is also right: comparing at this scale doesn’t tell half the story we see when we try to summarize 10,000,000 rows down to 1,000,000 rows. At that scale data.table still takes under half a second (a time not really worth arguing over), yet dplyr takes 10 to 24 seconds. Nor does the small-scale comparison reveal that dplyr is no faster than base::tapply() (despite many claims to the contrary).
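A rough sketch of that larger-scale comparison, again under my own assumptions about column names and group structure (roughly 10,000,000 rows summarized down to about 1,000,000 groups), might look like the following; absolute times will of course vary by machine.

```r
# Larger-scale grouped-mean comparison sketch (sizes are assumptions).
library(dplyr)
library(data.table)

n_rows <- 10000000
n_groups <- 1000000

d <- data.frame(
  g = sample(seq_len(n_groups), size = n_rows, replace = TRUE),
  x = rnorm(n_rows)
)
dt <- as.data.table(d)

# data.table grouped mean
system.time(res_dt <- dt[, .(x_mean = mean(x)), by = g])

# dplyr grouped mean
system.time(res_dplyr <- d %>% group_by(g) %>% summarize(x_mean = mean(x)))

# base R grouped mean via tapply
system.time(res_tapply <- tapply(d$x, d$g, mean))
```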

All code for this benchmark is available here and here.
