The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.
The graph uses a log-log scale, so differences look very compressed. Even so, data.table is routinely 7 times faster than dplyr. The ratio of run times is shown below.
Notice that on the above semi-log plot the run time ratio grows roughly linearly. This makes sense: data.table uses a radix sort, which can run in near-linear time on a range of problems, faster than the n log(n) lower bound known for comparison sorting. (Also, we are only showing example sorting times, not worst-case sorting times.)
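To make the radix idea concrete, here is a toy least-significant-digit radix sort in base R. It is only a sketch: data.table's actual sort is implemented in C and is far more sophisticated, and the name radix_sort and its bits parameter are made up for this illustration. The point is that it never compares two keys against each other. It makes one stable bucketing pass per digit, so for keys of bounded width the work grows roughly linearly with the number of rows.

```r
# Toy LSD radix sort for non-negative whole numbers (illustrative only;
# this is NOT how data.table implements its sort).
radix_sort <- function(x, bits = 8L) {
  stopifnot(is.numeric(x), all(x >= 0), all(x == floor(x)))
  if (length(x) < 2L) return(x)
  base <- 2^bits   # number of buckets per pass (256 by default)
  shift <- 1       # place value of the digit currently being processed
  while (max(x) %/% shift > 0) {
    digit <- (x %/% shift) %% base   # current digit of every key
    # split() keeps the existing order inside each bucket, so the pass is stable
    buckets <- split(x, factor(digit, levels = 0:(base - 1)))
    x <- unlist(buckets, use.names = FALSE)
    shift <- shift * base
  }
  x
}

set.seed(1)
keys <- sample.int(1e6, 20, replace = TRUE)
print(radix_sort(keys))
```

The number of passes depends only on how many digits the largest key has, not on how many rows there are, which is where the near-linear behavior comes from.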
In fact, if we divide the y values in the above graph by log(rows), we get something approaching a constant.
The above is consistent with data.table not only being faster than dplyr, but also having a fundamentally different asymptotic running time.
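The reasoning can be checked with a small model. The sketch below assumes one sort costs about c1 * n * log(n) and the other about c2 * n. The constants are invented for illustration and are not fitted to the timings in this post. Under those assumptions the run time ratio is (c1 / c2) * log(n), a straight line on a semi-log plot, and dividing it by log(n) leaves the constant c1 / c2.

```r
# Hypothetical cost models; the constants are made up, not measured.
rows <- 10^(2:8)
c_comparison <- 7e-8   # assumed seconds per n * log(n) unit of work
c_radix      <- 1e-7   # assumed seconds per row

t_comparison <- c_comparison * rows * log(rows)  # n log(n) style sort
t_radix      <- c_radix * rows                   # linear-time style sort

ratio      <- t_comparison / t_radix   # grows linearly in log(rows)
normalized <- ratio / log(rows)        # flat: equals c_comparison / c_radix

print(data.frame(rows, ratio, normalized))
```

Real timings will be noisier than this, since caching, memory allocation, and fixed overheads all add terms the model ignores.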
Performance like the above is one of the reasons you should strongly consider data.table for your R projects.
All details of the timings can be found here.
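For a quick feel of the measurement without following the link, a minimal timing sketch might look like the following. This is not the benchmark used for the graphs above: the helper name time_sorts, the synthetic data, and the problem sizes are all assumptions, and a single system.time() call per size is much cruder than a proper benchmark. It assumes the data.table and dplyr packages are installed.

```r
library(data.table)
library(dplyr)

# Hypothetical helper: time one sort of n synthetic rows with each package.
time_sorts <- function(n, seed = 2024) {
  set.seed(seed)
  d <- data.frame(x = runif(n), y = seq_len(n))  # synthetic data, not the post's

  # dplyr: arrange() returns a sorted copy of the data frame
  t_dplyr <- system.time(res_dplyr <- arrange(d, x))[["elapsed"]]

  # data.table: setorder() sorts by reference, so work on a converted copy
  dt <- as.data.table(d)
  t_dt <- system.time(setorder(dt, x))[["elapsed"]]

  list(n = n, dplyr = t_dplyr, data.table = t_dt,
       same_values = identical(res_dplyr$x, dt$x))  # both produced the same sorted column
}

sizes <- 10^(4:6)
timings <- do.call(rbind, lapply(sizes, function(n) as.data.frame(time_sorts(n))))
# Very small timings can round to zero, which makes the ratio unreliable.
timings$ratio <- timings$dplyr / timings$data.table
print(timings)
```

Repeating each measurement several times and taking the median would give steadier numbers, especially at the smaller sizes.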