rqdatatable: rquery Powered by data.table
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
rquery
is an R
package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL
and Apache Spark
. rqdatatable
is a new package that supplies a screaming fast implementation of the rquery
system in-memory using the data.table
package.
rquery
is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery
is also one of the fastest methods to wrangle data in-memory in R
(thanks to data.table
, via a thin adaption supplied by rqdatatable
).
Teaching rquery
and fully benchmarking it is a big task, so in this note we will limit ourselves to a single example and benchmark. Our intent is to use this example to promote rquery
and rqdatatable
, but frankly the biggest result of the benchmarking is how far out of the pack data.table
itself stands at small through large problem sizes. This is already known, but it is a much larger difference and at more scales than the typical non-data.table
user may be aware of.
The R
package development candidate rquery
0.5.0 incorporates a number of fixes and improvements. One interesting new feature is the DBI
package is now suggested or optional, instead of required. This means rquery
is ready to talk to non-DBI
big data systems such as SparkR
(example here) and it let us recruit a very exciting new rquery
service provider: data.table
!
data.table
is, by far, the fastest way to wrangle data at scale in-memory in R
. Our experience is that it starts to outperform base R
internals and all other packages at moderate data sizes such as mere tens or hundreds of rows. Of course data.table
is most famous for its performance in the millions of rows and gigabytes of data range.
However, because of the different coding styles there are not as many comparative benchmarks as one would like. So performance is often discussed as anecdotes or rumors. As a small step we are going to supply a single benchmark based on our “score a logistic regression by hand” problem from “Let’s Have Some Sympathy For The Part-time R User” (what each coding solution looks like can be found here).
In this note we compare idiomatic solutions to the example problem using: rquery
, data.table
, base R
(using stats::aggregate()
), and dplyr
. dplyr
is included due to its relevance and popularity. Full details of the benchmarking can be found here and full results here. One can always do more benchmarking and control for more in experiments. One learns more from a diversity of benchmarks than from critiquing any one benchmark, so we will work this example briefly and provide links to a few others benchmarks. Our measurements confirm the common (correct) observation and conclusion: that data.table
is very fast. Our primary new observation is that the overhead from the new rqdatatable
adapter is not too large and rqdatatable
is issuing reasonable data.table
commands.
Both the rquery
and dplyr
solutions can be run in multiple modalities: allowing the exact same code to be used in memory or on a remote big data system (a great feature, critical for low-latency rehearsal and debugging). These two systems can be run as follows.
-
rquery
is a system for describing operator trees. It deliberately does not implement the data operators, but depends on external systems for implementations. Previously any sufficiently standardSQL92
database that wasR
DBI
compliant could serve as a back-end or implementation. This already includes the industrial scale databasePostgreSQL
and the big data systemApache Spark
(via theSparklyR
package). The 0.5.0 development version ofrquery
relaxes theDBI
requirement (allowingrquery
to be used directly withSparkR
) and admits the possibility of non-SQL
based implementations. We have a newdata.table
based implementation in development as therqdatatable
package. -
dplyr
also allows multiple implementations (in-memory,DBI
SQL
, ordata.table
). We tested all three, and thedplyr
pipeline worked identically in-memory and withPostgreSQL
. However, thedtplyr
pipeline did not generate validdata.table
commands, due to an issue with window functions or ranking, so we were not able to time it usingdata.table
.
We are thus set up to compare to following solutions to the logistic scoring problem:
- A direct
data.table
solution running in memory. - The base
R
stats::aggregate()
solution working on in-memory data.frames. - The
rquery
solution using thedata.table
service providerrqdatatable
to run in memory. - The
rquery
solution sending data toPostgreSQL
, performing the work in database, and then pulling results back to memory. - The
dplyr
solution working directly on in-memorydata.frame
s. - The
dplyr
solution working directly on in-memorytibble:tbl
s (we are not counting any time for conversion). - The
dplyr
solution sending data toPostgreSQL
, performing the work in database, and then pulling results back to memory.
Running a 1,000,000 row by 13 column example can be summarized with the following graph.
The vertical dashed line is the median time that repeated runs of the base R
stats::aggregate()
solution took. We can consider results to the left of it as “fast” and results to the right of it as “slow.” Or in physical terms: data.table
and rquery
using data.table
each take about 1.5 seconds on average. Whereas dplyr
takes over 20 seconds on average. These two durations represent vastly different user experiences when attempting interactive analyses.
We have run some more tests to try to see how this is a function of problem scale (varying the number of rows of the data). Due to the large range (2 to 10,000,000 rows) we are using log scales, but they unfortunately are just not as readable as the linear scales.
What we can read off this graph includes:
data.table
is always the fastest system (or at worst indistinguishable from the fastest system) for this example,
at the scales of problems tested, and for this configuration and hardware.- The
data.table
backed version ofrquery
becomes comparable to nativedata.table
itself at around 100,000 rows. This is evidence the translation overhead is not too bad for this example and that the sequence
ofdata.table
commands issued byrqdatatable
are fairly good practicedata.table
. - The database backed version of
rquery
starts to outperformdplyr
at around 10,000 rows. Note: all database measurements include the overhead of moving the data to the database and then moving the results back toR
. This is slower than how one would normally use a database in production: with data starting and ending on the database and no data motion betweenR
and the database. dplyr
appears to be slower than the baseR
stats::aggregate()
solution at all measured scales (it is always above the shaded region).- It is hard to read, but changes in heights are ratios of runtimes. For example the
data.table
based solutions are routinely over 10 times faster that thedplyr
solutions once we get to 100,000 rows or more. This is an object size of only about 10 megabytes and is well below usual “usedata.table
once you are in the gigabytes range” advice.
Of course benchmarks depend on the example problems, versions, and machines- so results will vary. That being said, large differences often have a good chance of being preserved across variations of tests (and we share another grouped example here, and a join example here; for the join example dplyr
is faster at smaller problem sizes- so results do depend on task and scale).
We are hoping to submit the rquery
update to CRAN
in August and then submit rqdatatable
as a new CRAN
package soon after. Until then you can try both packages by a simple application of:
devtools::install_github("WinVector/rqdatatable")
These are new packages, but we think they can already save substantial development time, documentation time, debugging time, and machine investment in “R and big data” projects. Our group (Win-Vector LLC) is offering private training in rquery
to get teams up to speed quickly.
Note: rqdatatable
is an implementation of rquery
supplied by data.table
, not a data.table
scripting tool (as rqdatatable
does not support important data.table
features not found in rquery
, such as rolling joins).
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.