
rquery: Fast Data Manipulation in R

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

Win-Vector LLC recently announced the rquery R package, an operator-based query generator.

In this note I want to share some exciting and favorable initial rquery benchmark timings.

Let’s take a look at rquery’s new “ad hoc” mode (made convenient through wrapr’s new “wrapr_applicable” feature). In this mode rquery works on in-memory data.frame data by sending it to a database, processing it there, and then pulling the result back. We concede this is a strange way to process data, and not rquery’s primary purpose (which is generating safe, high-performance SQL for big-data engines such as Spark and PostgreSQL). However, our experiments show that it is in fact a competitive technique.
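To make that round trip concrete, here is a minimal sketch of the pattern written directly against DBI and an in-memory SQLite database (illustrative only: the table and query are made up, and rquery automates these steps rather than asking the user to write them):

library(DBI)

db <- dbConnect(RSQLite::SQLite(), ":memory:")
d <- data.frame(x = 1:5, y = letters[1:5])                # in-memory data.frame

dbWriteTable(db, "d", d)                                  # 1. send the data to the database
res <- dbGetQuery(db, "SELECT x, y FROM d WHERE x > 2")   # 2. process on the database
dbDisconnect(db)                                          # 3. 'res' is pulled back into memory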

We’ve summarized the results of several experiments (experiment details here) in the following graph (graphing code here). The benchmark task was hand-implementing logistic regression scoring, an example query we have been using for some time.

The graph above shows the distribution of repeated run times for each of the methods tested.

As is detailed in the experiment backing materials, the task is processing 40,000 records through either of the following non-trivial pipelines (the first for dplyr, the second for rquery; a small usage sketch follows the definitions):

dplyr_pipeline <- . %>%
  group_by(subjectID) %>%
    mutate(probability =
             exp(assessmentTotal * scale)/
             sum(exp(assessmentTotal * scale), na.rm = TRUE)) %>%
    arrange(probability, surveyCategory) %>%
    filter(row_number() == n()) %>%
    ungroup() %>%
    rename(diagnosis = surveyCategory) %>%
    select(subjectID, diagnosis, probability) %>%
    arrange(subjectID)


rquery_pipeline <- . := {
    extend_nse(.,
             probability :=
               exp(assessmentTotal * scale)/
               sum(exp(assessmentTotal * scale)),
             count := count(1),
             partitionby = 'subjectID') %.>%
    extend_nse(.,
               rank := rank(),
               partitionby = 'subjectID',
               orderby = c('probability', 'surveyCategory'))  %.>%
    rename_columns(., 'diagnosis' := 'surveyCategory') %.>%
    select_rows_nse(., rank == count) %.>%
    select_columns(., c('subjectID', 
                        'diagnosis', 
                        'probability')) %.>%
    orderby(., 'subjectID') 
}
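Both pipelines are functions of their input, so they can be applied directly to an in-memory data.frame. The data and scale constant below are made-up illustrations (the column names are taken from the pipelines above), and the rquery call additionally assumes a database handle has been configured for ad hoc mode as described in the rquery documentation:

library(dplyr)
library(rquery)
library(wrapr)

scale <- 0.237   # assumed scoring constant for this illustration

# Hypothetical example data with the columns the pipelines expect.
d <- data.frame(
  subjectID       = c(1, 1, 2, 2),
  surveyCategory  = c("withdrawal behavior", "positive re-framing",
                      "withdrawal behavior", "positive re-framing"),
  assessmentTotal = c(5, 2, 3, 4),
  stringsAsFactors = FALSE
)

d %>% dplyr_pipeline     # dplyr: the magrittr functional sequence runs in memory

d %.>% rquery_pipeline   # rquery "ad hoc" mode: work is routed through the database handle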

The primary purpose of rquery is pure in-database processing; in-memory processing is only a convenience. So let’s look at database-only timings. In these tasks the data starts in the database (as is typical for large data projects) and is either consumed (by a row count, called “count”) or saved as a database table (called “land”). In neither case does data move in or out of the database (skipping those overheads).
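As a rough sketch of that in-database pattern (hedged: this uses current rquery helper names such as rq_copy_to, materialize, and to_sql, which may differ from what was available when this post was written; db is an assumed DBI connection to a supported database, and d and rquery_pipeline are the example data and pipeline from above):

library(rquery)

# 'db' is assumed to be a DBI connection to a database rquery supports
# (e.g. PostgreSQL or Spark; SQLite needs window-function support).
d_remote <- rq_copy_to(db, "d", d, temporary = TRUE, overwrite = TRUE)

# Build the operator tree over the remote table description.
optree <- rquery_pipeline(d_remote)

# "land": materialize the result as a database table; no rows leave the database.
materialize(db, optree, table_name = "result")

# "count": consume the result as a single row count, again without moving the data.
DBI::dbGetQuery(db, paste0("SELECT COUNT(1) FROM ( ", to_sql(optree, db), " ) q"))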

And, as we demonstrated earlier, rquery’s query generator has additional features (not shown here) that can yield further speedup in production environments (and also increase query safety and query power).

I think results this good, this early in the project, are very promising. They are the product of design decisions that I think will continue to pay off, and of some intentional scope restrictions (including that rquery targets only SQL engines, and initially only a few of them: Spark, PostgreSQL, and, with some caveats, SQLite). More work needs to be done to add all the SQL translation features needed for general production use, but we think we have demonstrated a good foundation to build on.
