
rquery: Fast Data Manipulation in R

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

Win-Vector LLC recently announced the rquery R package, an operator-based query generator.

In this note I want to share some exciting and favorable initial rquery benchmark timings.

Let’s take a look at rquery’s new “ad hoc” mode (made convenient through wrapr’s new “wrapr_applicable” feature). In this mode rquery works on in-memory data.frame data by sending it to a database, processing it there, and then pulling the result back. We concede this is a strange way to process data, and not rquery’s primary purpose (which is generating safe, high-performance SQL for big-data engines such as Spark and PostgreSQL). However, our experiments show that it is in fact a competitive technique.
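To make that round trip concrete, here is a minimal sketch of the pattern written directly against DBI and an in-memory SQLite database (illustrative only: the table and query are made up, and rquery automates these steps rather than asking the user to write them):

library(DBI)

db <- dbConnect(RSQLite::SQLite(), ":memory:")
d <- data.frame(x = 1:5, y = letters[1:5])                # in-memory data.frame

dbWriteTable(db, "d", d)                                  # 1. send the data to the database
res <- dbGetQuery(db, "SELECT x, y FROM d WHERE x > 2")   # 2. process on the database
dbDisconnect(db)                                          # 3. 'res' is pulled back into memory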

We’ve summarized the results of several experiments (experiment details here) in the following graph (graphing code here). The benchmark task was hand-implementing logistic regression scoring, an example query we have been using for some time.

The graph above shows the distribution of repeated run times for each of the methods tested.

As is detailed in the experiment backing materials, the task is processing 40,000 records through either of the following non-trivial pipelines (the first for dplyr, the second for rquery; a small usage sketch follows the definitions):

dplyr_pipeline <- . %>%
  group_by(subjectID) %>%
    mutate(probability =
             exp(assessmentTotal * scale)/
             sum(exp(assessmentTotal * scale), na.rm = TRUE)) %>%
    arrange(probability, surveyCategory) %>%
    filter(row_number() == n()) %>%
    ungroup() %>%
    rename(diagnosis = surveyCategory) %>%
    select(subjectID, diagnosis, probability) %>%
    arrange(subjectID)


rquery_pipeline <- . := {
    extend_nse(.,
             probability :=
               exp(assessmentTotal * scale)/
               sum(exp(assessmentTotal * scale)),
             count := count(1),
             partitionby = 'subjectID') %.>%
    extend_nse(.,
               rank := rank(),
               partitionby = 'subjectID',
               orderby = c('probability', 'surveyCategory'))  %.>%
    rename_columns(., 'diagnosis' := 'surveyCategory') %.>%
    select_rows_nse(., rank == count) %.>%
    select_columns(., c('subjectID', 
                        'diagnosis', 
                        'probability')) %.>%
    orderby(., 'subjectID') 
}
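Both pipelines are functions of their input, so they can be applied directly to an in-memory data.frame. The data and scale constant below are made-up illustrations (the column names are taken from the pipelines above), and the rquery call additionally assumes a database handle has been configured for ad hoc mode as described in the rquery documentation:

library(dplyr)
library(rquery)
library(wrapr)

scale <- 0.237   # assumed scoring constant for this illustration

# Hypothetical example data with the columns the pipelines expect.
d <- data.frame(
  subjectID       = c(1, 1, 2, 2),
  surveyCategory  = c("withdrawal behavior", "positive re-framing",
                      "withdrawal behavior", "positive re-framing"),
  assessmentTotal = c(5, 2, 3, 4),
  stringsAsFactors = FALSE
)

d %>% dplyr_pipeline     # dplyr: the magrittr functional sequence runs in memory

d %.>% rquery_pipeline   # rquery "ad hoc" mode: work is routed through the database handle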

The primary purpose of rquery is pure in-database processing; in-memory processing is only a convenience. So let’s look at database-only timings. In these tasks the data starts in the database (as is typical for large data projects) and is either consumed (by a row count, called “count”) or saved as a database table (called “land”). In neither case does data move in or out of the database (skipping those overheads).
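As a rough sketch of that in-database pattern (hedged: this uses current rquery helper names such as rq_copy_to, materialize, and to_sql, which may differ from what was available when this post was written; db is an assumed DBI connection to a supported database, and d and rquery_pipeline are the example data and pipeline from above):

library(rquery)

# 'db' is assumed to be a DBI connection to a database rquery supports
# (e.g. PostgreSQL or Spark; SQLite needs window-function support).
d_remote <- rq_copy_to(db, "d", d, temporary = TRUE, overwrite = TRUE)

# Build the operator tree over the remote table description.
optree <- rquery_pipeline(d_remote)

# "land": materialize the result as a database table; no rows leave the database.
materialize(db, optree, table_name = "result")

# "count": consume the result as a single row count, again without moving the data.
DBI::dbGetQuery(db, paste0("SELECT COUNT(1) FROM ( ", to_sql(optree, db), " ) q"))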

And, as we demonstrated earlier, rquery’s query generator has additional features (not shown here) that can yield further speedup in production environments (and also increase query safety and query power).

I think results this good, this early in the project, are very promising. They are the product of design decisions that I think will continue to pay off, and of some intentional scope restrictions (including that rquery targets only SQL engines, and initially only a few of them: Spark, PostgreSQL, and, with some caveats, SQLite). More work needs to be done to add all the SQL translation features needed for general production use, but we think we have demonstrated a good foundation to build on.
