What is new for rquery December 2019
Our goal has been to make `rquery` the best query generation system for R (and to make `data_algebra` the best query generator for Python). Let’s see what `rquery` is good at, and what new features are making `rquery` better.
The idea is: the query is a first-class citizen that we can use to design and optimize queries prior to translating them into a data transform action via `data.table`, SQL, `Pandas`, or other realizations.

For quite a while `rquery` has had query-narrowing: columns that are not used in the calculation are eliminated early. Here is an example.
```r
library(rquery)

ops <- mk_td(
  "d",
  c("col1", "col2", "col3")) %.>%
  extend(., sum23 := col2 + col3) %.>%
  select_columns(., 'sum23')

cat(format(ops))
```
## mk_td("d", c( ## "col1", ## "col2", ## "col3")) %.>% ## extend(., ## sum23 := col2 + col3) %.>% ## select_columns(., ## c('sum23'))
The above query (or operator DAG) represents working with a table that has columns `col1`, `col2`, and `col3`. The example specifies adding a new derived column named `sum23` and then limiting down to only this new column. We’ve tried to use operator names that evoke the operator names used by Codd.
An important point is: the query is bound to a description of a data frame (or a schema), not bound to any one data frame. Thus we can re-use the query on new data.
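For example, here is a minimal sketch of that re-use (this assumes the `rqdatatable` package, which this note loads further below, and a hypothetical new data frame `d2` carrying the declared columns):

```r
# Sketch: re-using the same ops pipeline on new data
# (rqdatatable supplies the in-memory executor).
library(rqdatatable)

d2 <- data.frame(col1 = 10, col2 = 20, col3 = 30)
d2 %.>% ops   # expect a single-row result with sum23 = 50
```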
The record-keeping in the query knows that only columns `col2` and `col3` are used.
```r
columns_used(ops)
```
```
## $d
## [1] "col2" "col3"
```
This allows “query narrowing”, where the unused columns are not specified in intermediate queries. This is easiest to see if we convert the query to SQL.
```r
ops %.>%
  to_sql(
    .,
    rquery::rquery_default_db_info()) %.>%
  cat(.)
```
```
## SELECT
##  "sum23"
## FROM (
##  SELECT
##   "col2" + "col3" AS "sum23"
##  FROM (
##   SELECT
##    "col2",
##    "col3"
##   FROM
##    "d"
##   ) tsql_76973382323412881950_0000000000
##  ) tsql_76973382323412881950_0000000001
```
Notice `col1` is never referred to. This can be handy when working with tables with hundreds of columns.
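As a quick check, the generated SQL can be run directly against a database. Here is a minimal sketch using an in-memory SQLite connection (the use of `RSQLite` here is an assumption on my part; any DBI connection that accepts double-quoted identifiers should behave similarly):

```r
# Sketch: execute the narrowed SQL on an in-memory SQLite database.
con_check <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con_check, "d",
                  data.frame(col1 = 1, col2 = 2, col3 = 3))
sql <- to_sql(ops, rquery::rquery_default_db_info())
DBI::dbGetQuery(con_check, sql)   # expect one row with sum23 = 5
DBI::dbDisconnect(con_check)
```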
And, using `rqdatatable`, we can use `data.table` as another data action implementation.
```r
library(rqdatatable)

data.frame(col1 = 1, col2 = 2, col3 = 3) %.>%
  ops %.>%
  knitr::kable(.)
```
| sum23 |
| ----- |
| 5 |
`rquery` now also has query-shortening: some dead values can be eliminated during query construction, before any calculations are attempted.
```r
ops <- mk_td(
  "example_table",
  c("col1", "col2", "col3")) %.>%
  extend(., sum23 := col2 + col3) %.>%
  extend(., x := 1) %.>%
  extend(., x := 2) %.>%
  extend(., x := 3) %.>%
  extend(., x := 4) %.>%
  extend(., x := 5) %.>%
  select_columns(., c('x', 'sum23'))

cat(format(ops))
```
## mk_td("example_table", c( ## "col1", ## "col2", ## "col3")) %.>% ## extend(., ## sum23 := col2 + col3, ## x := 5) %.>% ## select_columns(., ## c('x', 'sum23'))
Obviously nobody would construct such a bad query, but it is nice that some of the “ick” is optimized automatically.
Both of the above optimizations are deliberately conservative: they are implemented to be correct (to never give incorrect results), but they are not necessarily aggressive in eliminating every redundancy.
Both of these optimizations are easy to implement thanks to the use of category-theoretic ideas in the design of the `rquery` and `data_algebra` packages (I am working on some notes on this here; they are a bit long and technical).
The short form is: the `rquery`/`data_algebra` operators have an interpretation in a nice category over table schemas. The schema objects give us pre-condition and post-condition record keeping, which enforces correct query composition and query narrowing. The generality of arrow composition gives us the freedom to place optimizations in the composition step. This gives us more options than systems that are restricted to list-concatenation or function composition/abstraction as their notion of composition. It also lets us enforce and check conditions early.
`rquery` performs most of its checking during query construction. This can catch errors early and save a lot of development time.
```r
ops_bad <- mk_td(
  "example_table",
  c("col1", "col2", "col3")) %.>%
  extend(., sum23 := col2_MISSPELLED + col3)
```
```
## Error in check_have_cols(src_columns, required_cols, "rquery::extend"): rquery::extend unknown columns col2_MISSPELLED
```
Notice an error was raised during query construction. We didn’t have to wait to supply data or translate to SQL.
Let’s take a look at the SQL translation of our final example query.
```r
ops %.>%
  to_sql(
    .,
    rquery::rquery_default_db_info()) %.>%
  cat(.)
```
```
## SELECT
##  "x",
##  "sum23"
## FROM (
##  SELECT
##   "col2" + "col3" AS "sum23",
##   5 AS "x"
##  FROM (
##   SELECT
##    "col2",
##    "col3"
##   FROM
##    "example_table"
##   ) tsql_28722584463189084716_0000000000
##  ) tsql_28722584463189084716_0000000001
```
There are some more things we would wish optimized away, such as both the inner and outer SELECT. But the SQL is reasonably short, due to the intermediate stages that were optimized out of the original query. Later versions of the system will pick these up, and these steps are likely also easy for downstream SQL optimizers to eliminate.
An important point: optimizations performed during query construction are shared among all back-ends: `data.table`, SQL, and `Pandas`.
Please consider giving `rquery` a try.
Appendix
We often get asked “why bother with `rquery`, given `dplyr` was first?” I’d say: if you are happy with `dplyr`, don’t worry about `rquery`. Though I would add: you really owe it to yourself to check out `data.table`; it is by far the best data manipulation system in R.
However, let’s take a look at how `dbplyr` generates a similar SQL query.
```r
library(dplyr)
```

```
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

```r
packageVersion("dplyr")
```

```
## [1] '0.8.3'
```

```r
library(dbplyr)
```

```
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
```

```r
packageVersion("dbplyr")
```

```
## [1] '1.4.2'
```
```r
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

copy_to(
  con,
  data.frame(col1 = 1, col2 = 2, col3 = 3),
  name = 'd')

tbl(con, 'd') %>%
  mutate(sum23 := col2 + col3) %>%
  mutate(x := 1) %>%
  mutate(x := 2) %>%
  mutate(x := 3) %>%
  mutate(x := 4) %>%
  mutate(x := 5) %>%
  select(x, sum23) %>%
  show_query()
```
```
## <SQL>
## SELECT 5.0 AS `x`, `sum23`
## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 4.0 AS `x`
## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 3.0 AS `x`
## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 2.0 AS `x`
## FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 1.0 AS `x`
## FROM (SELECT `col1`, `col2`, `col3`, `col2` + `col3` AS `sum23`
## FROM `d`)))))
```
The `dplyr` SQL query appears to have neither query narrowing nor query shortening. Again, a downstream SQL optimizer may be able to eliminate these steps (or it may not). However, it would also be desirable to have these sorts of eliminations available when using `data.table` through `dtplyr`.
Also, `dbplyr` does not seem to catch errors until `compute()` or `print()` is called.
```r
tbl(con, 'd') %>%
  mutate(sum23 := col2_MISSPELLED + col3) %>%
  show_query()
```
```
## <SQL>
## SELECT `col1`, `col2`, `col3`, `col2_MISSPELLED` + `col3` AS `sum23`
## FROM `d`
```
The above SQL refers to a non-existent column `col2_MISSPELLED`. The query construction and SQL generation steps did not signal any error. Depending on how many queries and steps come before this point, this could delay finding the mistake by quite a while (especially when using a high-latency SQL engine such as Apache Spark).
```r
DBI::dbDisconnect(con)
```