Query Generation in R

Posted on February 16, 2019 by John Mount in R bloggers | 0 Comments

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R users have been enjoying the benefits of SQL query generators for quite some time, most notably using the dbplyr package. I would like to talk about some features of our own rquery query generator, concentrating on derived result re-use.

Introduction

SQL represents value use by nesting. To use a query result within another query one writes the query to be used inside the query using the values. R query generators can also represent value use by nesting, but they also tend to expose a sequential pipe notation where value use is represented by sequencing.

We will demonstrate this with an example.

library("rquery")

raw_connection <- DBI::dbConnect(RSQLite::SQLite(), 
                                 ":memory:")

db_rquery <- rquery_db_info(
  connection = raw_connection,
  is_dbi = TRUE,
  connection_options = rq_connection_tests(raw_connection))

d <- data.frame(x = -3:3)

d0 <- rq_copy_to(db_rquery, "d", d)
print(d0)

## [1] "table(`d`; x)"

ops <- d0 %.>%
  extend(., y = x^2 -1) %.>%
  select_rows(., x > y)

cat(format(ops))

## table(`d`; 
##   x) %.>%
##  extend(.,
##   y := x ^ 2 - 1) %.>%
##  select_rows(.,
##    x > y)

Notice how in the rquery operation we write operations one after another (using the dot-arrow pipe notation), and the operations are similarly printed.

Whereas the generated SQL is written in a nested format.

cat(to_sql(ops, db_rquery))

## SELECT * FROM (
##  SELECT
##   `x`,
##   POWER ( `x` , 2 ) - 1  AS `y`
##  FROM (
##   SELECT
##    `x`
##   FROM
##    `d`
##   ) tsql_38280614414821583912_0000000000
## ) tsql_38280614414821583912_0000000001
## WHERE `x` > `y`

Notice the database handle is kept separate from the operators. Furthermore execution is not entangled with operator definition, but is a separate step (performed through materialze(), execute() or even sending the query to the database via a pipe).

ops %.>%
  db_rquery %.>%
  knitr::kable(.)

x	y
0	-1
1	0

Re-using derived values

The piped operator notation runs into trouble if we re-use derived values. SQL can re-use tables and views, but can not conveniently re-use queries. So if use query results more than once the notation expands our directed acyclic graph specification into a possibly much larger tree.

The following is an artificial example to demonstrate the issue.

d1 <- natural_join(d0, d0, 
                   by = "x", jointype = "LEFT")
d2 <- natural_join(d1, d1, 
                   by = "x", jointype = "LEFT")
d3 <- natural_join(d2, d2, 
                   by = "x", jointype = "LEFT")

cat(format(d3))

## table(`d`; 
##   x) %.>%
##  natural_join(.,
##   table(`d`; 
##     x),
##   j= LEFT, by= x) %.>%
##  natural_join(.,
##   table(`d`; 
##     x) %.>%
##    natural_join(.,
##     table(`d`; 
##       x),
##     j= LEFT, by= x),
##   j= LEFT, by= x) %.>%
##  natural_join(.,
##   table(`d`; 
##     x) %.>%
##    natural_join(.,
##     table(`d`; 
##       x),
##     j= LEFT, by= x) %.>%
##    natural_join(.,
##     table(`d`; 
##       x) %.>%
##      natural_join(.,
##       table(`d`; 
##         x),
##       j= LEFT, by= x),
##     j= LEFT, by= x),
##   j= LEFT, by= x)

Notice the depth 3 expression exploded into tree with 7 joins.

rquery‘s query diagrammer can help spot and diagnose these issues.

d3 %.>%
  op_diagram(., merge_tables = TRUE) %.>% 
  DiagrammeR::grViz(.) %.>%
  DiagrammeRsvg::export_svg(.) %.>%
  write(., file="query_growth_diagram.svg")

## Warning in op_diagram(., merge_tables = TRUE): possible repeated calculation:
##  natural_join(.1, .2,  j= LEFT, by= x)

The gold nodes are possibly repeated calculations, and the warning also notes the issue.

The above example may seem unnatural- but there are a number of places where it is natural to re-use intermediate results. One such circumstance is comparing values in groups without the use of window functions (as demonstrated here).

The query explosion issue is not unique to rquery, dplyr has the same issue.

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

d0_dplyr <- tbl(raw_connection, "d")

d1_dplyr <- left_join(d0_dplyr, d0_dplyr, 
                      by = "x")
d2_dplyr <- left_join(d1_dplyr, d1_dplyr, 
                      by = "x")
d3_dplyr <- left_join(d2_dplyr, d2_dplyr, 
                      by = "x")
dbplyr::remote_query(d3_dplyr)

## <SQL> SELECT `TBL_LEFT`.`x` AS `x`
##   FROM (SELECT `TBL_LEFT`.`x` AS `x`
##   FROM (SELECT `TBL_LEFT`.`x` AS `x`
##   FROM `d` AS `TBL_LEFT`
##   LEFT JOIN `d` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`x` = `TBL_RIGHT`.`x`)
## ) AS `TBL_LEFT`
##   LEFT JOIN (SELECT `TBL_LEFT`.`x` AS `x`
##   FROM `d` AS `TBL_LEFT`
##   LEFT JOIN `d` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`x` = `TBL_RIGHT`.`x`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`x` = `TBL_RIGHT`.`x`)
## ) AS `TBL_LEFT`
##   LEFT JOIN (SELECT `TBL_LEFT`.`x` AS `x`
##   FROM (SELECT `TBL_LEFT`.`x` AS `x`
##   FROM `d` AS `TBL_LEFT`
##   LEFT JOIN `d` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`x` = `TBL_RIGHT`.`x`)
## ) AS `TBL_LEFT`
##   LEFT JOIN (SELECT `TBL_LEFT`.`x` AS `x`
##   FROM `d` AS `TBL_LEFT`
##   LEFT JOIN `d` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`x` = `TBL_RIGHT`.`x`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`x` = `TBL_RIGHT`.`x`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`x` = `TBL_RIGHT`.`x`)

One could hope the query optimizer will detect and eliminate the common sub-expressions, but that is not always going to be the case. In fact sometimes the very size of a query turns off the query optimizer in systems such as Spark. It is better to organize your calculation to not emit so many common sub-expressions in the first place. We share here an example showing explicit value re-use on Spark (preventing a crash in dplyr, but actually slowing down rquery– implying the Spark optimizer is very good when it survives). And we also share a variation of the example here using PostreSQL where landing intermediate results speeds up both dplyr and rquery (dplyr being faster in this particular example; which is not always the case and may be the case, and here may be due to rquery‘s natural_join() coalesce semantics).

Basic solutions

dplyr can easily overcome this limitation with its compute() node, which lands or materializes intermediate results for re-use.

d1_dplyr <- compute(left_join(d0_dplyr, d0_dplyr, 
                              by = "x"))
d2_dplyr <- compute(left_join(d1_dplyr, d1_dplyr, 
                              by = "x"))
d3_dplyr <- compute(left_join(d2_dplyr, d2_dplyr, 
                              by = "x"))
dbplyr::remote_query(d3_dplyr)

## <SQL> SELECT *
## FROM `ofsgqnxcku`

Notice the dplyr::compute() results are actually tables (not general queries).

rquery can also fix the issue by landing intermediate results, though the table lifetime tracking is intentionally more explicit (and the separation between code and results is also deliberately much more explicit).

tmps <- wrapr::mk_tmp_name_source()

d1_mat <- materialize(
  db_rquery,
  natural_join(d0, d0, 
               by = "x", jointype = "LEFT"),
  table_name = tmps(), 
  temporary = TRUE, overwrite = TRUE)
d2_mat <- materialize(
  db_rquery,
  natural_join(d1_mat, d1_mat, 
               by = "x", jointype = "LEFT"),
  table_name = tmps(), 
  temporary = TRUE, overwrite = TRUE)
d3_mat <- materialize(
  db_rquery,
  natural_join(d2_mat, d2_mat, 
               by = "x", jointype = "LEFT"),
  table_name = tmps(), 
  temporary = TRUE, overwrite = TRUE)
cat(format(d3_mat))

## table(`tmpnam_82654876444053153969_0000000002`; 
##   x)

x
-3
-2
-1
0
1
2
3

Conclusion

rquery is an advanced SQL query generator for R that includes a number of tools for managing complex queries. If your queries are taking substantial development time or substantial run time you should definitely consider trying the rquery system. For convenience there is also data.table based implementation of the rquery grammar called rqdatatable (which tends to be much faster that dplyr).

# clean up tmps
intermediates <- tmps(dumpList = TRUE)
for(ti in intermediates) {
  rquery::rq_remove_table(db_rquery, ti)
}

DBI::dbDisconnect(raw_connection)

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Query Generation in R

Introduction

Re-using derived values

Basic solutions

More advanced solutions

Conclusion

Related

Introduction

Re-using derived values

Basic solutions

More advanced solutions

Conclusion

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)