
Workflows for querying databases via R


Simple, self-contained, reproducible examples are a common part of good software documentation. However, in the spirit of brevity, these examples often do not demonstrate the most sustainable or flexible workflows for integrating software tools into large projects. In this post, I document a few mundane but useful patterns for querying databases in R using the DBI package.

A prototypical example of forming and using a database connection with DBI might look something like this:

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "diamonds", ggplot2::diamonds)
dat <- dbGetQuery(con, "select cut, count(*) as n from diamonds group by 1")
dat

#>         cut     n
#> 1      Fair  1610
#> 2      Good  4906
#> 3     Ideal 21551
#> 4   Premium 13791
#> 5 Very Good 12082

A connection is formed (in this case to a fake database that lives only in my computer’s RAM), the diamonds dataset from the ggplot2 package is written to the database (once again, this is for example purposes only; a real database would, of course, have data), and dbGetQuery() executes a query on the resulting table.
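When working against a real database rather than this in-memory example, it is also good practice to close the connection once the data has been pulled. DBI provides dbDisconnect() for this (omitted from the snippets below to keep them short):

dbDisconnect(con)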

However, as queries get longer and more complex, this succinct solution becomes less attractive. Writing the query directly inside dbGetQuery() blurs the line between “glue code” (rote connection and execution) and our more nuanced, problem-specific logic. This makes the latter harder to extract, share, and version.

Below, I demonstrate a few alternatives that I find helpful in different circumstances: reading queries that are saved separately (in other files or at web URLs), forming increasingly complex query templates, shipping queries inside an R package, and, as a bonus, mocking small datasets for testing SQL.

Read query from separate file

A first enhancement is to isolate your SQL script in a separate file from the “glue code” that executes it. This improves readability and makes scripts more portable between projects. If a coworker who uses Python or runs SQL through some other tool wishes to use your script, it’s more obvious which parts are relevant. Additionally, it’s easier to version control: we likely care far more about changes to the actual query than to the boilerplate code that executes it, so it feels more transparent to track them separately.

To do this, we can save our query in a separate file. We’ll call it query-cut.sql:

select
  cut,
  count(*) as n
from diamonds
group by 1

Then, in the script that pulls the data, we can read that file with readLines() and pass the result to dbGetQuery().

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
query <- paste(readLines("query-cut.sql"), collapse = "\n")
dat <- dbGetQuery(con, query)
dat

#>         cut     n
#> 1      Fair  1610
#> 2      Good  4906
#> 3     Ideal 21551
#> 4   Premium 13791
#> 5 Very Good 12082

Of course, if you wish you could define a helper function for the bulky paste(readLines(...)) bit.

read_source <- function(path) {paste(readLines(path), collapse = "\n")}
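With that helper defined, the read-and-query step collapses to, for example:

query <- read_source("query-cut.sql")
dat <- dbGetQuery(con, query)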

Read query from URL (like GitHub)

Sometimes, you might prefer that your query not live in your project at all. For example, if a query is used across multiple projects or if it changes frequently or is maintained by multiple people, it might live in a separate repository. In this case, the exact same workflow may be used if the path is replaced by a URL to a plain-text version of the query. (On GitHub, you may find such a link by clicking the “Raw” button when a file is pulled up.)

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
url <- "https://raw.githubusercontent.com/emilyriederer/website/master/content/post/sql-r-flow/query-cut.sql"
query <- paste(readLines(url), collapse = "\n")
dat <- dbGetQuery(con, query)
dat

#>         cut     n
#> 1      Fair  1610
#> 2      Good  4906
#> 3     Ideal 21551
#> 4   Premium 13791
#> 5 Very Good 12082

This works because the query variable simply contains our complete text file read from the internet:

cat(query)

#> select
#>   cut,
#>   count(*) as n
#> from diamonds
#> group by 1

Alternatively, in an institutional setting you may find that you need some sort of authentication or proxy to access GitHub from R. In that case, you can retrieve the same query with an explicit HTTP request using the httr package, which makes it easier to attach credentials or proxy settings to the call.

library(httr)

url <- "https://raw.githubusercontent.com/emilyriederer/website/master/content/post/sql-r-flow/query-cut.sql"
query <- content(GET(url))
cat(query)

#> select
#>   cut,
#>   count(*) as n
#> from diamonds
#> group by 1
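The exact configuration depends entirely on your environment, but as a rough sketch, httr lets you attach a proxy and an authorization header to the same request. The proxy host, port, and token environment variable below are placeholders to substitute with your own:

library(httr)

resp <- GET(
  url,
  use_proxy("proxy.example.com", port = 8080),                           # hypothetical proxy
  add_headers(Authorization = paste("token", Sys.getenv("GITHUB_PAT")))  # hypothetical token
)
query <- content(resp, as = "text", encoding = "UTF-8")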

Use query template

Separating the query from its actual execution also allows us to do query pre-processing. For example, instead of a normal query, we could write a query template with a placeholder variable. Consider the file template-cut.sql:

select
  cut,
  count(*) as n
from diamonds
where price < {max_price}
group by 1

This query continues to count the number of diamonds in our dataset by their cut classification, but it now parameterizes the maximum price via the {max_price} placeholder. We can then use the glue package to populate this template with a value of interest before executing the script.

library(DBI)
library(glue)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
template <- paste(readLines("template-cut.sql"), collapse = "\n")
query <- glue(template, max_price = 500)
dat <- dbGetQuery(con, query)
dat

#>         cut   n
#> 1      Fair   7
#> 2      Good 226
#> 3     Ideal 628
#> 4   Premium 215
#> 5 Very Good 653

This is a useful workaround for databases that do not allow local variables.
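One caution: glue() interpolates values into the query verbatim, so if a value comes from user input you may prefer glue_sql(), which quotes values appropriately for the connection at hand. A minimal sketch of the same pull using glue_sql() instead:

library(glue)

template <- paste(readLines("template-cut.sql"), collapse = "\n")
query <- glue_sql(template, max_price = 500, .con = con)  # values are quoted safely for con
dat <- dbGetQuery(con, query)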

Compose more complex queries

The idea of templating opens up far more interesting possibilities. For example, consider a case where you wish to frequently create the same data structure for a different population of observations (e.g. a standard set of KPIs for different A/B test experiments, reporting for different business units, etc.)

A boilerplate part of the query could be defined as a template ready to accept a CTE or a subquery for a specific population of interest. For example, we could write a file template-multi.sql:

with
sample as ({query_sample}),
prices as (select id, cut, price from diamonds)
select prices.*
from
  prices
  inner join
  sample
  on
  prices.id = sample.id

Then our “glue code” can combine the static and dynamic parts of the query at runtime before executing.

template <- paste(readLines("template-multi.sql"), collapse = "\n")
query_sample <- "select * from diamonds where cut = 'Very Good' and carat < 0.25"
query <- glue(template, query_sample = query_sample)
cat(query)

#> with
#> sample as (select * from diamonds where cut = 'Very Good' and carat < 0.25),
#> prices as (select id, cut, price from diamonds)
#> select prices.*
#> from
#>   prices
#>   inner join
#>   sample
#>   on
#>   prices.id = sample.id

(Of course, this may seem like overkill and an unnecessarily inefficient query for the example above, where a few more where conditions would have sufficed. But one can imagine more useful applications in a traditional setting where multiple tables are being joined.)
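The payoff is that the same boilerplate can be reused for any number of populations by swapping out only the subquery. For instance, a second, purely hypothetical population could be plugged into the same template:

query_sample <- "select * from diamonds where cut = 'Ideal' and carat < 0.25"
query <- glue(template, query_sample = query_sample)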

Query package

Finally, these queries and query templates could even be shipped as part of an R package. Additional text files may be placed in the inst/ directory and their paths discovered by system.file(). So, if your package myPkg were to contain the template-multi.sql file we saw above, you could provide a function to access it like so:

construct_query <- function(stub, ...) {
  # locate the query template shipped in the package's inst/ directory
  # (e.g. stub "multi" maps to the file template-multi.sql)
  path <- system.file(paste0("template-", stub, ".sql"), package = "myPkg")
  template <- paste(readLines(path), collapse = "\n")
  glue::glue(template, ...)
}

Then, that function could be called like so:

sample <- "select * from diamonds where cut = 'Very Good' and carat < 0.25"
query <- construct_query("multi", query_sample = sample)

This approach has some benefits, such as making it easier to share queries across users and to benefit from package versioning and environment management standards. However, there are of course trade-offs: generating queries only at runtime can limit reproducibility and leave no record of the exact query used to produce a given dataset. Thus, it might be a good idea to save the resulting query along with the resulting data:

dat <- dbGetQuery(con, query)
saveRDS(dat, "data.rds")
writeLines(query, "query-data.sql")

Bonus: Data Testing

Although unrelated to the previous workflow, another nice aspect of working with SQL through R is the ability to use R’s superior toolkit for creating small datasets on the fly for testing purposes.

Mocking data to easily test SQL can be a tedious exercise since you generally need to write out the dataset row-by-row:

INSERT INTO test_tbl
  ( x, y, z )
VALUES
  (1, 'A', NULL), 
  (2, 'B', 0), 
  (3, 'C', 1);

This may be fine for a few rows, but it quickly gets cumbersome when you need a mock dataset of tens of rows.

Of course, R has many helpful functions for generating data including sequences, predefined vectors (e.g. letters), and random number generators. This can make it easy to quickly generate data and push it to a database for testing SQL scripts:

n <- 26
test_df <- data.frame(
  x = 1:n,
  y = LETTERS,
  z = sample(c(0,1,NA), n, replace = TRUE)
)
dbWriteTable(con, "test_tbl", test_df)
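From there, the SQL under development can be run directly against the mocked table. For example, a quick sanity check that the NA values arrived in the database as NULLs:

dbGetQuery(con, "select count(*) as n_null from test_tbl where z is null")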
