Site icon R-bloggers

rquery Substitution

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The rquery R package has several places where the user can ask for what they have typed in to be substituted for a name or value stored in a variable.

This becomes important as many of the rquery commands capture column names from un-executed code. So knowing if something is treated as a symbol/name (which will be translated to a data.frame column name or a database column name) or a character/string (which will be translated to a constant) is important.

strings/character versus names/symbols

Let’s take a look at this through small examples. First let’s take a look at the difference between strings and symbols in R.

col_string <- "x"
col_name <- as.name(col_string)
str(col_string)
##  chr "x"
str(col_name)
##  symbol x

Notice, in R a string is different than a symbol.

We can see this difference in rquery where an un-quoted x is treated as a symbol (and therefore is translated to a database column) and a quoted entity is treated as a string (and therefore is translated to a literal or constant, not to a column).

library("rquery")

d <- data.frame(x = c('a', 'b'),
                stringsAsFactors = FALSE)

d_rep <- local_td(d)

db_info <- rquery_db_info(identifier_quote_char = "__IDENTIFIER__",
                          string_quote_char = "__STRING_CONSTANT__")
# direct use, comparing to a string constant
# probaly not the query we intend as the
# result is going to be empty independent
# of the data.
cat(to_sql(
  d_rep %.>% select_rows(., is.na('x')),
  db_info))
## Warning in warn_about_filter_conditions(parsed): rquery::select_rows:
## expression is.na("x") refers to no columns (so is a constant)

## SELECT * FROM (
##  SELECT
##   __IDENTIFIER__x__IDENTIFIER__
##  FROM
##   __IDENTIFIER__d__IDENTIFIER__
## ) tsql_22528812269810523566_0000000000
## WHERE ( ( __STRING_CONSTANT__x__STRING_CONSTANT__ ) IS NULL )

(The warning is new to rquery version 1.3.2.)

We take careful note what is marked as "__IDENTIFIER__", versus what is marked as "__STRING_CONSTANT__". Notice "__IDENTIFIER__" is used in the SQL for table names and column name, and "__STRING_CONSTANT__" is used for string constants. The above query is probably not what a user intended as we are checking if a user supplied string constant is NA, which is not interesting.

Likely the correct query omits the quote marks from the x.

# direct use, comparing to a column
cat(to_sql(
  d_rep %.>% select_rows(., is.na(x)),
  db_info))
## SELECT * FROM (
##  SELECT
##   __IDENTIFIER__x__IDENTIFIER__
##  FROM
##   __IDENTIFIER__d__IDENTIFIER__
## ) tsql_46294476393859703536_0000000000
## WHERE ( ( __IDENTIFIER__x__IDENTIFIER__ ) IS NULL )

In the above query we are now comparing an identifier to NULL, which is how SQL expresses comparing the contents of the column named to NULL in a row by row fashion (a useful query).

Or combing the two ideas. We check which rows of the column x have the value "a" as follows.

cat(to_sql(
  d_rep %.>% select_rows(., x == 'a'),
  db_info))
## SELECT * FROM (
##  SELECT
##   __IDENTIFIER__x__IDENTIFIER__
##  FROM
##   __IDENTIFIER__d__IDENTIFIER__
## ) tsql_37960424657179884082_0000000000
## WHERE __IDENTIFIER__x__IDENTIFIER__ = __STRING_CONSTANT__a__STRING_CONSTANT__

wrapr::let() substitution

wrapr::let() substitution is designed only to substitute in names as if the user had typed them. It is deliberately not designed to deal with other value substitutions (such as strings, integers, or floating point values). This is intentional and to keep wrapr::let() to one job: adapting NSE (Non-standard interfaces) to accept names as values.

wrapr::let()‘s principle is that there is no reason for wrapr::let() to ever substitute in a value (such as a string or an integer) as normal evaluation of variable names in environments already supplies a better way to do that. The only thing that is hard to substitute in are new symbols, so wrapr::let() has code to make sure it is doing only that.

Accordingly wrapr::let() treats both names/symbols and strings as symbols.

# Let substitution treats all substitutions as source-text
# so strings and names are as if the user had typed them
# in and behave as names (becoming the name of a column).
let(c(COL_STRING = col_string),
    cat(to_sql(d_rep %.>% select_rows(., is.na(COL_STRING)),
               db_info)))
## SELECT * FROM (
##  SELECT
##   __IDENTIFIER__x__IDENTIFIER__
##  FROM
##   __IDENTIFIER__d__IDENTIFIER__
## ) tsql_90437192531349864130_0000000000
## WHERE ( ( __IDENTIFIER__x__IDENTIFIER__ ) IS NULL )
# Let substitution treats all substitutions as source-text
# so strings and names are as if the user had typed them
# in and behave as names (becoming the name of a column).
let(c(COL_NAME = col_name),
    cat(to_sql(d_rep %.>% select_rows(., is.na(COL_NAME)),
               db_info)))
## SELECT * FROM (
##  SELECT
##   __IDENTIFIER__x__IDENTIFIER__
##  FROM
##   __IDENTIFIER__d__IDENTIFIER__
## ) tsql_97451251523316128028_0000000000
## WHERE ( ( __IDENTIFIER__x__IDENTIFIER__ ) IS NULL )

wrapr::let()‘s operating assumption is: if the user was using wrapr::let() the user was intending a symbol, regardless if they specify that symbol using a string or a symbol type. This means the user doesn’t have to maintain the distinction between string representations of names and symbol representations of names when using wrapr::let(). And again, for substituting string-values in: there are already much better ways, such as R evaluation itself (as we show below).

value_we_want <- "a"

let(c(COL_NAME = col_name),
    cat(to_sql(d_rep %.>% select_rows(., COL_NAME == value_we_want),
               db_info)))
## SELECT * FROM (
##  SELECT
##   __IDENTIFIER__x__IDENTIFIER__
##  FROM
##   __IDENTIFIER__d__IDENTIFIER__
## ) tsql_39226801511562936942_0000000000
## WHERE __IDENTIFIER__x__IDENTIFIER__ = __STRING_CONSTANT__a__STRING_CONSTANT__

By assuming more about user intent wrapr::let() can smooth over inessential differences for the user.

base::bquote() substitution

bquote() substitution on the other hand is designed to substitute arbitrary values into un-executed language objects. This is the usual general definition of quasi-quotation, and is an emergent behavior. That we see the behavior one would expect by simply composing existing R language features. bquote() is what you get when you write reasonable code and then accept the resulting behavior as reasonable (even if the resulting behavior may or may not have been your first choice). This is in fact also a good design principle.

In this case the emergent behavior is: strings are treated as string constants, and names/symbols are treated as column names. That is the consequences of the substitution performed by bquote() is a function of the type of what is being substituted in. This actually makes sense, but it is something the user has to learn.

rquery can use bquote() substitution two ways: through its own NSE methods, or through wrapr:qe() (wrapr quote expression). Both work the same: they treat names/symbols as column names, and character/strings as string constants. So users must express their intent by passing in the correct type.

Here are examples to show the differences. In all cases substitution is triggered by the .()-notation.

# bquote substitution on string type: col_string 
# is taken to represent a string constant, not
# the name of a column.
cat(to_sql(d_rep %.>% select_rows(., is.na(.(col_string))),
           db_info))
## Warning in warn_about_filter_conditions(parsed): rquery::select_rows:
## expression is.na("x") refers to no columns (so is a constant)

## SELECT * FROM (
##  SELECT
##   __IDENTIFIER__x__IDENTIFIER__
##  FROM
##   __IDENTIFIER__d__IDENTIFIER__
## ) tsql_48031370941868950818_0000000000
## WHERE ( ( __STRING_CONSTANT__x__STRING_CONSTANT__ ) IS NULL )
# bquote substitution on name type: col_name 
# is taken to represent a column name.
cat(to_sql(d_rep %.>% select_rows(., is.na(.(col_name))),
           db_info))
## SELECT * FROM (
##  SELECT
##   __IDENTIFIER__x__IDENTIFIER__
##  FROM
##   __IDENTIFIER__d__IDENTIFIER__
## ) tsql_31583525690015673048_0000000000
## WHERE ( ( __IDENTIFIER__x__IDENTIFIER__ ) IS NULL )
# bquote substitution on string type: col_string 
# is taken to represent a string constant, not
# the name of a column.
cat(to_sql(d_rep %.>% select_rows_se(., qe(is.na(.(col_string)))),
           db_info))
## Warning in warn_about_filter_conditions(parsed): rquery::select_rows:
## expression is.na("x") refers to no columns (so is a constant)

## SELECT * FROM (
##  SELECT
##   __IDENTIFIER__x__IDENTIFIER__
##  FROM
##   __IDENTIFIER__d__IDENTIFIER__
## ) tsql_04319820330673360614_0000000000
## WHERE ( ( __STRING_CONSTANT__x__STRING_CONSTANT__ ) IS NULL )
# bquote substitution on name type: col_name 
# is taken to represent a column name.
cat(to_sql(d_rep %.>% select_rows_se(., qe(is.na(.(col_name)))),
           db_info))
## SELECT * FROM (
##  SELECT
##   __IDENTIFIER__x__IDENTIFIER__
##  FROM
##   __IDENTIFIER__d__IDENTIFIER__
## ) tsql_61704616114459293739_0000000000
## WHERE ( ( __IDENTIFIER__x__IDENTIFIER__ ) IS NULL )

Conclusion

wrapr::let() behavior is an example of a forced design: a desirable effect is identified (in this case the ability to substitute in names from variables) and the implementation guarantees this effect. Because the implementation is attempting complete control of semantics we can precisely determine user visible effects. We can bend the implementation to our teaching. wrapr::let() is working around the R language, but deliberately doing so in a very narrow way (we are not re-implementing all of the evaluation path!).

base::bquote() behavior is an example of an emergent design: the code that is natural to get the desired functionality is written, and the exact consequences and details of the implementation are derived from the underlying language semantics. Because the implementation is not trying to work around the underlying language the semantics tend to be good and compatible with other parts of the language.

Both strategies are valid and have their advantages. I feel this in contrast to systems that re-implement very many (or even every) step of expression representation and evaluation. Once one overrides and re-implements all aspects of representation and evaluation one has two incompatible languages (the original and the overridden) bolted together to great confusion.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.