Debugging Pipelines in R with Bizarro Pipe and Eager Assignment

John Mount

5 years ago

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].

This is a note on debugging magrittr pipelines in R using Bizarro Pipe and eager assignment.

Pipes in R

The magrittr R package supplies an operator called “pipe”, which is written as “%>%”. The pipe operator is well known partly because of its extensive use in dplyr code. It is roughly described as allowing one to write “sin(5)” as “5 %>% sin”. It is described as being inspired by F#’s pipe-forward operator “|>”, which itself is defined or implemented as:

    let (|>) x f = f x

The magrittr pipe doesn’t actually perform the above substitution directly. As a consequence “5 %>% sin” is evaluated in a different environment than “sin(5)” would be (unlike F#’s “|>”), and the actual implementation is fairly involved.

The environment change is demonstrated below:

library("dplyr")
f <- function(...) {print(parent.frame())}

f(5)
## <environment: R_GlobalEnv>

5 %>% f
## <environment: 0x1032856a8>

Pipes are like any other coding feature: if you code with it you are eventually going to have to debug with it. Exact pipe semantics and implementation details are important when debugging, as one tries to control execution sequence and examine values and environments while debugging.

A Debugging Example

Consider the following example taken from the “Chaining” section of “Introduction to dplyr”.

library("dplyr")
library("nycflights13")

flights %>%
    group_by(year, month, day) %>%
    select(arr_delay, dep_delay) %>%
    summarise(
        arr = mean(arr_delay, na.rm = TRUE),
        dep = mean(dep_delay, na.rm = TRUE)
    ) %>%
    filter(arr > 30 | dep > 30)
## Adding missing grouping variables: `year`, `month`, `day`
## Source: local data frame [49 x 5]
## Groups: year, month [11]
## 
##     year month   day      arr      dep
##    <int> <int> <int>    <dbl>    <dbl>
## 1   2013     1    16 34.24736 24.61287
## 2   2013     1    31 32.60285 28.65836
## ...

A beginning dplyr user might wonder at the meaning of the warning “Adding missing grouping variables: `year`, `month`, `day`”. Similarly, a veteran dplyr user may wonder why we bother with a dplyr::select(), as selection is implied in the following dplyr::summarise(); but this is the example code as we found it.

Using Bizarro Pipe

We can run down the cause of the warning quickly by performing the mechanical translation from a magrittr pipeline to a Bizarro pipeline. The translation consists of making every first argument explicit with “dot” and replacing the operator “%>%” with the Bizarro pipe glyph: “->.;”.
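As a small base-R illustration of that translation (the numbers and the name `total` are made up for the example): R parses the glyph as a right assignment `->`, a variable named `.`, and the statement separator `;`. Each line is therefore a complete statement that runs, and can fail, on its own.

```r
# Nested call form, for reference.
nested <- sum(sqrt(c(1, 4, 9)))

# Bizarro pipe form: every step lands its result in the variable ".".
c(1, 4, 9) ->.;
  sqrt(.) ->.;
  sum(.) -> total
```

Here `total` and `nested` hold the same value. The flights pipeline from the previous section translates the same way.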

We can re-run the modified code by pasting into R’s command console and the warning now lands much nearer the cause (even when we paste or execute the entire pipeline at once):

flights ->.;
  group_by(., year, month, day) ->.;
  select(., arr_delay, dep_delay) ->.;
## Adding missing grouping variables: `year`, `month`, `day`
  summarise(.,
          arr = mean(arr_delay, na.rm = TRUE),
          dep = mean(dep_delay, na.rm = TRUE)
  ) ->.;
  filter(., arr > 30 | dep > 30)
## Source: local data frame [49 x 5]
## Groups: year, month [11]
## 
##     year month   day      arr      dep
##    <int> <int> <int>    <dbl>    <dbl>
## 1   2013     1    16 34.24736 24.61287
## 2   2013     1    31 32.60285 28.65836
## ...

We can now clearly see the warning was issued by dplyr::select() (even though we just pasted in the whole block of commands at once). This means that, despite help(select) saying “select() keeps only the variables you mention”, this example depends on the (useful) accommodation that dplyr::select() preserves grouping columns in addition to user-specified columns (though this accommodation is not made for columns specified in dplyr::arrange()).
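The same accommodation can be seen on a smaller scale. The sketch below uses the built-in `mtcars` data set, which is not part of the original example, and assumes dplyr is installed; the name `kept` is illustrative.

```r
library("dplyr")

# Group by one column, then select a different one.
mtcars ->.;
  group_by(., cyl) ->.;
  select(., mpg) -> kept

# The grouping column comes along even though only mpg was requested.
names(kept)
```

The select() step should report that it is adding the missing grouping variable `cyl`, and the result carries both `cyl` and `mpg`.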

A Caveat

To capture a value from a Bizarro pipe we must make an assignment at the end of the pipe, not the beginning. The following will not work as it would capture only the value after the first line (“flights ->.;”) and not the value at the end of the pipeline.

One must not write:

VARIABLE <- 
  flights ->.;
  group_by(., year, month, day)

To capture pipeline results we must write:

flights ->.;
  group_by(., year, month, day) -> VARIABLE
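The difference is easy to check in base R with a toy vector (the names `wrong` and `right` are illustrative). The semicolon ends the first statement, so a leading left assignment only ever sees the first step.

```r
# Leading assignment: captures the value before the semicolon, i.e. the input.
wrong <-
  c(1, 4, 9) ->.;
  sqrt(.)

# Trailing right assignment: captures the result of the final step.
c(1, 4, 9) ->.;
  sqrt(.) -> right
```

After running this, `wrong` is the untouched input vector while `right` holds the square roots.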

I think the right assignment is very readable if you have the discipline to only use pipe operators as line-enders, making assignments the unique lines without pipes. Also, leaving an extra line break after assignments helps with readability.

Making Things More Eager

A remaining issue is: Bizarro pipe only made the composition eager. For a data structure with additional lazy semantics (such as dplyr’s view of a remote SQL system) we would still not have the warning near the cause.

Unfortunately different dplyr backends give different warnings, so we can’t demonstrate the same warning here. We can, however, deliberately introduce an error and show how to localize errors in the presence of lazily evaluated data structures. In the example below I have misspelled “month” as “moth”. Notice the error is again not seen until printing, long after we finished composing the pipeline.

s <- dplyr::src_sqlite(":memory:", create = TRUE)
flts <- dplyr::copy_to(s, flights)

flts ->.;
  group_by(., year, moth, day) ->.;
  select(., arr_delay, dep_delay) ->.;
  summarise(.,
          arr = mean(arr_delay, na.rm = TRUE),
          dep = mean(dep_delay, na.rm = TRUE)
          ) ->.;
  filter(., arr > 30 | dep > 30)

## Source:   query [?? x 5]
## Database: sqlite 3.11.1 [:memory:]
## Groups: year, moth

## na.rm not needed in SQL: NULL are always droppedFALSE
## na.rm not needed in SQL: NULL are always droppedFALSE
##  Error in rsqlite_send_query(conn@ptr, statement) : no such column: moth

We can try to force dplyr into eager evaluation using the eager value landing operator “replyr::`%->%`” (from the replyr package) to form the “extra eager” Bizarro glyph: “%->%.;”.

When we re-write the code in terms of the extra eager Bizarro glyph we get the following.

install.packages("replyr")
library("replyr")

flts %->%.;
  group_by(., year, moth, day) %->%.;
## Error in rsqlite_send_query(conn@ptr, statement) : no such column: moth
  select(., arr_delay, dep_delay) %->%.;
  summarise(.,
          arr = mean(arr_delay, na.rm = TRUE),
          dep = mean(dep_delay, na.rm = TRUE)
          ) %->%.;
## na.rm not needed in SQL: NULL are always droppedFALSE
## na.rm not needed in SQL: NULL are always droppedFALSE
  filter(., arr > 30 | dep > 30)
## Source:   query [?? x 5]
## Database: sqlite 3.11.1 [:memory:]

Notice we have successfully localized the error.
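To make the idea of eager landing concrete without a database, here is a toy base-R sketch. Everything in it is an assumption made for illustration: `lazy_step`, `force_value`, and `%land%` are hypothetical names, and this is not replyr's implementation of `%->%`. A lazy value is modeled as a wrapped function that does no work until forced.

```r
# Toy stand-in for a lazily evaluated value: the work is wrapped in a function
# and nothing runs until force_value() is called.
lazy_step <- function(work) structure(list(work = work), class = "toy_lazy")
force_value <- function(x) if (inherits(x, "toy_lazy")) x$work() else x

# Hypothetical eager landing operator (not replyr's code): force the left-hand
# value first, then assign it to the name written on the right-hand side.
`%land%` <- function(value, name) {
  target <- as.character(substitute(name))
  forced <- force_value(value)
  assign(target, forced, envir = parent.frame())
  invisible(forced)
}

bad_step <- lazy_step(function() stop("no such column: moth"))

# Plain Bizarro pipe: the assignment succeeds and the failure stays hidden.
bad_step ->.;
still_lazy <- inherits(., "toy_lazy")

# Eager landing: the same step now fails on the line that caused the problem.
eager_message <- tryCatch(
  { bad_step %land% .; "no error" },
  error = function(e) conditionMessage(e)
)

# A step that works is forced and landed in "." as usual.
lazy_step(function() sum(1:3)) %land% .
landed <- .
```

The design point is only that forcing happens before the assignment, so an error is raised at the step that produced it rather than at print time.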

Nota Bene

One thing to be careful with in “dot debugging” is: when a statement such as dplyr::select() errors out, the Bizarro assignment on that line does not occur (normal R exception semantics). Thus “dot” will still be carrying the value from the previous line, and the pasted block of code will continue after the failing line using this older data state found in “dot”. So you may see strange results and additional errors indicated in the pipeline. The debugging advice is: at most the first error message is trustworthy.
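The stale “dot” effect can be reproduced in base R. In this sketch `try()` stands in for the console carrying on to the next pasted line after an error, and `no_such_variable` is a deliberately undefined name.

```r
c(1, 4, 9) ->.;
  sqrt(.) ->.;   # "." now holds the square roots

# This step fails before its assignment happens, so "." is left unchanged.
try(sqrt(.) + no_such_variable ->., silent = TRUE)

# Later steps run anyway, silently using the value from two lines earlier.
sum(.) -> total
```

Here `total` is computed from the output of the last successful step, not from the failed one, which is why only the first reported error should be trusted.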

The Trick

The trick is to train your eyes to read “->.;” or “%->%.;” as a single atomic or indivisible glyph, and not as a sequence of operators, variables, and separators. I see Bizarro pipe as a kind of strange superhero.

Conclusion

Pipes are a fun notation, and even the original magrittr package experiments with a number of interesting variations of them. I hope you add Bizarro pipe (which, it turns out, has been available in R all along, without requiring any packages!) and extra eager Bizarro pipe to your debugging workflow.
