Debugging Pipelines in R with Bizarro Pipe and Eager Assignment
This is a note on debugging magrittr pipelines in R using Bizarro Pipe and eager assignment.
Pipes in R
The magrittr R package supplies an operator called “pipe”, which is written as “%>%”. The pipe operator owes much of its fame to its extensive use in dplyr. The pipe operator is roughly described as allowing one to write “sin(5)” as “5 %>% sin”. It is described as being inspired by F#’s pipe-forward operator “|>”, which itself is defined as:
let (|>) x f = f x
The magrittr pipe doesn’t actually perform the above substitution directly. As a consequence, “5 %>% sin” is evaluated in a different environment than “sin(5)” would be (unlike F#’s “|>”), and the actual implementation is fairly involved.
The environment change is demonstrated below:
library("dplyr")
f <- function(...) {print(parent.frame())}
f(5)
## <environment: R_GlobalEnv>
5 %>% f
## <environment: 0x1032856a8>
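The environment shift is not peculiar to magrittr: any pipe written as an ordinary R function adds a call frame of its own. As a minimal sketch, here is a literal base-R transcription of the F# definition. The operator name `%|>%` and the helper `caller_env` are hypothetical names chosen for this illustration, and this is not how magrittr itself is implemented.

```r
# Hypothetical naive pipe: a literal transcription of "let (|>) x f = f x".
# Illustration only; this is NOT magrittr's implementation.
`%|>%` <- function(x, f) f(x)

# For plain values the naive pipe agrees with direct application.
piped  <- 5 %|>% sin
direct <- sin(5)

# Hypothetical helper that reports the environment it was called from.
caller_env <- function(...) parent.frame()

here       <- environment()       # the environment this code is running in
env_direct <- caller_env(5)       # called directly from `here`
env_piped  <- 5 %|>% caller_env   # called from inside the body of `%|>%`

print(identical(piped, direct))
print(identical(env_direct, here))
print(identical(env_piped, here))
```

The first two comparisons come out TRUE and the last one FALSE: the value is the same, but the function sees a different calling environment once a pipe function sits in between.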
Pipes are like any other coding feature: if you code with it you are eventually going to have to debug with it. Exact pipe semantics and implementation details are important when debugging, as one tries to control execution sequence and examine values and environments while debugging.
A Debugging Example
Consider the following example taken from the “Chaining” section of “Introduction to dplyr”.
library("dplyr")
library("nycflights13")

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
## Adding missing grouping variables: `year`, `month`, `day`
## Source: local data frame [49 x 5]
## Groups: year, month [11]
##
##     year month   day      arr      dep
##    <int> <int> <int>    <dbl>    <dbl>
## 1   2013     1    16 34.24736 24.61287
## 2   2013     1    31 32.60285 28.65836
## ...
A beginning dplyr user might wonder at the meaning of the warning “Adding missing grouping variables: `year`, `month`, `day`”. Similarly, a veteran dplyr user may wonder why we bother with a dplyr::select(), as selection is implied in the following dplyr::summarise(); but this is the example code as we found it.
Using Bizarro Pipe
We can run down the cause of the warning quickly by performing the mechanical translation from a magrittr pipeline to a Bizarro pipeline. This simply means making all the first arguments explicit with “dot” and replacing the operator “%>%” with the Bizarro pipe glyph: “->.;”.
We can re-run the modified code by pasting it into R’s command console, and the warning now lands much nearer the cause (even when we paste or execute the entire pipeline at once):
flights ->.;
  group_by(., year, month, day) ->.;
  select(., arr_delay, dep_delay) ->.;
## Adding missing grouping variables: `year`, `month`, `day`
  summarise(.,
            arr = mean(arr_delay, na.rm = TRUE),
            dep = mean(dep_delay, na.rm = TRUE)
  ) ->.;
  filter(., arr > 30 | dep > 30)
## Source: local data frame [49 x 5]
## Groups: year, month [11]
##
##     year month   day      arr      dep
##    <int> <int> <int>    <dbl>    <dbl>
## 1   2013     1    16 34.24736 24.61287
## 2   2013     1    31 32.60285 28.65836
## ...
We can now clearly see the warning was issued by dplyr::select() (even though we just pasted in the whole block of commands at once). This means that, despite help(select) saying “select() keeps only the variables you mention”, this example depends on the (useful) accommodation that dplyr::select() preserves grouping columns in addition to user-specified columns (though this accommodation is not made for columns specified in dplyr::arrange()).
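The mechanical translation can be tried without any packages at all. The following is a small base-R sketch on made-up numbers (the vector `x` and the name `total` are just illustrative): the magrittr form would read `x %>% sqrt %>% sum`, and the Bizarro form makes each first argument an explicit dot.

```r
# Toy data standing in for a real data frame (values are made up).
x <- c(4, 9, 16, 25)

# magrittr style would read: x %>% sqrt %>% sum
# Bizarro style: first arguments made explicit, "%>%" replaced by "->.;"
x ->.;
  sqrt(.) ->.;
  sum(.) -> total

# Each step is an ordinary top-level statement, so any message, warning, or
# error is reported right after the statement that raised it.
print(total)
```

After the block runs, “dot” still holds the last value assigned to it (here the square roots), so intermediate results are available for inspection.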
A Caveat
To capture a value from a Bizarro pipe we must make an assignment at the end of the pipe, not the beginning. The following will not work as it would capture only the value after the first line (“flights ->.;”) and not the value at the end of the pipeline.
One must not write:
VARIABLE <-
  flights ->.;
  group_by(., year, month, day)
To capture pipeline results we must write:
flights ->.;
  group_by(., year, month, day) -> VARIABLE
I think the right assignment is very readable if you have the discipline to only use pipe operators as line-enders, making assignments the unique lines without pipes. Also, leaving an extra line break after assignments helps with readability.
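The caveat is easy to confirm on a toy base-R pipeline. In this sketch the names `wrong` and `right` are illustrative; the point is only where the captured value comes from.

```r
# Wrong: "<-" binds more loosely than "->", so the first line parses as
#   wrong <- (1:10 -> .)
# and `wrong` captures the pipeline input, not the result.
wrong <- 1:10 ->.;
  sum(.)

# Right: land the result at the end of the pipeline.
1:10 ->.;
  sum(.) -> right

print(wrong)
print(right)
```

Here `wrong` ends up holding the vector 1:10, while `right` holds the sum.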
Making Things More Eager
A remaining issue is that Bizarro pipe only made the composition eager. For a data structure with additional lazy semantics (such as dplyr’s view of a remote SQL system) we would still not have the warning near the cause.
Unfortunately different dplyr backends give different warnings, so we can’t demonstrate the same warning here. We can, however, deliberately introduce an error and show how to localize errors in the presence of lazily evaluated data structures. In the example below I have misspelled “month” as “moth”. Notice the error is again not seen until printing, long after we finished composing the pipeline.
s <- dplyr::src_sqlite(":memory:", create = TRUE)
flts <- dplyr::copy_to(s, flights)

flts ->.;
  group_by(., year, moth, day) ->.;
  select(., arr_delay, dep_delay) ->.;
  summarise(.,
            arr = mean(arr_delay, na.rm = TRUE),
            dep = mean(dep_delay, na.rm = TRUE)
  ) ->.;
  filter(., arr > 30 | dep > 30)
## Source:   query [?? x 5]
## Database: sqlite 3.11.1 [:memory:]
## Groups: year, moth
## na.rm not needed in SQL: NULL are always droppedFALSE
## na.rm not needed in SQL: NULL are always droppedFALSE
## Error in rsqlite_send_query(conn@ptr, statement) : no such column: moth
We can try to force dplyr into eager evaluation using the eager value landing operator “replyr::`%->%`” (from the replyr package) to form the “extra eager” Bizarro glyph: “%->%.;”.
When we re-write the code in terms of the extra eager Bizarro glyph we get the following.
install.packages("replyr")
library("replyr")

flts %->%.;
  group_by(., year, moth, day) %->%.;
## Error in rsqlite_send_query(conn@ptr, statement) : no such column: moth
  select(., arr_delay, dep_delay) %->%.;
  summarise(.,
            arr = mean(arr_delay, na.rm = TRUE),
            dep = mean(dep_delay, na.rm = TRUE)
  ) %->%.;
## na.rm not needed in SQL: NULL are always droppedFALSE
## na.rm not needed in SQL: NULL are always droppedFALSE
  filter(., arr > 30 | dep > 30)
## Source:   query [?? x 5]
## Database: sqlite 3.11.1 [:memory:]
Notice we have successfully localized the error.
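To make the idea concrete without a database, here is a toy base-R sketch of an eager landing operator. Everything in it is hypothetical: the operator name `%land%`, the use of zero-argument functions as stand-ins for lazy query objects, and the forcing rule. It is not replyr's implementation, only an illustration of “force the value, then assign it to the name on the right”.

```r
# Hypothetical eager landing operator (NOT replyr's `%->%`).
# Lazy values are modelled here as zero-argument functions ("thunks").
`%land%` <- function(value, name) {
  # Name written on the right-hand side, e.g. "."
  target <- as.character(substitute(name))
  # Force the lazy value now, so any error surfaces on this line.
  forced <- if (is.function(value)) value() else value
  # Land the forced result in the caller's environment.
  assign(target, forced, envir = parent.frame())
  invisible(forced)
}

# A lazy step that only fails when it is finally evaluated.
bad_step <- function() stop("no such column: moth")

# Plain Bizarro pipe: the assignment succeeds and the error stays latent.
bad_step ->.;
latent <- is.function(.)

# Eager landing: the error is raised on the very line that lands the value.
# try() is used only so this script keeps running.
outcome <- try(bad_step %land% ., silent = TRUE)
failed_at_landing <- inherits(outcome, "try-error")

# A well-behaved lazy step is forced and landed in ".".
good_step <- function() 42
good_step %land% .;
landed <- .
```

Because a custom `%...%` operator binds more tightly than `->`, the trailing dot is parsed as the operator's right-hand argument, which is what lets a glyph of the shape “%land%.;” work at all in this sketch.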
Nota Bene
One thing to be careful with in “dot debugging” is: when a statement such as dplyr::select() errors out, the Bizarro assignment on that line does not occur (normal R exception semantics). Thus “dot” will still be carrying the value from the previous line, and the pasted block of code will continue after the failing line using this older data state found in “dot.” So you may see strange results and additional errors further down the pipeline. The debugging advice is: at most the first error message is trustworthy.
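The stale “dot” effect can be reproduced in base R. In this sketch the failing step is simulated with stop(), and try() is an assumption added only so that a non-interactive script keeps going the way a pasted block would in the console; the names `stale` and `downstream` are illustrative.

```r
c(1, 2, 3) ->.;

# Simulate a failing pipeline step. The error is raised before the
# assignment to "." can take place.
try(stop("simulated failure in this step") -> ., silent = TRUE)

# The assignment on the failing line never happened, so "." is stale.
stale <- .

# The next step runs anyway, on the older data state.
sum(.) -> downstream
```

Here `stale` is still the original vector and `downstream` is computed from it, which is exactly why only the first error in a pasted block should be trusted.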
The Trick
The trick is to train your eyes to read “->.;” or “%->%.;” as a single atomic or indivisible glyph, and not as a sequence of operators, variables, and separators. I see Bizarro pipe as a kind of strange superhero.
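For readers who want to see what the glyph is made of before learning to ignore it, base R can show how the parser takes it apart. This is only a small sketch; the variable names are illustrative.

```r
# The parser rewrites right assignment "a -> b" into "b <- a".
same_call <- identical(quote(5 -> .), quote(. <- 5))

# ";" ends a statement, so "5 ->.; sin(.)" is two separate expressions.
pieces <- parse(text = "5 ->.; sin(.)")
n_statements <- length(pieces)

# Evaluating the pieces in order, in a scratch environment, reproduces sin(5).
env <- new.env()
for (piece in pieces) value <- eval(piece, env)

print(same_call)
print(n_statements)
print(value)
```

So the glyph is nothing more than a right assignment into a variable named “.” followed by a statement separator, which is why it needs no packages.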
Conclusion
Pipes are a fun notation, and even the original magrittr package experiments with a number of interesting variations of them. I hope you add Bizarro pipe (which, it turns out, has been available in R all along, without requiring any packages!) and extra eager Bizarro pipe to your debugging workflow.