Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I currently advise R
dplyr
users to take care when using multiple assignment dplyr::mutate()
commands on databases.
(image: Kingroyos, Creative Commons Attribution-Share Alike 3.0 Unported License)
In this note I exhibit a troublesome example, and a systematic solution.
First let’s set up dplyr
, our database, and some example data.
library("dplyr")
## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union
packageVersion("dplyr")
## [1] '0.7.4'
packageVersion("dbplyr")
## [1] '1.2.0'
db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") d <- dplyr::copy_to( db, data.frame(xorig = 1:5, yorig = sin(1:5)), "d")
Now suppose somewhere in one of your projects somebody (maybe not even you) has written code that looks somewhat like the following.
d %>% mutate( delta = 0, x0 = xorig + delta, y0 = yorig + delta, delta = delta + 1, x1 = xorig + delta, y1 = yorig + delta, delta = delta + 1, x2 = xorig + delta, y2 = yorig + delta ) %>% select(-xorig, -yorig, -delta) %>% knitr::kable()
x0 | y0 | x1 | y1 | x2 | y2 |
---|---|---|---|---|---|
1 | 0.8414710 | 1 | 0.8414710 | 1 | 0.8414710 |
2 | 0.9092974 | 2 | 0.9092974 | 2 | 0.9092974 |
3 | 0.1411200 | 3 | 0.1411200 | 3 | 0.1411200 |
4 | -0.7568025 | 4 | -0.7568025 | 4 | -0.7568025 |
5 | -0.9589243 | 5 | -0.9589243 | 5 | -0.9589243 |
Notice the above gives an incorrect result: all of the x_i
columns are identical, and all of the y_i
columns are identical. I am not saying the above code is in any way desirable (though something like it does arise naturally in certain test designs). If this is truly “incorrect dplyr
code” we should have seen an error or exception. Unless you can be certain you have no code like that in a database backed dplyr
project: you can not be certain you have not run into the problem producing silent data and result corruption.
The issue is: dplyr
on databases does not seem to have strong enough order of assignment statement execution guarantees. The running counter “delta
” is taking only one value for the entire lifetime of the dplyr::mutate()
statement (which is clearly not what the user would want).
The fix is: break up the dplyr::mutate()
into a series of smaller mutates that don’t exhibit the problem. It is a trade-off breaking up dplyr::mutate()
on a database causes deeper statement nesting, and potential loss of performance. However, correct results should come before speed.
One automated variation of the fix is to use seplyr
‘s statement partitioner. seplyr
can factor the large mutate in a minimal number of very safe sub-mutates (and use dplyr
to execute them).
d %>% seplyr::mutate_se( seplyr::quote_mutate( delta = 0, x0 = xorig + delta, y0 = yorig + delta, delta = delta + 1, x1 = xorig + delta, y1 = yorig + delta, delta = delta + 1, x2 = xorig + delta, y2 = yorig + delta )) %>% select(-xorig, -yorig, -delta) %>% knitr::kable()
x0 | y0 | x1 | y1 | x2 | y2 |
---|---|---|---|---|---|
1 | 0.8414710 | 2 | 1.8414710 | 3 | 2.841471 |
2 | 0.9092974 | 3 | 1.9092974 | 4 | 2.909297 |
3 | 0.1411200 | 4 | 1.1411200 | 5 | 2.141120 |
4 | -0.7568025 | 5 | 0.2431975 | 6 | 1.243197 |
5 | -0.9589243 | 6 | 0.0410757 | 7 | 1.041076 |
The above notation is, however, a bit clunky for everyday use. We did not use the more direct seplyr::mutate_nse()
as we are (to lower maintenance effort) deprecating the direct non-standard evaluation methods in seplyr
in favor of code using seplyr::quote_mutate
or wrapr::qae()
.
One can instead use seplyr
as a code inspecting and re-writing tool with seplyr::factor_mutate()
.
cat(seplyr::factor_mutate( delta = 0, x0 = xorig + delta, y0 = yorig + delta, delta = delta + 1, x1 = xorig + delta, y1 = yorig + delta, delta = delta + 1, x2 = xorig + delta, y2 = yorig + delta ))
Warning in seplyr::factor_mutate(delta = 0, x0 = xorig + delta, y0 = yorig + : Mutate should be split into more than one stage. mutate(delta = 0) %>% mutate(x0 = xorig + delta, y0 = yorig + delta) %>% mutate(delta = delta + 1) %>% mutate(x1 = xorig + delta, y1 = yorig + delta) %>% mutate(delta = delta + 1) %>% mutate(x2 = xorig + delta, y2 = yorig + delta)
seplyr::factor_mutate()
both issued a warning and produced the factored code snippet seen above. We think this is in fact a different issue than explored in our prior note on dependency driven result corruption, and that fixes for the first issue did not fix this issue last time we looked.
And that why to continue to be careful when using multi assignment dplyr::mutate()
statements with database backed data.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.