Site icon R-bloggers

Please inspect your dplyr+database code

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements.

If you are using the R dplyr package with a database or with Apache Spark: I respectfully advise you inspect your code to ensure you are not using any values created inside a dplyr::mutate() statement inside the same dplyr::mutate() statement. This has been my coding advice for some time, and it is a simple and safe re-factoring to break up such statements into safer sequences (simply by introducing more dplyr::mutate()s).

I have since encountered a non-signaling (or silent) result corruption version of the issue. We are now advising code inspection as we now have confirmation that not seeing a thrown error is not a reliable indication of correct execution and correct results.

To keep things in proportion: if you are not writing multi-assignment mutates on a dplyr database-backed system you can’t run into the problem (though, for performance, multi-statement mutates are preferred over database sources such as Apache Spark).

The issue has been reported to the dplyr team. And I presume a fix is in the works. However, one does not want to be distributing incorrect results in the interim. This is the advice I have been giving private clients. After some thought I have come to feel it would be unfair to withhold such advice from the larger R community. This is not meant to make dplyr look bad, but to try and help prevent both dplyr and dplyr users from unnecessarily looking bad.

To be clear: I am a proponent of dplyr plus database development (which is why I ran into this). Also, I am not affiliated with RStudio or affiliated with the dplyr development team.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.