Site icon R-bloggers

My advice on dplyr::mutate()

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.

Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.


“Character is what you are in the dark.”

John Whorfin quoting Dwight L. Moody.

I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.

What I want to do is share a single small piece of Win-Vector LLC‘s current guidance on using the R package dplyr.

From experience we have distilled a lot of best practices. And below we will share one.

From: “R for Data Science; Whickham, Grolemund; O’Reilly, 2017” we have:

Note that you can refer to columns that you’ve just created:

mutate(flights_sml,
   gain = arr_delay - dep_delay,
   hours = air_time / 60,
   gain_per_hour = gain / hours
 )

Let’s try that with database backed data:

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
# [1] ‘0.7.3’

db <- DBI::dbConnect(RSQLite::SQLite(), 
                     ":memory:")
flights <- copy_to(db, 
             nycflights13::flights,
             'flights')

mutate(flights,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours
)
# # Source:   lazy query [?? x 22]
# # Database: sqlite 3.19.3 [:memory:]
# year month   day dep_time sched_dep_time        ...
# <int> <int> <int>    <int>          <int>       ...
#   1  2013     1     1      517            515   ...
# ...

That worked. One of the selling points of dplyr is a lot of dplyr is source-generic or source-agnostic: meaning it can be run against different data providers (in-memory, databases, Spark).

However, if a new user tries to extend such an example (say adding gain_per_minutes) they run into this:

mutate(flights,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours,
       gain_per_minute = 60 * gain_per_hour
)
# Error in rsqlite_send_query(conn@ptr, statement) : 
#   no such column: gain_per_hour

(Some detail on the failing query are here.)

It is hard for experts to understand how frustrating the above is to a new R user or to a part time R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.

This quickly leads to strong feelings of learned helplessness and anxiety.

Our rule for dplyr::mutate() has been for some time:

Each column name used in a single mutate must appear only on the left-hand-side of a single assignment, or otherwise on the right-hand-side of any number of assignments (but never both sides, even if it is different assignments).

Under this rule neither of the above mutates() are allowed. The second should be written as (switching to pipe-notation):

flights %>%
  mutate(gain = arr_delay - dep_delay,
         hours = air_time / 60) %>%
  mutate(gain_per_hour = gain / hours) %>%
  mutate(gain_per_minute = 60 * gain_per_hour)

And the above works.

If we teach this rule we can train users to be properly cautious, and hopefully avoid them becoming frustrated, scared, anxious, or angry.

dplyr documentation (such as “help(mutate)“) does not strongly commit to what order mutate expressions are executed in, or visibility and durability of intermediate results (i.e., a full description of intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.

Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is dplyr issue 3095, we looked a bit at the generate queries here). It can be a bit unfair to criticize a package for having a bug.

However, confusion around re-use of column names has been driving dplyr issues for quite some time:

It makes sense to work in a reliable and teachable sub-dialect of dplyr that will serve users well (or barring that, you can use an adapter, such as seplyr). In production you must code to what systems are historically reliably capable of, not just the specification.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.