Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; obviously the idea of reproducible research is an attempt to raise this standard). Production systems have to be durable: they have to remain correct as models, data, packages, users, and environments change over time.
Demonstration systems need merely glow in bright light among friends; production systems must be correct, even alone in the dark.
I have found: to deliver production worthy data science and predictive analytic systems, one has to develop per-team and per-project field tested recommendations and best practices. This is necessary even when, or especially when, these procedures differ from official doctrine.
What I want to do is share a single small piece of Win-Vector LLC‘s current guidance on using the R
package dplyr
.
-
Disclaimer: Win-Vector LLC has no official standing with RStudio, or
dplyr
development. -
However:
“One need not have been Caesar in order to understand Caesar.”
Alternately: Georg Simmmel or Max Webber.Win-Vector LLC, as a consultancy, has experience helping large companies deploy enterprise big data solutions involving
R
,dplyr
,sparklyr
, andApache Spark
. Win-Vector LLC, as a training organization, has experience in how new users perceive, reason about, and internalize how to useR
anddplyr
. Our group knows how to help deploy production grade systems, and how to help new users master these systems.
From experience we have distilled a lot of best practices. And below we will share one.
From: “R for Data Science; Whickham, Grolemund; O’Reilly, 2017” we have:
Note that you can refer to columns that you’ve just created:
mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours )
Let’s try that with database backed data:
suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") # [1] ‘0.7.3’ db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") flights <- copy_to(db, nycflights13::flights, 'flights') mutate(flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) # # Source: lazy query [?? x 22] # # Database: sqlite 3.19.3 [:memory:] # year month day dep_time sched_dep_time ... # <int> <int> <int> <int> <int> ... # 1 2013 1 1 517 515 ... # ...
That worked. One of the selling points of dplyr
is a lot of dplyr
is source-generic or source-agnostic: meaning it can be run against different data providers (in-memory, databases, Spark
).
However, if a new user tries to extend such an example (say adding gain_per_minutes
) they run into this:
mutate(flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours, gain_per_minute = 60 * gain_per_hour ) # Error in rsqlite_send_query(conn@ptr, statement) : # no such column: gain_per_hour
(Some detail on the failing query are here.)
It is hard for experts to understand how frustrating the above is to a new R
user or to a part time R
user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.
This quickly leads to strong feelings of learned helplessness and anxiety.
Our rule for dplyr::mutate()
has been for some time:
Each column name used in a single mutate must appear only on the left-hand-side of a single assignment, or otherwise on the right-hand-side of any number of assignments (but never both sides, even if it is different assignments).
Under this rule neither of the above mutates()
are allowed. The second should be written as (switching to pipe-notation):
flights %>% mutate(gain = arr_delay - dep_delay, hours = air_time / 60) %>% mutate(gain_per_hour = gain / hours) %>% mutate(gain_per_minute = 60 * gain_per_hour)
And the above works.
If we teach this rule we can train users to be properly cautious, and hopefully avoid them becoming frustrated, scared, anxious, or angry.
dplyr
documentation (such as “help(mutate)
“) does not strongly commit to what order mutate expressions are executed in, or visibility and durability of intermediate results (i.e., a full description of intended semantics). Our rule intentionally limits the user to a set of circumstances where none of those questions matter.
Now the error we saw above is a mere bug that one expects will be fixed some day (in fact it is dplyr
issue 3095, we looked a bit at the generate queries here). It can be a bit unfair to criticize a package for having a bug.
However, confusion around re-use of column names has been driving dplyr
issues for quite some time:
dplyr
issue 3095dplyr
issue 2884dplyr
issue 2883dplyr
pull 2869dplyr
issue 2842dplyr
pull 2483dplyr
issue 2481dplyr
issue 2360
It makes sense to work in a reliable and teachable sub-dialect of dplyr
that will serve users well (or barring that, you can use an adapter, such as seplyr
). In production you must code to what systems are historically reliably capable of, not just the specification.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.