When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” It can be a problem when using R libraries that assume you know the variable names. The R data manipulation library dplyr currently supports parametric treatment of variables through “underbar forms” (methods of the form dplyr::*_), but their use can get tricky.
Rube Goldberg machine 1931 (public domain).
Better support for parametric treatment of variable names would be a boon to dplyr users. To this end the replyr package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete dplyr code to be used as if it were parametric.
The Problem
dplyr is a library that prefers you know the name of the column you want to work with. This is great when performing a specific analysis, but somewhat painful when supplying re-usable functions or packages. dplyr does have a complete parametric interface in its “underbar forms” (for example: using dplyr::filter_ instead of dplyr::filter). However, the underbar notation (and the related necessary details around specifying lazy evaluation of formulas) rapidly becomes difficult.
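For example, here is a sketch of filtering on a column whose name arrives as a string, using the underbar form together with the lazyeval package (this assumes the dplyr 0.5-era API; the notation follows dplyr’s non-standard-evaluation vignette):

```r
library('dplyr')
library('lazyeval')

d <- data.frame(Sepal_Length = c(5.8, 5.7), rank = c(1, 2))

# The column name arrives as a string, as it would inside a reusable function.
colName <- 'Sepal_Length'

# Build a lazy formula by hand, then pass it to the underbar method.
# This is the notation that quickly becomes hard to remember and maintain.
d %>% filter_(lazyeval::interp(~ v > 5.7, v = as.name(colName)))
```

Each parametric use of a verb requires another round of interp/as.name bookkeeping, which is the pain point the rest of this note tries to avoid.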
As a work-around, replyr now supplies an adapter that maps the column names you have (which can be supplied parametrically) to the concrete column names you wish you had (allowing you to write dplyr code simply in terms of known or assumed column names). It is easier to show than explain.
An Example
First we set up our libraries and type in some notional data as our example:
# install.packages('devtools') # Run this if you don't already have devtools
# devtools::install_github('WinVector/replyr') # Run this if you don't already have replyr
library('dplyr')
library('replyr')
d <- data.frame(Sepal_Length = c(5.8, 5.7),
                Sepal_Width = c(4.0, 4.4),
                Species = 'setosa',
                rank = c(1, 2))
print(d)
# Sepal_Length Sepal_Width Species rank
# 1 5.8 4.0 setosa 1
# 2 5.7 4.4 setosa 2
Then we rename the columns to standard values while restricting to only the named columns (this is the magic step):
nmap <- c('GroupColumn','ValueColumn','RankColumn')
names(nmap) <- c('Species','Sepal_Length','rank')
print(nmap)
# Species Sepal_Length rank
# "GroupColumn" "ValueColumn" "RankColumn"
dtmp <- replyr_renameRestrictCols(d,nmap)
print(dtmp)
# GroupColumn ValueColumn RankColumn
# 1 setosa 5.8 1
# 2 setosa 5.7 2
At this point you do know the column names (they are the ones you picked) and can write nice, neat dplyr code. You can then do your work:
# pretend this block is a huge sequence of complicated and expensive operations.
dtmp %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero
Notice we were able to use dplyr::mutate without needing dplyr::mutate_ (and without needing to go to Stack Overflow to look up the lazy-eval notation yet again).
Once you have your desired result you restore the original names of the restricted column set:
invmap <- names(nmap)
names(invmap) <- as.character(nmap)
print(invmap)
# GroupColumn ValueColumn RankColumn
# "Species" "Sepal_Length" "rank"
replyr_renameRestrictCols(dtmp,invmap)
# Species Sepal_Length rank
# 1 setosa 5.8 0
# 2 setosa 5.7 1
If you haven’t worked a lot with dplyr this won’t look that interesting. If you do work a lot with dplyr you may have been asking for something like this for quite a while. If you use the dplyr::*_ forms you will love replyr::replyr_renameRestrictCols. Be aware: replyr::replyr_renameRestrictCols is a bit of a hack; it mutates all of the columns it works with, which is unlikely to be a cheap operation.
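To make the semantics concrete, here is a minimal base-R sketch of a rename-and-restrict helper (a simplified stand-in for replyr::replyr_renameRestrictCols that only handles local data frames; the real method also works on remote dplyr data sources):

```r
# nmap: named character vector; names are the current column names,
#       values are the new names to map them to.
renameRestrictCols <- function(d, nmap) {
  d <- d[, names(nmap), drop = FALSE]  # restrict to only the mapped columns
  names(d) <- as.character(nmap)       # rename to the target names
  d
}

d <- data.frame(Sepal_Length = c(5.8, 5.7), Sepal_Width = c(4.0, 4.4),
                Species = 'setosa', rank = c(1, 2))
nmap <- c(Species = 'GroupColumn', Sepal_Length = 'ValueColumn',
          rank = 'RankColumn')
renameRestrictCols(d, nmap)
#   GroupColumn ValueColumn RankColumn
# 1      setosa         5.8          1
# 2      setosa         5.7          2
```

The inverse mapping (swapping the names and values of nmap) restores the original names, exactly as in the worked example above.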
A Proposal
I feel the replyr::replyr_renameRestrictCols interface represents the correct design for a better dplyr-based adapter. I’ll call this the “column view stack proposal.” I would suggest the addition of two operators to dplyr:

- view_as(df, columnNameMap): takes a data item and returns a data item reference that behaves as if the column names have been re-mapped.
- unview(): removes the view_as annotation.
Obviously there is an issue of nested views; I would suggest maintaining the views as a stack, applying the composite transformation implied by the stack of mapping specifications. I am assuming dplyr does not currently have such a facility. Another possibility is a term-rewriting engine to re-map formulas from standard names to target names, but this is what the lazy-eval notations are already attempting (and frankly it isn’t convenient or pretty).
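To illustrate the stack idea, here is a toy base-R implementation for local data frames (view_as and unview are the hypothetical names from the proposal, not actual dplyr functions; a real implementation would live inside dplyr and also cover remote sources):

```r
# Toy column-view stack for local data frames.
# view_as() pushes the current names onto a stack attribute, then renames;
# unview() pops the stack to restore the prior names.
view_as <- function(d, columnNameMap) {
  stack <- c(attr(d, 'viewStack'), list(names(d)))  # push current names
  idx <- match(names(columnNameMap), names(d))
  names(d)[idx] <- as.character(columnNameMap)
  attr(d, 'viewStack') <- stack
  d
}

unview <- function(d) {
  stack <- attr(d, 'viewStack')
  n <- length(stack)
  names(d) <- stack[[n]]  # pop: restore the prior names
  attr(d, 'viewStack') <- if (n > 1) stack[seq_len(n - 1)] else NULL
  d
}

d <- data.frame(Species = 'setosa', rank = c(1, 2))
v <- view_as(d, c(rank = 'RankColumn'))
names(v)          # "Species" "RankColumn"
names(unview(v))  # "Species" "rank"
```

Because each view_as records whatever names are current at push time, nested views compose naturally: each unview undoes exactly one level.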
I would also suggest that dplyr::arrange be enhanced with a visible annotation (just the column names it has arranged by) that lets the user check whether the data is believed to be ordered (crucial for window-function applications). With these two suggestions dplyr data sources would support three primary annotations:
- Groups: placed by dplyr::group_by, removed by dplyr::ungroup, and viewed by dplyr::groups.
- Orders: placed by dplyr::arrange, removed by Xdplyr::unarrange (which just removes the annotation, it does not undo the arrangement; the annotation is also removed by any operation that re-orders the data, such as a join), and viewed by Xdplyr::arrangement.
- Column views: placed by Xdplyr::view_as, removed by Xdplyr::unview, and viewed by Xdplyr::views.
The “Xdplyr::” items are the extensions being proposed.
An Alternative Proposal
Another possibility would be some sort of “let” statement that controls name bindings for the duration of a block of code. Using such a statement we would write our above example calculation as:
let(
  list(RankColumn = 'rank', GroupColumn = 'Species'),
  {
    # pretend this block is a huge sequence of complicated and expensive operations.
    d %>% mutate(RankColumn = RankColumn - 1) -> dtmp # start ranks at zero
  }
)
The idea is that the items 'rank' and 'Species' could be passed in parametrically (notice the let specification is essentially invmap, so we could just pass that in). This isn’t quite R’s “with” statement, as we are not binding names to values, but names to names. Essentially we are asking for a macro facility that is compatible with dplyr remote data sources (and the non-standard evaluation methods used to capture variable names). I haven’t experimented with gtools yet, so I don’t know if this is feasible (some R macro examples can be found here).
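For the record, here is roughly how such a let might be attempted in base R, by textually substituting names in the captured expression before evaluating it (a naive, hypothetical sketch; crude string substitution would also rewrite matching names inside strings, so this is an illustration, not a robust macro facility):

```r
# Naive 'let': rewrite the captured expression, replacing each alias name
# with its target name, then evaluate the result in the caller's frame.
let <- function(alias, expr) {
  body <- paste(deparse(substitute(expr)), collapse = '\n')
  for (nm in names(alias)) {
    body <- gsub(paste0('\\b', nm, '\\b'), alias[[nm]], body)
  }
  eval(parse(text = body), envir = parent.frame())
}

d <- data.frame(rank = c(1, 2))
let(
  list(RankColumn = 'rank'),
  {
    d$RankColumn <- d$RankColumn - 1  # rewritten to: d$rank <- d$rank - 1
    d
  }
)
#   rank
# 1    0
# 2    1
```

This works on this small case, but the next paragraph explains why I worry about making such a scheme reliable in general.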
If somebody has this working I would love to hear about it. The reason I worry is the following: clever meta-programming depends on all of the rest of the code not being clever and not breaking referential transparency. In this case dplyr is using non-standard evaluation to capture variable names, which is typically not referentially transparent. Also, R functions are roughly Lisp fexprs, which signals to me that R’s design may not have a fully clean distinction between functions and macros.
Conclusion
I feel the above methods will make working with parameterized variables in dplyr much easier.