Dynamic column/variable names with dplyr using Standard Evaluation functions
Data manipulation works like a charm in R when using a library like dplyr. An often overlooked feature of this library is Standard Evaluation (SE), which is described in the vignette on the related Non-standard Evaluation. It allows you to use dynamic arguments in many dplyr functions (“verbs”).
When is this useful?
In dplyr you specify the columns you want to work with directly, without quoting them (i.e. without turning them into character strings):
# works:
mtcars %>% select(mpg, cyl)
# does not work:
mtcars %>% select('mpg', 'cyl')
# -> Error: All select() inputs must resolve to integer column positions.
This is called Non-standard Evaluation (NSE). It’s good because it saves typing, but at the same time you can’t easily use dynamic arguments as you could with strings. Dynamic arguments are often necessary when you write loops that perform the same type of data manipulation one by one for different columns/variables. More generally, you need dynamic arguments when you’re writing functions that should not just solve a problem for one specific data set or column, but work with several kinds of data sets or columns (see also the Don’t Repeat Yourself (DRY) Principle).
How to use the SE-versions of dplyr verbs
This is where Standard Evaluation (SE) comes into play. The SE-versions of dplyr verbs always end with an underscore, for example select_() or group_by_():
# using the SE-version select_()
# now this works:
mtcars %>% select_('mpg', 'cyl')
To pass a dynamically specified set of arguments to an SE-enabled dplyr function, we need to use the special .dots argument and pass it a list of strings:
# this is the same as above:
mtcars %>% select_(.dots = list('mpg', 'cyl'))
Of course this isn’t very useful so far, because it is not really “dynamic”. As an easy example, let’s say we want to select different sets of columns and print the first rows of each. We define a list of lists, vars, and loop through it. Each v in vars is a list of arguments passed to select_().
vars <- list(list('cyl', 'mpg'), list('vs', 'disp'))
for (v in vars) {
  print(mtcars %>% select_(.dots = v) %>% head)
}

# output
                  cyl  mpg
Mazda RX4           6 21.0
Mazda RX4 Wag       6 21.0
Datsun 710          4 22.8
Hornet 4 Drive      6 21.4
Hornet Sportabout   8 18.7
Valiant             6 18.1
                  vs disp
Mazda RX4          0  160
Mazda RX4 Wag      0  160
Datsun 710         1  108
Hornet 4 Drive     1  258
Hornet Sportabout  0  360
Valiant            1  225
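The same idea can be wrapped in a small function, so that the column names arrive as an ordinary character vector. This is only a sketch: select_cols() is a made-up helper name, not part of dplyr, and it assumes a dplyr version that still ships the underscore verbs (they have been deprecated since dplyr 0.7.0, so newer releases will print a deprecation warning).

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): select columns whose names
# are only known at run time, given as a character vector.
select_cols <- function(df, cols) {
  # .dots is given a list of strings here, so convert the vector first
  df %>% select_(.dots = as.list(cols))
}

# The column names could come from user input, a config file, etc.
wanted <- c("mpg", "cyl")
print(head(select_cols(mtcars, wanted)))
```

Keeping the string handling inside one function means the calling code never has to deal with .dots itself.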
Let’s make something more practical. For each list of variable arguments, we want to group by the first variable and then summarise the grouped data frame by calculating the mean of the second variable. Here, dynamic argument construction really comes into play, because we programmatically construct the arguments of summarise_(), e.g. mean_mpg = mean(mpg), using string concatenation and setNames():
summarise_vars <- list(list('cyl', 'mpg'), list('vs', 'disp'))
for (v in summarise_vars) {
  group_var <- v[1]                          # group by this variable
  summ <- paste0('mean(', v[2], ')')         # construct summary method, e.g. mean(mpg)
  summ_name <- paste0('mean_', v[2])         # construct summary variable name, e.g. mean_mpg
  print(paste('grouping by', group_var, 'and summarising', summ))
  df_summ <- mtcars %>%
    group_by_(.dots = group_var) %>%
    summarise_(.dots = setNames(summ, summ_name))
  print(df_summ)
}

# output
[1] "grouping by cyl and summarising mean(mpg)"
# A tibble: 3 × 2
    cyl mean_mpg
1     4 26.66364
2     6 19.74286
3     8 15.10000
[1] "grouping by vs and summarising mean(disp)"
# A tibble: 2 × 2
     vs mean_disp
1     0  307.1500
2     1  132.4571
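The underscore verbs used above have been deprecated since dplyr 0.7.0. For readers on a newer release, here is a rough equivalent of the grouping loop, assuming dplyr 1.0.0 or later. It relies on all_of() and across() instead of string concatenation. The helper name mean_by() is hypothetical and only serves the illustration.

```r
library(dplyr)

# Hypothetical helper: group by one column and average another,
# with both column names passed as strings.
mean_by <- function(df, group_var, value_var) {
  df %>%
    # all_of() turns a character vector into a column selection
    group_by(across(all_of(group_var))) %>%
    summarise(
      # .names builds the output column name, e.g. mean_mpg
      across(all_of(value_var), mean, .names = "mean_{.col}"),
      .groups = "drop"
    )
}

summarise_vars <- list(c("cyl", "mpg"), c("vs", "disp"))
for (v in summarise_vars) {
  print(mean_by(mtcars, v[1], v[2]))
}

# Plain column selection works the same way:
print(head(select(mtcars, all_of(c("cyl", "mpg")))))
```

Because the column names stay ordinary strings until all_of() resolves them, no expression has to be pasted together and parsed.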