Dynamic column/variable names with dplyr using Standard Evaluation functions

Posted on September 27, 2016 by Markus Konrad in R bloggers

[This article was first published on r-bloggers – WZB Data Science Blog, and kindly contributed to R-bloggers.]

Data manipulation works like a charm in R when using a library like dplyr. An often overlooked feature of this library is Standard Evaluation (SE), which is described in the vignette on the related Non-standard Evaluation (NSE). It allows you to use dynamic arguments in many dplyr functions (“verbs”).

When is this useful?

In dplyr you specify the columns you want to work with directly, without quoting them (i.e. without turning them into character strings):

library(dplyr)

# works:
mtcars %>% select(mpg, cyl)

# does not work (as of dplyr 0.5):
mtcars %>% select('mpg', 'cyl')
# -> Error: All select() inputs must resolve to integer column positions.

This is called Non-standard Evaluation (NSE). It’s good because it saves typing, but at the same time you can’t easily use dynamic arguments as you could by using strings. Dynamic arguments are often necessary when you write loops that perform the same type of data manipulation one-by-one for different columns/variables. More generally, you need dynamic arguments when you’re writing functions that do not just solve a problem for a specific data set or a specific column in a data set, but should work with several kinds of data sets or columns (see also the Don’t Repeat Yourself (DRY) Principle).
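
To make the idea of string-based, dynamic arguments concrete, here is a minimal base R sketch (no dplyr involved) that loops over column names held in a character vector. The variable names cols and col_means are made up for this illustration.

```r
# Column names stored as strings, e.g. built in code or read from a config
cols <- c('mpg', 'disp')

# Collect the mean of each column; mtcars[[col]] looks the column up by name
col_means <- numeric(0)
for (col in cols) {
  col_means[col] <- mean(mtcars[[col]])
}

print(col_means)
```

Because col is an ordinary string, the loop body never has to change when the set of columns does. This is the flexibility that unquoted column names in NSE verbs do not offer directly.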

How to use the SE-versions of dplyr verbs

This is where Standard Evaluation (SE) comes into play. The SE versions of dplyr verbs always end with an underscore, for example select_() or group_by_():

# using the SE-version select_()
# now this works:
mtcars %>% select_('mpg', 'cyl')

To pass a dynamically specified set of arguments to an SE-enabled dplyr function, we need to use the special .dots argument and pass it a list of strings:

# this is the same as above:
mtcars %>% select_(.dots = list('mpg', 'cyl'))
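
The list given to .dots does not have to be written out literally in the call. As a small sketch, assuming a dplyr version that still provides the underscore verbs (they were deprecated in later releases), the column names can come from a variable. The name wanted_cols is invented for this example.

```r
library(dplyr)

# Column names held in a variable instead of being typed into the call
wanted_cols <- list('mpg', 'cyl')

# Same selection as above, but driven by the contents of wanted_cols
selected <- mtcars %>% select_(.dots = wanted_cols)

print(names(selected))
```

Changing the contents of wanted_cols is then enough to change which columns are selected.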

Of course, this is not very useful yet, because nothing about it is really “dynamic”. As an easy example, let’s say we want to select individual columns and print the first rows. We define a list of lists, vars, and loop through it. Each v in vars is a list of arguments passed to select_().

vars <- list(list('cyl', 'mpg'), list('vs', 'disp'))
for (v in vars) {
  print(mtcars %>% select_(.dots = v) %>% head)
}

                  cyl  mpg
Mazda RX4           6 21.0
Mazda RX4 Wag       6 21.0
Datsun 710          4 22.8
Hornet 4 Drive      6 21.4
Hornet Sportabout   8 18.7
Valiant             6 18.1
                  vs disp
Mazda RX4          0  160
Mazda RX4 Wag      0  160
Datsun 710         1  108
Hornet 4 Drive     1  258
Hornet Sportabout  0  360
Valiant            1  225

Let’s do something more practical. For each list of variable arguments, we want to group by the first variable and then summarise the grouped data frame by calculating the mean of the second variable. Here, dynamic argument construction really comes into play, because we programmatically construct the arguments of summarise_(), e.g. mean_mpg = mean(mpg), using string concatenation and setNames():

summarise_vars <- list(list('cyl', 'mpg'), list('vs', 'disp'))

for (v in summarise_vars) {
  group_var <- v[1]   # group by this variable
  summ <- paste0('mean(', v[2], ')')  # construct summary method, e.g. mean(mpg)
  summ_name <- paste0('mean_', v[2])  # construct summary variable name, e.g. mean_mpg

  print(paste('grouping by', group_var, 'and summarising', summ))

  df_summ <- mtcars %>%
    group_by_(.dots = group_var) %>%
    summarise_(.dots = setNames(summ, summ_name))

  print(df_summ)
}
# output
[1] "grouping by cyl and summarising mean(mpg)"
# A tibble: 3 × 2
    cyl mean_mpg
  <dbl>    <dbl>
1     4 26.66364
2     6 19.74286
3     8 15.10000
[1] "grouping by vs and summarising mean(disp)"
# A tibble: 2 × 2
     vs mean_disp
  <dbl>     <dbl>
1     0  307.1500
2     1  132.4571
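
Following the DRY idea mentioned earlier, the loop body can also be moved into a function. The following is only a sketch under the assumption that the installed dplyr version still ships group_by_() and summarise_(). The function name mean_by and its parameters are hypothetical and not part of dplyr.

```r
library(dplyr)

# Group df by the column named in group_var and compute the mean of the
# column named in value_var; both column names are passed as strings.
mean_by <- function(df, group_var, value_var) {
  summ <- paste0('mean(', value_var, ')')   # e.g. "mean(mpg)"
  summ_name <- paste0('mean_', value_var)   # e.g. "mean_mpg"

  df %>%
    group_by_(.dots = group_var) %>%
    summarise_(.dots = setNames(summ, summ_name))
}

# Usage: mean mpg per number of cylinders
res <- mean_by(mtcars, 'cyl', 'mpg')
print(res)
```

The function builds the same strings as the loop above, so calling it with 'vs' and 'disp' should reproduce the second summary as well.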
