Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The title may seem tautological, but since the arrival of dplyr 0.7.x, there have been some efforts at using dplyr without actually using it that I can’t quite understand. The tidyverse has raised passions, for and against it, for some time already. There are excellent alternatives out there, and I myself use them when I find them suitable. But when I choose to use dplyr, I find it most versatile, and I see no advantage in adding yet another layer that complicates things and makes problems even harder to debug.
Take the example of seplyr. It stands for standard evaluation dplyr, and enables us to program over dplyr without having “to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval”. Let’s consider the following interactive pipeline:
library(dplyr)

starwars %>%
  group_by(homeworld) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            mean_mass = mean(mass, na.rm = TRUE),
            count = n())
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows
Let’s say we want to parametrise the grouping variable and wrap the code above into a re-usable function. Apparently, this is difficult with dplyr. But is it? Not at all: we just need to add one line and a bang-bang (!!):
starwars_mean <- function(var) {
  var <- enquo(var)
  starwars %>%
    group_by(!!var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

starwars_mean(homeworld)
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows
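To inspect what that extra line captures, here is a minimal sketch. The helper captured_name() is hypothetical (it is not part of the original example), and the sketch assumes that dplyr re-exports enquo() and quo_name() from rlang, which it does in the versions I am aware of.

```r
library(dplyr)

# Hypothetical helper: instead of using the captured argument in a verb,
# return it as text so we can look at it.
captured_name <- function(var) {
  var <- enquo(var)  # quote the expression the caller typed, without evaluating it
  quo_name(var)      # convert the captured expression to a string for inspection
}

# The bare name is never evaluated, so no object called `homeworld` has to exist.
captured_name(homeworld)
```

Inside starwars_mean(), the same captured expression is spliced back into group_by() by the bang-bang, so the verb sees homeworld rather than var.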
The enquo() function quotes the name we put in our function (homeworld), and the bang-bang unquotes and uses that name instead of var. That’s it. What about seplyr? With seplyr, we just have to (and I quote)
- Change dplyr verbs to their matching seplyr “*_se()” adapters.
- Add quote marks around names and expressions.
- Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the “c()” notation.
- Replace “=” in expressions with “:=”.
This is the result:
library(seplyr)

starwars_mean <- function(my_var) {
  starwars %>%
    group_by_se(my_var) %>%
    summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                   "mean_mass" := "mean(mass, na.rm = TRUE)",
                   "count" := "n()"))
}

starwars_mean("homeworld")
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows
Basically, we had to change the entire pipeline. If re-usability was the goal, I think we lost some of it here. But, wait, we are still using non-standard evaluation in the first example. What if we really need to provide the grouping variable as a string? Easy enough, we just need to replace enquo() with as.name() to convert the string to a name:
starwars_mean <- function(var) {
  var <- as.name(var)
  starwars %>%
    group_by(!!var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

starwars_mean("homeworld")
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows
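As a quick check of why the string version works, the following sketch compares grouping by a symbol built with as.name() against grouping by the bare column name. The variable names by_symbol and by_bare are mine, chosen only for illustration.

```r
library(dplyr)

# as.name() turns a string into a symbol, i.e. the same kind of object
# R creates when we type a bare column name.
var <- as.name("homeworld")
is.symbol(var)

# Unquoting the symbol with !! should therefore group exactly like the bare name.
by_symbol <- starwars %>% group_by(!!var)
by_bare   <- starwars %>% group_by(homeworld)

identical(group_vars(by_symbol), group_vars(by_bare))
```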
But we can do even better if we remember that dplyr provides scoped variants (see ?dplyr::scoped) for most of the verbs. In this case, group_by_at() comes in handy:
starwars_mean <- function(var) {
  starwars %>%
    group_by_at(var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

starwars_mean("homeworld")
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows
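Because group_by_at() takes a character vector, the same function body should also accept several grouping columns at once. A small sketch, using a hypothetical name (starwars_mean_multi) so it does not clash with the function above:

```r
library(dplyr)

# Hypothetical variant of starwars_mean(): `vars` may hold one or more column names.
starwars_mean_multi <- function(vars) {
  starwars %>%
    group_by_at(vars) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              count = n())
}

# Group by two columns by passing a character vector of length two.
res <- starwars_mean_multi(c("homeworld", "species"))
names(res)
```

The grouping columns come first in the result, followed by the summaries, and nothing in the function had to change to go from one grouping variable to many.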
That’s it: no bang-bang, just strings and only one change to the original code. Let’s dwell on the potential of the scoped variants with a final example. We can make a completely generic re-usable “grouped mean” function using seplyr and R’s paste0() function to build up expressions:
grouped_mean <- function(data, grouping_variables, value_variables) {
  result_names <- paste0("mean_", value_variables)
  expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)")
  data %>%
    group_by_se(grouping_variables) %>%
    summarize_se(c(result_names := expressions,
                   "count" := "n()"))
}

starwars %>%
  grouped_mean("eye_color", c("mass", "birth_year"))
## # A tibble: 15 x 4
##        eye_color mean_mass mean_birth_year count
##            <chr>     <dbl>           <dbl> <int>
##  1         black  76.28571        33.00000    10
##  2          blue  86.51667        67.06923    19
##  3     blue-gray  77.00000        57.00000     1
##  4         brown  66.09231       108.96429    21
##  5          dark       NaN             NaN     1
##  6          gold       NaN             NaN     1
##  7 green, yellow 159.00000             NaN     1
##  8         hazel  66.00000        34.50000     3
##  9        orange 282.33333       231.00000     8
## 10          pink       NaN             NaN     1
## 11           red  81.40000        33.66667     5
## 12     red, blue       NaN             NaN     1
## 13       unknown  31.50000             NaN     3
## 14         white  48.00000             NaN     1
## 15        yellow  81.11111        76.38000    11
And the same with dplyr’s scoped verbs (note that I’ve added the last rename_at() on a whim, just to get the same column names as before, but it is not really necessary):
grouped_mean <- function(data, grouping_variables, value_variables) {
  data %>%
    group_by_at(grouping_variables) %>%
    mutate(count = n()) %>%
    summarise_at(c(value_variables, "count"), mean, na.rm = TRUE) %>%
    rename_at(value_variables, funs(paste0("mean_", .)))
}

starwars %>%
  grouped_mean("eye_color", c("mass", "birth_year"))
## # A tibble: 15 x 4
##        eye_color mean_mass mean_birth_year count
##            <chr>     <dbl>           <dbl> <dbl>
##  1         black  76.28571        33.00000    10
##  2          blue  86.51667        67.06923    19
##  3     blue-gray  77.00000        57.00000     1
##  4         brown  66.09231       108.96429    21
##  5          dark       NaN             NaN     1
##  6          gold       NaN             NaN     1
##  7 green, yellow 159.00000             NaN     1
##  8         hazel  66.00000        34.50000     3
##  9        orange 282.33333       231.00000     8
## 10          pink       NaN             NaN     1
## 11           red  81.40000        33.66667     5
## 12     red, blue       NaN             NaN     1
## 13       unknown  31.50000             NaN     3
## 14         white  48.00000             NaN     1
## 15        yellow  81.11111        76.38000    11
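For readers on a newer dplyr than the 0.7.x discussed here: as far as I know, funs() was deprecated in dplyr 0.8.0 and the scoped verbs were superseded by across() in dplyr 1.0.0. The following is only a sketch of how the same function might look there. It assumes dplyr 1.0.0 or later, and the name grouped_mean_across is hypothetical.

```r
library(dplyr)

# Sketch for dplyr >= 1.0.0 (an assumption; not the version used in this post).
grouped_mean_across <- function(data, grouping_variables, value_variables) {
  data %>%
    # all_of() selects columns from a character vector of names
    group_by(across(all_of(grouping_variables))) %>%
    summarise(
      # .names builds the "mean_" prefix, so no separate rename step is needed
      across(all_of(value_variables), ~ mean(.x, na.rm = TRUE),
             .names = "mean_{.col}"),
      count = n(),
      .groups = "drop"
    )
}

res <- starwars %>%
  grouped_mean_across("eye_color", c("mass", "birth_year"))
names(res)
```

Since count is computed directly with n() here rather than averaged, it stays an integer column, unlike in the summarise_at() version above.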
Wrapping up, the tidyeval paradigm may seem difficult at first glance, but don’t miss the wood for the trees: the new version of dplyr is full of tools that will make your life easier, not harder.