Lesser known dplyr 0.7* tricks
This blog post is an update to an older one I wrote in March. In the post from March, dplyr was at version 0.5.0, but since then a major update introduced some changes that make some of the tips in that post obsolete. So here I revisit the blog post from March using dplyr 0.7.0.
Create new columns with mutate() and case_when()
The basic things such as selecting columns, renaming them, filtering, etc. did not change with this new version. What did change, however, is creating new columns using case_when(). First, load dplyr and the mtcars dataset:
library("dplyr")
data(mtcars)
This was how it was done in version 0.5.0 (notice the ‘.$’ symbol before the variable ‘carb’):
mtcars %>%
  mutate(carb_new = case_when(.$carb == 1 ~ "one",
                              .$carb == 2 ~ "two",
                              .$carb == 4 ~ "four",
                              TRUE ~ "other")) %>%
  head(5)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb carb_new
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     four
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     four
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1      one
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1      one
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2      two
This has been simplified to:
mtcars %>%
  mutate(carb_new = case_when(carb == 1 ~ "one",
                              carb == 2 ~ "two",
                              carb == 4 ~ "four",
                              TRUE ~ "other")) %>%
  head(5)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb carb_new
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     four
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     four
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1      one
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1      one
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2      two
No need for .$ anymore.
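One thing worth keeping in mind when writing these conditions: case_when() checks them from top to bottom and keeps the first one that is TRUE, which is why the catch-all TRUE goes last. Here is a small sketch on a made-up vector x (not part of mtcars) that shows the ordering:

```r
library(dplyr)

# x is a made-up vector, used only to illustrate the ordering
x <- c(1, 2, 4, 8)

labels <- case_when(
  x == 1 ~ "one",
  x <= 2 ~ "at most two",  # x == 1 also satisfies this, but was already matched above
  TRUE   ~ "other"         # catch-all for everything left over
)

labels
```

If the x <= 2 line came first, the value 1 would be labelled "at most two" instead of "one".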
Apply a function to certain columns only, by rows, with purrrlyr
dplyr wasn’t the only package to get an overhaul; purrr got the same treatment.
In the past, I applied a function to certain columns like this:
mtcars %>%
  select(am, gear, carb) %>%
  purrr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2

head(mtcars2)
Now, by_row() does not exist in purrr anymore; instead, a new package called purrrlyr was introduced, with functions that don’t really fit inside purrr or dplyr:
mtcars %>%
  select(am, gear, carb) %>%
  purrrlyr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2

head(mtcars2)
## # A tibble: 6 x 4
##      am  gear  carb sum_am_gear_carb
##   <dbl> <dbl> <dbl>            <dbl>
## 1     1     4     4                9
## 2     1     4     4                9
## 3     1     4     1                6
## 4     0     3     1                4
## 5     0     3     2                5
## 6     0     3     1                4
Think of purrrlyr as purrr’s and dplyr’s love child.
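For a plain sum like this one, a vectorised mutate() should give the same column without purrrlyr. This is only a sketch of an alternative, not the approach used in this post, and it only works because + is vectorised; for a function that has to see one row at a time, by_row() is still the way I would go:

```r
library(dplyr)

mtcars2 <- mtcars %>%
  select(am, gear, carb) %>%
  mutate(sum_am_gear_carb = am + gear + carb)  # row-wise sum via vectorised addition

head(mtcars2)
```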
Using dplyr functions inside your own functions, or what is tidyeval
Programming with dplyr has been simplified a lot. Before version 0.7.0, one needed to use dplyr in conjunction with lazyeval to use dplyr functions inside one’s own functions. It was not always very easy, especially if you mixed columns and values inside your functions. Here’s the example from the March blog post:
extract_vars <- function(data, some_string){
  data %>%
    select_(lazyeval::interp(~contains(some_string))) -> data
  return(data)
}

extract_vars(mtcars, "spam")
More examples are available in this other blog post.
I will revisit them now with dplyr’s new tidyeval syntax. I’d recommend you read the Tidy evaluation vignette here. This vignette is part of the rlang package, which gets used under the hood by dplyr for all your programming needs. Here is the function I called simpleFunction(), written with the old dplyr syntax:
simpleFunction <- function(dataset, col_name){
  dataset %>%
    group_by_(col_name) %>%
    summarise(mean_mpg = mean(mpg)) -> dataset
  return(dataset)
}

simpleFunction(mtcars, "cyl")
## # A tibble: 3 x 2
##     cyl mean_mpg
##   <dbl>    <dbl>
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000
With the new syntax, it must be rewritten a little bit:
simpleFunction <- function(dataset, col_name){
  col_name <- enquo(col_name)
  dataset %>%
    group_by(!!col_name) %>%
    summarise(mean_mpg = mean(mpg)) -> dataset
  return(dataset)
}

simpleFunction(mtcars, cyl)
## # A tibble: 3 x 2
##     cyl mean_mpg
##   <dbl>    <dbl>
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000
What has changed? Forget the underscore versions of the usual functions such as select_(), group_by_(), etc. Now, you must quote the column name using enquo() (or just quo() if working interactively, outside a function), which returns a quosure. This quosure can then be unquoted by putting !! in front of it inside the usual dplyr functions.
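To see the quote-then-unquote idea without wrapping it in a function, here is a minimal interactive sketch using quo(). The names grp and by_cyl are just placeholders I picked for this illustration:

```r
library(dplyr)

grp <- quo(cyl)  # quote the column name; nothing is looked up yet

by_cyl <- mtcars %>%
  group_by(!!grp) %>%               # !! unquotes, so this behaves like group_by(cyl)
  summarise(mean_mpg = mean(mpg))

by_cyl
```

The result should match what simpleFunction(mtcars, cyl) returns: one row per value of cyl.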
Let’s look at another example:
simpleFunction <- function(dataset, col_name, value){
  filter_criteria <- lazyeval::interp(~y == x, .values = list(y = as.name(col_name), x = value))
  dataset %>%
    filter_(filter_criteria) %>%
    summarise(mean_cyl = mean(cyl)) -> dataset
  return(dataset)
}

simpleFunction(mtcars, "am", 1)
##   mean_cyl
## 1 5.076923
As you can see, it’s a bit more complicated, as you needed to use lazyeval::interp() to make it work. With the improved dplyr, here’s how it’s done:
simpleFunction <- function(dataset, col_name, value){
  col_name <- enquo(col_name)
  dataset %>%
    filter((!!col_name) == value) %>%
    summarise(mean_cyl = mean(cyl)) -> dataset
  return(dataset)
}

simpleFunction(mtcars, am, 1)
##   mean_cyl
## 1 5.076923
Much, much easier! There is something that you must pay attention to, though. Notice that I’ve written:
filter((!!col_name) == value)
and not:
filter(!!col_name == value)
I have enclosed !!col_name inside parentheses. I struggled with this, but thanks to help from @dmi3k and @_lionelhenry I was able to understand what was happening (isn’t the #rstats community on twitter great?).
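The parentheses matter because of how R parses !: it has lower precedence than ==, so without them the whole comparison ends up under the !! instead of just the column name. You can check this with base R alone, since quote() only parses and evaluates nothing. The symbols x and value below are placeholders:

```r
# quote() captures the parsed expression without evaluating it
without_parens <- quote(!!x == value)
with_parens    <- quote((!!x) == value)

# The first element of a call is its outermost operator
as.character(without_parens[[1]])  # "!"  : the negation wraps the comparison
as.character(with_parens[[1]])     # "==" : the comparison is the outer call

# Peeling off both "!" leaves the full comparison, not just x
without_parens[[2]][[2]]
```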
One last thing: let’s make this function a bit more general. I hard-coded the variable cyl inside the body of the function, but maybe you’d like the mean of another variable? Easy:
simpleFunction <- function(dataset, group_col, mean_col, value){
  group_col <- enquo(group_col)
  mean_col <- enquo(mean_col)
  dataset %>%
    filter((!!group_col) == value) %>%
    summarise(mean((!!mean_col))) -> dataset
  return(dataset)
}

simpleFunction(mtcars, am, cyl, 1)
##   mean((cyl))
## 1    5.076923
«That’s very nice Bruno, but mean((cyl)) in the output looks ugly as sin» you might think, and you’d be right. It is possible to set the name of the column in the output using := instead of =:
simpleFunction <- function(dataset, group_col, mean_col, value){
  group_col <- enquo(group_col)
  mean_col <- enquo(mean_col)
  mean_name <- paste0("mean_", mean_col)[2]
  dataset %>%
    filter((!!group_col) == value) %>%
    summarise(!!mean_name := mean((!!mean_col))) -> dataset
  return(dataset)
}

simpleFunction(mtcars, am, cyl, 1)
##   mean_cyl
## 1 5.076923
To get the name of the column I added this line:
mean_name <- paste0("mean_", mean_col)[2]
To see what it does, try the following inside an R interpreter (remember to use quo() instead of enquo() outside functions!):
paste0("mean_", quo(cyl))
## [1] "mean_~"   "mean_cyl"
enquo() quotes the input, and with paste0() it gets converted to a string that can be used as a column name. However, the ~ is in the way and the output of paste0() is a vector of two strings: the correct name is contained in the second element, hence the [2]. There might be a more elegant way of doing that, but for now this has been working well for me.
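One candidate for that more elegant way is quo_name(), which comes from rlang and is re-exported by dplyr; it turns a quosure into a single string, so the [2] is not needed. This is a sketch of a variant I would try, assuming quo_name() is available in your installed version, rather than the version used above:

```r
library(dplyr)

simpleFunction <- function(dataset, group_col, mean_col, value){
  group_col <- enquo(group_col)
  mean_col <- enquo(mean_col)
  # quo_name() returns one string ("cyl"), so no indexing is required
  mean_name <- paste0("mean_", quo_name(mean_col))
  dataset %>%
    filter((!!group_col) == value) %>%
    summarise(!!mean_name := mean(!!mean_col))
}

result <- simpleFunction(mtcars, am, cyl, 1)
result
```

The output column should again be called mean_cyl.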
That was it folks! I do recommend you read the Programming with dplyr vignette here as well as other blog posts, such as the one recommended to me by @dmi3k here.
Have fun with dplyr 0.7.0!