Aggregation with dplyr: summarise and summarise_each
This article is an extract from the course “Efficient Data Manipulation with R” that the author, Andrea Spanò, kindly provided us.
Introduction
We use summarise() with aggregate functions, which take a vector of values and return a single number. The function summarise_each() offers an alternative approach to summarise() with identical results.
This post compares the behavior of summarise() and summarise_each() with respect to two factors we can control:
- How many variables to manipulate
  - 1A. a single variable
  - 1B. more than one variable
- How many functions to apply to each variable
  - 2A. a single function
  - 2B. more than one function
resulting in the following four cases:
- Case 1: apply one function to one variable
- Case 2: apply many functions to one variable
- Case 3: apply one function to many variables
- Case 4: apply many functions to many variables
Each of these four cases is also tested with and without group_by().
The mtcars data frame
For this article we will use the well-known mtcars data frame.
We will first transform it into a tbl_df object; nothing changes in the underlying data.frame, but a much better print method becomes available.
Finally, to keep this article tidy and clean, we will select only three variables of interest:
library(dplyr)
mtcars <- mtcars %>% tbl_df() %>% select(cyl, mpg, disp)
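In more recent dplyr releases tbl_df() is deprecated. The following is a minimal sketch of the same preparation step, assuming dplyr 1.0.0 or later with the tibble package installed. The object name cars3 is hypothetical and is used only to avoid overwriting the built-in mtcars; it is not part of the original article.

```r
library(dplyr)

# Assumption: dplyr >= 1.0.0; as_tibble() from the tibble package
# plays the role that tbl_df() plays in the article.
cars3 <- mtcars %>%
  tibble::as_tibble() %>%
  select(cyl, mpg, disp)

# The result is still a data.frame, now with the tibble print method.
class(cars3)
dim(cars3)
```

The rest of the article works the same way on either object, since only the print method differs.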
Case 1: apply one function to one variable
In this case, summarise() is the simplest candidate.
# without group
mtcars %>% summarise(mean_mpg = mean(mpg))
## Source: local data frame [1 x 1]
##
##   mean_mpg
##      (dbl)
## 1 20.09062
# with group
mtcars %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg))
## Source: local data frame [3 x 2]
##
##     cyl mean_mpg
##   (dbl)    (dbl)
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000
We could use summarise_each() as well, but its usage results in a loss of clarity.
# without group
mtcars %>% summarise_each(funs(mean), mean_mpg = mpg)
## Source: local data frame [1 x 1]
##
##   mean_mpg
##      (dbl)
## 1 20.09062
# with group
mtcars %>% group_by(cyl) %>% summarise_each(funs(mean), mean_mpg = mpg)
## Source: local data frame [3 x 2]
##
##     cyl mean_mpg
##   (dbl)    (dbl)
## 1     4 26.66364
## 2     6 19.74286
## 3     8 15.10000
Case 2: apply many functions to one variable
In this case we can use both summarise() and summarise_each().
The function summarise() has a more intuitive syntax:
# without group
mtcars %>% summarise(min_mpg = min(mpg), max_mpg = max(mpg))
## Source: local data frame [1 x 2]
##
##   min_mpg max_mpg
##     (dbl)   (dbl)
## 1    10.4    33.9
# with group
mtcars %>% group_by(cyl) %>% summarise(min_mpg = min(mpg), max_mpg = max(mpg))
## Source: local data frame [3 x 3]
##
##     cyl min_mpg max_mpg
##   (dbl)   (dbl)   (dbl)
## 1     4    21.4    33.9
## 2     6    17.8    21.4
## 3     8    10.4    19.2
The names of the output variables can be specified in simple forms like: max_mpg = max(mpg)
When we apply many functions to one variable, the use of summarise_each() provides a more compact and tidy notation:
# without group
mtcars %>% summarise_each(funs(min, max), mpg)
## Source: local data frame [1 x 2]
##
##     min   max
##   (dbl) (dbl)
## 1  10.4  33.9
# with group
mtcars %>% group_by(cyl) %>% summarise_each(funs(min, max), mpg)
## Source: local data frame [3 x 3]
##
##     cyl   min   max
##   (dbl) (dbl) (dbl)
## 1     4  21.4  33.9
## 2     6  17.8  21.4
## 3     8  10.4  19.2
The names of the output variables are given by the names of the functions: min and max. In this case we lose the name of the variable the function is applied to. If we prefer something like min_mpg and max_mpg, we have to rename the functions we call within funs():
# without group
mtcars %>% summarise_each(funs(min_mpg = min, max_mpg = max), mpg)
## Source: local data frame [1 x 2]
##
##   min_mpg max_mpg
##     (dbl)   (dbl)
## 1    10.4    33.9
# with group
mtcars %>% group_by(cyl) %>% summarise_each(funs(min_mpg = min, max_mpg = max), mpg)
## Source: local data frame [3 x 3]
##
##     cyl min_mpg max_mpg
##   (dbl)   (dbl)   (dbl)
## 1     4    21.4    33.9
## 2     6    17.8    21.4
## 3     8    10.4    19.2
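Readers on a recent dplyr may find that summarise_each() and funs() emit deprecation warnings. As a hedged sketch of Case 2 in the newer style (an assumption: dplyr 1.0.0 or later, where across() exists; this is not the article's original code), the function names can be combined with the variable name through the .names argument. The object name case2 is hypothetical.

```r
library(dplyr)

# Assumption: dplyr >= 1.0.0. across() applies each function in the
# named list to mpg; .names builds "<function>_<variable>" column names.
case2 <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(mpg, list(min = min, max = max), .names = "{.fn}_{.col}"))

case2
```

Here the names given in the list (min, max) play the same role as the names given inside funs() in the example above.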
Case 3: apply one function to many variables
This case is very similar to Case 2. Both summarise() and summarise_each() can be used.
The function summarise() again has a more intuitive syntax, and the names of the output variables can be specified in the usual simple form: mean_mpg = mean(mpg)
# without group
mtcars %>% summarise(mean_mpg = mean(mpg), mean_disp = mean(disp))
## Source: local data frame [1 x 2]
##
##   mean_mpg mean_disp
##      (dbl)     (dbl)
## 1 20.09062  230.7219
# with group
mtcars %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg), mean_disp = mean(disp))
## Source: local data frame [3 x 3]
##
##     cyl mean_mpg mean_disp
##   (dbl)    (dbl)     (dbl)
## 1     4 26.66364  105.1364
## 2     6 19.74286  183.3143
## 3     8 15.10000  353.1000
When we apply one function to many variables, the use of summarise_each() provides a more compact and tidy notation:
# without group
mtcars %>% summarise_each(funs(mean), mpg, disp)
## Source: local data frame [1 x 2]
##
##        mpg     disp
##      (dbl)    (dbl)
## 1 20.09062 230.7219
# with group
mtcars %>% group_by(cyl) %>% summarise_each(funs(mean), mpg, disp)
## Source: local data frame [3 x 3]
##
##     cyl      mpg     disp
##   (dbl)    (dbl)    (dbl)
## 1     4 26.66364 105.1364
## 2     6 19.74286 183.3143
## 3     8 15.10000 353.1000
The names of the output variables are given by the names of the variables: mpg and disp. In this case we lose track of the name of the function applied to the variables: mean(). We would probably prefer something like mean_mpg and mean_disp. To achieve this result we have to rename the variables we pass to ... within summarise_each():
# without group
mtcars %>% summarise_each(funs(mean), mean_mpg = mpg, mean_disp = disp)
## Source: local data frame [1 x 2]
##
##   mean_mpg mean_disp
##      (dbl)     (dbl)
## 1 20.09062  230.7219
# with group
mtcars %>% group_by(cyl) %>% summarise_each(funs(mean), mean_mpg = mpg, mean_disp = disp)
## Source: local data frame [3 x 3]
##
##     cyl mean_mpg mean_disp
##   (dbl)    (dbl)     (dbl)
## 1     4 26.66364  105.1364
## 2     6 19.74286  183.3143
## 3     8 15.10000  353.1000
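For comparison, Case 3 can be sketched without summarise_each() on newer dplyr versions. This assumes dplyr 1.0.0 or later and is not taken from the original article; the object name case3 is hypothetical. With a single function, a fixed prefix in .names gives the mean_ naming directly.

```r
library(dplyr)

# Assumption: dplyr >= 1.0.0. One function (mean) applied to two
# variables; .names prefixes each variable name with "mean_".
case3 <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, disp), mean, .names = "mean_{.col}"))

case3
```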
Case 4: apply many functions to many variables
As in the previous cases, both summarise() and summarise_each() are valid alternatives.
The function summarise() again has a more intuitive syntax, and the names of the output variables can be specified in the usual simple form: max_mpg = max(mpg)
# without group
mtcars %>%
  summarise(min_mpg = min(mpg), min_disp = min(disp),
            max_mpg = max(mpg), max_disp = max(disp))
## Source: local data frame [1 x 4]
##
##   min_mpg min_disp max_mpg max_disp
##     (dbl)    (dbl)   (dbl)    (dbl)
## 1    10.4     71.1    33.9      472
# with a single group
mtcars %>%
  group_by(cyl) %>%
  summarise(min_mpg = min(mpg), min_disp = min(disp),
            max_mpg = max(mpg), max_disp = max(disp))
## Source: local data frame [3 x 5]
##
##     cyl min_mpg min_disp max_mpg max_disp
##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
## 1     4    21.4     71.1    33.9    146.7
## 2     6    17.8    145.0    21.4    258.0
## 3     8    10.4    275.8    19.2    472.0
When we apply many functions to many variables, the use of summarise_each() provides a more compact and tidy notation:
# without group
mtcars %>% summarise_each(funs(min, max), mpg, disp)
## Source: local data frame [1 x 4]
##
##   mpg_min disp_min mpg_max disp_max
##     (dbl)    (dbl)   (dbl)    (dbl)
## 1    10.4     71.1    33.9      472
# with a single group
mtcars %>% group_by(cyl) %>% summarise_each(funs(min, max), mpg, disp)
## Source: local data frame [3 x 5]
##
##     cyl mpg_min disp_min mpg_max disp_max
##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
## 1     4    21.4     71.1    33.9    146.7
## 2     6    17.8    145.0    21.4    258.0
## 3     8    10.4    275.8    19.2    472.0
The names of the output variables follow the notation variable_function, i.e. mpg_min, disp_min, etc.
Naming output variables with a different notation, e.g. function_variable, does not appear to be possible within the call to summarise_each().
This goal has to be achieved with a separate instruction:
# without group
mtcars %>%
  summarise_each(funs(min, max), mpg, disp) %>%
  setNames(c("min_mpg", "min_disp", "max_mpg", "max_disp"))
## Source: local data frame [1 x 4]
##
##   min_mpg min_disp max_mpg max_disp
##     (dbl)    (dbl)   (dbl)    (dbl)
## 1    10.4     71.1    33.9      472
# with group
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(min, max), mpg, disp) %>%
  setNames(c("cyl", "min_mpg", "min_disp", "max_mpg", "max_disp"))
## Source: local data frame [3 x 5]
##
##     cyl min_mpg min_disp max_mpg max_disp
##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
## 1     4    21.4     71.1    33.9    146.7
## 2     6    17.8    145.0    21.4    258.0
## 3     8    10.4    275.8    19.2    472.0
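The limitation above concerns summarise_each() itself. On newer dplyr versions (an assumption: 1.0.0 or later; this goes beyond the original article), function_variable names can be requested in the same call through across() and its .names argument, so no separate setNames() step is needed. A minimal sketch, with the hypothetical object name case4:

```r
library(dplyr)

# Assumption: dplyr >= 1.0.0. Two functions applied to two variables;
# "{.fn}_{.col}" puts the function name first, e.g. min_mpg.
case4 <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, disp),
                   list(min = min, max = max),
                   .names = "{.fn}_{.col}"))

case4
```

Because the names are derived from the functions and variables rather than typed by position, this avoids the risk of mislabelling a column when the vector passed to setNames() is written by hand.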
Conclusions
When using functions that return results of length one, we have two candidate verbs:
- summarise()
- summarise_each()
The function summarise() has a simpler syntax, while summarise_each() has a more compact notation.
As a consequence, summarise() seems more appropriate when dealing with a single variable or a single function. As the number of variables or functions increases, summarise_each() becomes the better choice.
The function summarise_each() has its own way of assigning names to the output variables:
Case 2: apply many functions to one variable
The names of the output variables are given by the names of the functions. In this case we lose the name of the variable the function is applied to.
Case 3: apply one function to many variables
The names of the output variables are given by the names of the variables. In this case we lose track of the name of the function applied to the variables.
Case 4: apply many functions to many variables
The names of the output variables follow the notation variable_function. Naming output variables with a different notation does not appear to be possible within the call to summarise_each().
The post Aggregation with dplyr: summarise and summarise_each appeared first on MilanoR.