How to Aggregate Data in R

Jake Hoare

4 years ago

[This article was first published on R – Displayr, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The process involves two stages. First, collate individual cases of raw data together with a grouping variable. Second, perform which calculation you want on each group of cases. These two stages are wrapped into a single function.

To perform aggregation, we need to specify three things in the code:

The data that we want to aggregate
The variable to group by within the data
The calculation to apply to the groups (what you want to find out)

Example data

The raw data shown below consists of one row per case. Each case is an employee at a restaurant.

Load the example data by running the following R code:

 
library(flipAPI)
data = DownloadXLSX("https://wiki.q-researchsoftware.com/images/1/1b/Aggregation_data.xlsx", want.row.names = FALSE, want.data.frame = TRUE)

Perform aggregation with the following R code.

agg = aggregate(data,
                by = list(data$Role),
                FUN = mean)

This produces a table of the average salary and age by role, as below.

The aggregate function

The first argument to the function is usually a data.frame.

The by argument is a list of variables to group by. This must be a list even if there is only one variable, as in the example.

The FUN argument is the function which is applied to all columns (i.e., variables) in the grouped data. Because we cannot calculate the average of categorical variables such as Name and Shift, they result in empty columns, which I have removed for clarity.

Other aggregation functions

Any function that can be applied to a numeric variable can be used within aggregate. Maximum, minimum, count, standard deviation and sum are all popular.

For more specific purposes, it is also possible to write your own function in R and refer to that within aggregate. I’ve demonstrated this below where the second largest value of each group is returned, or the largest if the group has only one case. Note also that the groups are formed by Role and by Shift together.

second = function(x) {
            if (length(x) == 1)
                return(x)
            return(sort(x, decreasing = TRUE)[2])}

agg = aggregate(data,
                by = list(data$Role, data$Shift),
                FUN = second)

Additional features

The aggregate function has a few more features to be aware of:

Grouping variable(s) and variables to be aggregated can be specified with R’s formula notation.
Setting drop = TRUE means that any groups with zero count are removed.
na.action controls the treatment of missing values within the data.

TRY IT OUT
The analysis in this post was performed in Displayr using R. You can repeat or amend this analysis for yourself in Displayr.

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.