How to use the group_by function with your ecological data
In scientific data and experiments, we often have groups of subjects between which we want to compare an observed response. For example, we might want to compare the growth rates of plants under different light treatments. Or maybe we want to compare CO₂ emissions of different countries over time. Each of these scenarios requires you to group your data based on a certain variable before you can compare any kind of statistic such as mean, minimum, or maximum.
In this tutorial, I’m going to discuss how to use a handy function called group_by(), which allows you to do what I just described.
group_by() is part of the dplyr package, so we’ll load that up first. Remember that if you haven’t used or installed the package before, you need to run install.packages("dplyr") before loading it in your script. Let’s also load up a data set that comes with R, called Loblolly.
# Load package
library(dplyr)

# Load data
data(Loblolly)

# View data
head(Loblolly)
##    height age Seed
## 1    4.51   3  301
## 15  10.89   5  301
## 29  28.72  10  301
## 43  41.74  15  301
## 57  52.70  20  301
## 71  60.92  25  301
Loblolly describes the height of loblolly pine trees at different ages. “height” is given in feet, “age” is given in years, and “Seed” records the seed source of each tree, which we’ll treat as a unique identifier for each tree.
How to use group_by() and summarise()
Let’s say we want to see the average height of loblolly pine trees within each of the age groups. To do that, we need to group our data by the variable “age”. We use the group_by() function like this: group_by(data, column).
# Group the Loblolly data by tree age
group_by(Loblolly, age)
## # A tibble: 84 × 3
## # Groups:   age [6]
##    height   age Seed
##     <dbl> <dbl> <ord>
##  1   4.51     3 301
##  2  10.9      5 301
##  3  28.7     10 301
##  4  41.7     15 301
##  5  52.7     20 301
##  6  60.9     25 301
##  7   4.55     3 303
##  8  10.9      5 303
##  9  29.1     10 303
## 10  42.8     15 303
## # … with 74 more rows
When we do this, our data look almost the same: the only visible difference is the Groups: age [6] label at the top of the table. But behind the scenes, R makes note of how we want to group our data and returns a table that is grouped accordingly. After grouping the data, we can apply functions that calculate summary statistics within each group using the function summarise(), or summarize() (both spellings work, so use British or American English as you prefer).
summarise() can be used like so: summarise(data, new_column_name = function(column_to_evaluate)).
So if we wanted to summarize mean heights of trees, it would look like summarise(Loblolly, avgheight = mean(height)).
# Group the Loblolly data by tree age and then summarize the mean, min, and max heights in each group
group_by(Loblolly, age) %>%
  summarise(avgheight = mean(height),
            minheight = min(height),
            maxheight = max(height))
## # A tibble: 6 × 4
##     age avgheight minheight maxheight
##   <dbl>     <dbl>     <dbl>     <dbl>
## 1     3      4.24      3.46      4.81
## 2     5     10.2       9.03     11.4
## 3    10     27.4      25.4      30.2
## 4    15     40.5      37.8      44.4
## 5    20     51.5      48.3      55.8
## 6    25     60.3      56.4      64.1
In essence, summarise() produces a new table that contains a column for your group, and then new columns of summary statistics that you define. In the code above, I asked summarise() to create new columns called “avgheight” for the mean height of trees in each age group, “minheight” for the minimum, and “maxheight” for the maximum. summarise() also drops the last grouping variable from its result, so because we grouped by a single variable, dplyr automatically returns our output ungrouped.
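If you want to check the grouping for yourself, the short sketch below may help. It assumes dplyr is installed and uses the helper functions group_vars(), n_groups(), and is_grouped_df() from dplyr; the object names by_age and age_summary are just ones I made up for illustration.

```r
library(dplyr)

data(Loblolly)

# Group by age, then look at what dplyr has recorded about the grouping
by_age <- group_by(Loblolly, age)
group_vars(by_age)     # name(s) of the grouping column(s)
n_groups(by_age)       # number of distinct groups
is_grouped_df(by_age)  # is the grouping still active?

# Summarise, then check again: with a single grouping variable,
# the result should come back with no grouping left
age_summary <- summarise(by_age, avgheight = mean(height))
is_grouped_df(age_summary)
```

Running the two is_grouped_df() calls side by side is a quick way to confirm when a table is still grouped before you pass it on to other functions.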
You might be wondering about this guy %>% in the code above. This operator is called a pipe, and it comes loaded with the dplyr package. Importantly, this pipe doesn’t come with base R (R 4.1.0 and later has its own native pipe, |>, but %>% still requires a package). For now, what you need to know about pipes is that they feed the output of one statement into the input of another. In the code above, the grouped table that came out of group_by() was passed into the first argument of summarise(), so there was no need for me to name the data again inside the summarise() call. In plain English, I asked the code to “group the Loblolly data by tree age, and then (pipe!) summarize those groups using their mean, min, and max”.
Pipes can make your code a lot cleaner, especially if you’re performing several operations on one data frame. Don’t worry, we have a more comprehensive tutorial post on pipes coming up soon.
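To make the idea concrete, here is a small sketch comparing a nested call with a piped one. It only uses functions already shown in this post; the object names nested and piped are hypothetical, and starting the pipe from Loblolly itself is just one common way to write it.

```r
library(dplyr)

data(Loblolly)

# Nested call: read from the inside out
nested <- summarise(group_by(Loblolly, age), avgheight = mean(height))

# Piped call: read from top to bottom
piped <- Loblolly %>%
  group_by(age) %>%
  summarise(avgheight = mean(height))

# Compare the two results
all.equal(nested, piped)
```

Both versions do the same work; the piped one simply lists the steps in the order they happen.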
group_by() and other dplyr functions
We just went over the summarise() function, which is one of the most common dplyr functions to use with group_by(). But you could also use other dplyr functions such as mutate() and filter().
mutate()
For example, we could once again group our data by age, and then we could use mutate() to create a new column for mean height.
# Group the Loblolly data by age and create a new column for average height by age group
group_by(Loblolly, age) %>%
  mutate(age_avgheight = mean(height))
## # A tibble: 84 × 4
## # Groups:   age [6]
##    height   age Seed  age_avgheight
##     <dbl> <dbl> <ord>         <dbl>
##  1   4.51     3 301            4.24
##  2  10.9      5 301           10.2
##  3  28.7     10 301           27.4
##  4  41.7     15 301           40.5
##  5  52.7     20 301           51.5
##  6  60.9     25 301           60.3
##  7   4.55     3 303            4.24
##  8  10.9      5 303           10.2
##  9  29.1     10 303           27.4
## 10  42.8     15 303           40.5
## # … with 74 more rows
This essentially did the same calculation as summarise(), but instead of creating a new table, mutate() just added this “age_avgheight” column to the original data set. You can see that for trees of the same age, the “age_avgheight” value is the same. This makes sense, since we grouped the data by age before taking the mean, and there should only be one mean height for each age group.
For functions like mutate() and filter() where we might want to keep working on the same data set afterwards, we need to ungroup() the data after grouping it so that the grouping doesn’t affect other functions down the line. I’ll demonstrate quickly:
# Demonstrating ungrouping data and mutating a new column for average height
group_by(Loblolly, age) %>%
  mutate(age_avgheight = mean(height)) %>%
  ungroup() %>%
  mutate(all_avgheight = mean(height))
## # A tibble: 84 × 5
##    height   age Seed  age_avgheight all_avgheight
##     <dbl> <dbl> <ord>         <dbl>         <dbl>
##  1   4.51     3 301            4.24          32.4
##  2  10.9      5 301           10.2           32.4
##  3  28.7     10 301           27.4           32.4
##  4  41.7     15 301           40.5           32.4
##  5  52.7     20 301           51.5           32.4
##  6  60.9     25 301           60.3           32.4
##  7   4.55     3 303            4.24          32.4
##  8  10.9      5 303           10.2           32.4
##  9  29.1     10 303           27.4           32.4
## 10  42.8     15 303           40.5           32.4
## # … with 74 more rows
After I ungrouped the data, I used mutate() to create a new column for average height again. But this time, because the data are ungrouped, the “all_avgheight” column just contains the average height of all trees in the data set rather than by age group.
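To see what goes wrong if you forget ungroup(), here is a side-by-side sketch of the same pipeline with and without it. The object names still_grouped and ungrouped are hypothetical; everything else comes from the code above.

```r
library(dplyr)

data(Loblolly)

# Without ungroup(): the second mean() is still computed within each age group
still_grouped <- Loblolly %>%
  group_by(age) %>%
  mutate(age_avgheight = mean(height)) %>%
  mutate(all_avgheight = mean(height))

# With ungroup(): the second mean() uses every row in the data set
ungrouped <- Loblolly %>%
  group_by(age) %>%
  mutate(age_avgheight = mean(height)) %>%
  ungroup() %>%
  mutate(all_avgheight = mean(height))

# In the still-grouped version, the "overall" column just repeats the group mean
all(still_grouped$all_avgheight == still_grouped$age_avgheight)

# In the ungrouped version, there is a single overall mean
unique(ungrouped$all_avgheight)
```

No error or warning appears in the first version, which is exactly why a forgotten ungroup() can be hard to spot.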
filter()
For the filter() example, I’m going to remove a few rows of data from the Loblolly data set so that we can more clearly see the effect of the filter. If you want to follow along, you can copy and paste the following code into your script:
# Remove some rows at random (sort of)
Loblolly <- Loblolly[-c(1, 2, 3, 4, 9, 10, 11, 17, 18, 22, 29, 30,
                        34, 35, 47, 55, 56, 70, 82, 83), ]
Now let’s see how to use filter() with group_by(). In the original data set, we have 6 age classes for each tree: 3, 5, 10, 15, 20, and 25. But because I removed several rows of data, some trees are now missing measurements at certain ages (e.g., trees 301 and 303).
# Look at age classes
sort(unique(Loblolly$age))
## [1]  3  5 10 15 20 25

# View modified data
head(Loblolly, 10)
##    height age Seed
## 57  52.70  20  301
## 71  60.92  25  301
## 2    4.55   3  303
## 16  10.92   5  303
## 72  63.39  25  303
## 3    4.79   3  305
## 17  11.37   5  305
## 31  30.21  10  305
## 45  44.40  15  305
## 4    3.91   3  307
Let’s say our data analysis requires that we have at least 5 age classes for each tree. In that case, we’ll have to eliminate all trees that were measured at fewer than 5 ages. We can use group_by() to group by Seed (the individual tree), then use filter() to only include data that are in a group of at least 5. The function n() will help us count the number of rows in each group.
# Filtering to include groups of at least 5
group_by(Loblolly, Seed) %>%
  filter(n() >= 5) %>%
  ungroup()
## # A tibble: 39 × 3
##    height   age Seed
##     <dbl> <dbl> <ord>
##  1   3.91     3 307
##  2   9.48     5 307
##  3  25.7     10 307
##  4  50.8     20 307
##  5  59.1     25 307
##  6   4.32     3 315
##  7  10.4      5 315
##  8  27.2     10 315
##  9  40.8     15 315
## 10  51.3     20 315
## # … with 29 more rows
We see that the data set is greatly reduced, and trees like 301 and 303 have been removed because they have fewer than 5 age classes. We can also run the opposite filter and only include data that are in a group of fewer than 5.
# Filtering to include groups of less than 5
group_by(Loblolly, Seed) %>%
  filter(n() < 5) %>%
  ungroup()
## # A tibble: 25 × 3
##    height   age Seed
##     <dbl> <dbl> <ord>
##  1  52.7     20 301
##  2  60.9     25 301
##  3   4.55     3 303
##  4  10.9      5 303
##  5  63.4     25 303
##  6   4.79     3 305
##  7  11.4      5 305
##  8  30.2     10 305
##  9  44.4     15 305
## 10   4.81     3 309
## # … with 15 more rows
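If you’d like to see which trees land on each side of that threshold, one option is to count the rows in each group with summarise() and n(). This is only a sketch: it repeats the row removal from earlier so that it runs on its own, and the names rows_per_tree and n_ages are hypothetical.

```r
library(dplyr)

data(Loblolly)

# Same rows removed as in the filter() example above
Loblolly <- Loblolly[-c(1, 2, 3, 4, 9, 10, 11, 17, 18, 22, 29, 30,
                        34, 35, 47, 55, 56, 70, 82, 83), ]

# Count the rows (age classes) remaining for each tree
# n_ages is a hypothetical column name
rows_per_tree <- Loblolly %>%
  group_by(Seed) %>%
  summarise(n_ages = n())

rows_per_tree
```

Trees with an n_ages of 5 or 6 are the ones kept by filter(n() >= 5), and the rest are the ones kept by filter(n() < 5).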
Great! Now you’ve learned how to use the group_by() function along with several of the main dplyr functions: summarise(), mutate(), and filter(). I covered just a few ways you might use these functions; it’s up to you to play around with them and learn even more. And don’t forget to use ungroup()!
Also be sure to check out R-bloggers for other great tutorials on learning R