Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This is an issue I often face, so I thought it best to write it down. When doing data analysis, we often want to known how many observations there are in each subgroup. These subgroups can be defined by multiple variables. In the code example below, I want to know how many vehicles there are for each (cyl, gear) combination:
library(tidyverse)
data(mtcars)
mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n())
# # A tibble: 8 x 3
# # Groups:   cyl [3]
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     5     2
If you look carefully, you will notice that there are no vehicles with cyl == 8 and gear == 4. In general it’s probably better to include this combination as a row in the tibble, with count as 0. This is especially important in data pipelines where future processes might expect there to be length(unique(cyl)) * length(unique(gear)) rows in the dataset.
We can achieve this by ungrouping the dataset and applying tidyr::complete(). This ensures that every possible (cyl, gear) combination gets a row.
mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    complete(cyl, gear)
# # A tibble: 9 x 3
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     4    NA
# 9     8     5     2
For rows that didn’t appear in the original summary table, complete() fills up the remaining columns with NA. We can specify the value complete() should use to fill in these cells with the fill option:
mtcars %>%
    group_by(cyl, gear) %>%
    summarize(count = n()) %>%
    ungroup() %>%
    complete(cyl, gear, fill = list(count = 0))
# # A tibble: 9 x 3
#     cyl  gear count
#   <dbl> <dbl> <int>
# 1     4     3     1
# 2     4     4     8
# 3     4     5     2
# 4     6     3     2
# 5     6     4     4
# 6     6     5     1
# 7     8     3    12
# 8     8     4     0
# 9     8     5     2
References:
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
