
How to avoid empty line charts

[This article was first published on Albert Rapp, and kindly contributed to R-bloggers.]

    Have you ever tried to create a line chart with ggplot, only to find that the chart stays blank and a warning tells you that each group consists of only one observation? That’s a common struggle, and I used to go through a lot of trial and error to fix it. In today’s blog post, we’re going to look at what’s going on when this warning appears and how to fix it like a pro. As always, you can find a video version of this blog post on YouTube.


    Generate some fake data

    First, what we need is data to work with. Here, I’m just going to simulate a bit of random data. And at the end of the day, the exact data isn’t that important. Just know that all of the data sets I use in this blog post are simulated.

    You can find the code for that simulation in this folded code chunk. Feel free to check that out if you’re curious.

    library(tidyverse)
    
    
    
    set.seed(234)
    dat <- tibble(
      age = runif(10000, 1, 100) |> round(),
      age_group = case_when(
        age < 18 ~ '<18',
        between(age, 18, 30) ~ '18 - 30',
        between(age, 30, 40) ~ '30 - 40',
        between(age, 40, 50) ~ '40 - 50',
        between(age, 50, 60) ~ '50 - 60',
        between(age, 60, 70) ~ '60 - 70',
        TRUE ~ '>70'
      ),
      value = 50^2 - (age - 50) ^2 + rnorm(10000, mean = 0, sd = 100),
      line = sample(LETTERS[1:5], 10000, replace = TRUE),
      fruit = sample(fruit[1:3], 10000, replace = TRUE)
    ) 
    
    mean_value_by_age <- dat |> 
      summarise(
        mean_value = mean(value),
        .by = age
      )
    
    mean_value_by_age_group <- dat |> 
      summarise(
        mean_value = mean(value),
        .by = age_group
      )
    
    
    mean_value_by_age_group_and_line <- dat |> 
      summarise(
        mean_value = mean(value),
        .by = c(age_group, line)
      )
    
    mean_value_by_age_group_and_line_and_fruit <- dat |> 
      summarise(
        mean_value = mean(value),
        .by = c(age_group, line, fruit)
      )

    Also let us set the stage for ggplot by setting a nicer default theme:

    theme_set(
      theme_minimal(
        base_family = 'Source Sans Pro',
        base_size = 16
      ) +
        theme(
          panel.grid.minor = element_blank()
        )
    )

    A line chart where everything works

    Let us first look at the mean_value_by_age data. It has a numeric column age and a numeric column mean_value.

    mean_value_by_age
    ## # A tibble: 100 × 2
    ##      age mean_value
    ##    <dbl>      <dbl>
    ##  1    75      1874.
    ##  2    78      1719.
    ##  3     3       296.
    ##  4     8       715.
    ##  5    65      2297.
    ##  6    93       646.
    ##  7    72      1998.
    ##  8    29      2071.
    ##  9    56      2465.
    ## 10    55      2481.
    ## # ℹ 90 more rows

    In the following examples, we want to create line charts with mean_value on the y-axis. This works smoothly when the variable on the x-axis is numeric.

    mean_value_by_age |> 
      ggplot(aes(x = age, y = mean_value)) +
      geom_line()


    A classical example where the plot remains blank

    But now look at the data set mean_value_by_age_group. Instead of a numeric variable age it now uses a character vector age_group.

    mean_value_by_age_group
    ## # A tibble: 7 × 2
    ##   age_group mean_value
    ##   <chr>          <dbl>
    ## 1 >70            1187.
    ## 2 <18             811.
    ## 3 60 - 70        2253.
    ## 4 18 - 30        1812.
    ## 5 50 - 60        2463.
    ## 6 40 - 50        2469.
    ## 7 30 - 40        2287.

    Typically, I’d try to create a bar chart from this.

    But for this blog post let’s see what happens when we want to create a similar chart as before but with age_group instead of age on the x-axis.

    mean_value_by_age_group |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line()
    ## `geom_line()`: Each group consists of only one observation.
    ## ℹ Do you need to adjust the group aesthetic?

    Oh no. That didn’t work particularly well. Unfortunately, geom_line() needs you to be very specific when the x-axis is not a numeric variable. You will need to tell geom_line() which points belong together across the x-axis.
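    As a rough way to see what is going on, you can inspect the group that ggplot assigns to each row with layer_data(). The following is only a sketch on a tiny, made-up data frame (the names toy and age_mid are assumptions for illustration, not part of the data above):

```r
library(ggplot2)

# Hypothetical toy data: one mean value per age group,
# plus a made-up numeric midpoint for comparison
toy <- data.frame(
  age_group = c('<18', '18 - 30', '>30'),
  age_mid = c(10, 24, 60),
  mean_value = c(800, 1800, 1200)
)

p_numeric <- ggplot(toy, aes(x = age_mid, y = mean_value)) + geom_line()
p_discrete <- ggplot(toy, aes(x = age_group, y = mean_value)) + geom_line()

# layer_data() returns the data as the layer sees it, including a group column
groups_numeric <- unique(layer_data(p_numeric)$group)
groups_discrete <- unique(layer_data(p_discrete)$group)

length(groups_numeric)   # one shared group for the numeric x-axis
length(groups_discrete)  # one group per category for the discrete x-axis
```

    With the numeric x-axis, all rows share a single group. With the character x-axis, every category ends up in its own group, so there is no second point to draw a line to.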


    Setting the group aesthetic

    With numeric variables like age, there’s a natural order and geom_line() treats all the values as part of the same continuum. But with a discrete variable like age_group, ggplot treats every category as its own group by default, so each group contains a single point and there is nothing to connect. That’s why you have to tell geom_line() that all of the points can be connected across the x-axis. That’s where the group aesthetic comes in. Just map it to the same string for all observations.

    mean_value_by_age_group |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(group = ''))

    Cool. That worked pretty nicely. But notice that the categories on the x-axis are not in a natural order. Here, ggplot just sorts the character values alphabetically. We can change that by hard-coding a new order with the factor() function.

    mean_value_by_age_group |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        )
      ) |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(group = ''))
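    To make the effect of factor() a bit more tangible, here is a minimal base-R sketch on a made-up vector (the names and the label '70+' are assumptions for illustration). It shows that the level order is whatever you spell out, and that labels missing from the level vector become NA:

```r
# Hypothetical toy vector of age-group labels in arbitrary order
age_group <- c('>70', '<18', '30 - 40', '18 - 30')

# Desired display order, spelled out by hand
wanted_order <- c('<18', '18 - 30', '30 - 40', '>70')

age_group_fct <- factor(age_group, levels = wanted_order)

# The levels follow the hand-written order, not the alphabet
levels(age_group_fct)

# The integer codes refer to positions in that order
codes <- as.integer(age_group_fct)

# A label that is not among the levels turns into NA
unknown_is_na <- is.na(factor('70+', levels = wanted_order))
```

    The last line is worth keeping in mind: a typo in the hard-coded level vector does not raise an error, it just drops that category as NA.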


    Groupings and multiple lines

    Now, what if you wanted to have multiple lines? Imagine you have a new data set that also has a column line in it.

    mean_value_by_age_group_and_line
    ## # A tibble: 35 × 3
    ##    age_group line  mean_value
    ##    <chr>     <chr>      <dbl>
    ##  1 >70       E          1189.
    ##  2 >70       B          1191.
    ##  3 <18       B           827.
    ##  4 >70       D          1195.
    ##  5 60 - 70   B          2256.
    ##  6 >70       C          1187.
    ##  7 18 - 30   E          1829.
    ##  8 50 - 60   E          2457.
    ##  9 50 - 60   B          2468.
    ## 10 <18       A           821.
    ## # ℹ 25 more rows

    Here’s how our previous code would look if we just replaced the data set.

    mean_value_by_age_group_and_line |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        )
      ) |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(group = ''))

    Notice how we only get one jagged line. The thing is, we didn’t tell geom_line() which observations should form separate lines. Instead, group = '' still says that all the observations belong to the same line, so geom_line() does its best and connects all of them.

    Instead, we can map the group aesthetic to our new column line.

    mean_value_by_age_group_and_line |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        )
      ) |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(group = line))

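    Nice, we get separate lines now. Now, imagine that we had mapped line to the color aesthetic instead. It’s easy to think that geom_line() would understand that all the points mapped to the same color also belong to the same line. This is not the case here, though: without an explicit group, ggplot groups by the combination of all discrete variables in the plot, and that includes age_group on the x-axis.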

    mean_value_by_age_group_and_line |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        )
      ) |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(color = line))
    ## `geom_line()`: Each group consists of only one observation.
    ## ℹ Do you need to adjust the group aesthetic?

    Once again, we are left with an empty chart (albeit with a legend now) and a warning message. To get separate lines and colors, we have to map line to both the color and the group aesthetic.

    mean_value_by_age_group_and_line |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        )
      ) |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(group = line, color = line))
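    If you want to check the difference between the two mappings without looking at a chart, one option is to count the groups that each version produces. This is a sketch on a small made-up data frame (toy and its values are assumptions for illustration):

```r
library(ggplot2)

# Hypothetical toy data: 2 age groups x 2 lines
toy <- expand.grid(
  age_group = c('<18', '18 - 30'),
  line = c('A', 'B'),
  stringsAsFactors = FALSE
)
toy$mean_value <- c(800, 1800, 900, 1900)

base_plot <- ggplot(toy, aes(x = age_group, y = mean_value))

# Color only: the default grouping combines age_group and line
n_groups_color_only <- length(unique(
  layer_data(base_plot + geom_line(aes(color = line)))$group
))

# Color and group: one group per value of line
n_groups_color_and_group <- length(unique(
  layer_data(base_plot + geom_line(aes(group = line, color = line)))$group
))

n_groups_color_only       # as many groups as rows
n_groups_color_and_group  # as many groups as lines
```

    In the color-only version, every row sits in its own group, which is exactly the situation the warning describes.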

    Cool beans! That’s exactly what we want. Now, what if we had another data set with yet another column that could be used to differentiate lines? Here’s such a data set.

    mean_value_by_age_group_and_line_and_fruit
    ## # A tibble: 105 × 4
    ##    age_group line  fruit   mean_value
    ##    <chr>     <chr> <chr>        <dbl>
    ##  1 >70       E     apricot      1232.
    ##  2 >70       B     apricot      1177.
    ##  3 <18       B     apricot       871.
    ##  4 >70       D     apricot      1208.
    ##  5 <18       B     apple         799.
    ##  6 60 - 70   B     apricot      2249.
    ##  7 >70       D     avocado      1147.
    ##  8 >70       C     apricot      1192.
    ##  9 >70       C     avocado      1194.
    ## 10 18 - 30   E     avocado      1806.
    ## # ℹ 95 more rows

    Notice that there is another column fruit now. If we were to just throw the same ggplot code as before at this data set, you can probably guess what will happen.

    mean_value_by_age_group_and_line_and_fruit |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        )
      ) |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(group = line, color = line))

    That’s right. We get jagged lines again. Once again, we haven’t told geom_line() how all the groups should be separated and it does its thing to connect all the dots. The same thing happens when we map fruit to the group aesthetic now.

    mean_value_by_age_group_and_line_and_fruit |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        )
      ) |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(group = fruit, color = line))
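    A quick check that can reveal an under-specified grouping before you even plot is to count the rows per group and x-axis category. The sketch below uses a made-up data frame (toy is an assumption for illustration). Whenever a cell holds more than one row, geom_line() has several points at the same x position within one line, which is what produces the zig-zag:

```r
library(dplyr)

# Hypothetical toy data: 2 age groups x 2 lines x 2 fruits
toy <- expand.grid(
  age_group = c('<18', '18 - 30'),
  line = c('A', 'B'),
  fruit = c('apple', 'apricot'),
  stringsAsFactors = FALSE
)

# Grouping by line only: each cell holds one row per fruit
rows_per_cell_line <- toy |> count(line, age_group)

# Grouping by line and fruit: exactly one row per cell
rows_per_cell_both <- toy |> count(line, fruit, age_group)

max(rows_per_cell_line$n)  # more than 1, so lines would be jagged
max(rows_per_cell_both$n)  # 1, so every line has one point per x value
```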


    Using interaction() to get the grouping right

    In both attempts, we told geom_line() about only one of the two grouping variables. To separate the lines properly, it needs to know about both, and we can do that with the help of the interaction() function. If we want to keep only one color per letter in the column line, we leave the color aesthetic as is and tell geom_line() that the groups are formed by the interaction between the two columns fruit and line.

    mean_value_by_age_group_and_line_and_fruit |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        )
      ) |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(group = interaction(fruit, line), color = line))

    Beautiful! This still leaves us with 5 colors but gives us a separate line for each fruit. In case you’re wondering what interaction() does, it’s instructive to just create a new column in the data set. That way, we can look at exactly what interaction() calculates.

    mean_value_by_age_group_and_line_and_fruit |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        ),
        interaction = interaction(fruit, line)
      )
    ## # A tibble: 105 × 5
    ##    age_group line  fruit   mean_value interaction
    ##    <fct>     <chr> <chr>        <dbl> <fct>      
    ##  1 >70       E     apricot      1232. apricot.E  
    ##  2 >70       B     apricot      1177. apricot.B  
    ##  3 <18       B     apricot       871. apricot.B  
    ##  4 >70       D     apricot      1208. apricot.D  
    ##  5 <18       B     apple         799. apple.B    
    ##  6 60 - 70   B     apricot      2249. apricot.B  
    ##  7 >70       D     avocado      1147. avocado.D  
    ##  8 >70       C     apricot      1192. apricot.C  
    ##  9 >70       C     avocado      1194. avocado.C  
    ## 10 18 - 30   E     avocado      1806. avocado.E  
    ## # ℹ 95 more rows

    As you can see, interaction() does nothing too fancy. All it does is create a factor whose labels combine the values from the two columns fruit and line. If we wanted to, we could map this new column to the color aesthetic. That way, we would get one color for each of the combinations.

    mean_value_by_age_group_and_line_and_fruit |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        ),
        interaction = interaction(fruit, line)
      ) |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(color = interaction))
    ## `geom_line()`: Each group consists of only one observation.
    ## ℹ Do you need to adjust the group aesthetic?

    But as always, we have to tell geom_line() how to connect points to form a line. In this example, we additionally have to map interaction to the group aesthetic.

    mean_value_by_age_group_and_line_and_fruit |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        ),
        interaction = interaction(fruit, line)
      ) |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(color = interaction, group = interaction))
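    If you want to play with interaction() outside of a plot, here is a small base-R sketch on made-up vectors (the values are assumptions for illustration). It also shows two details that are easy to miss: by default the result keeps levels for combinations that never occur, and the labels are the same as what paste() with a dot separator would give you:

```r
# Hypothetical toy vectors
fruit <- c('apple', 'apricot', 'apple')
line <- c('A', 'A', 'B')

combo <- interaction(fruit, line)
as.character(combo)  # "apple.A" "apricot.A" "apple.B"

# All 2 x 2 combinations are levels, even the unused "apricot.B"
nlevels(combo)

# drop = TRUE keeps only the combinations that actually occur
combo_used <- interaction(fruit, line, drop = TRUE)
nlevels(combo_used)

# The labels match a plain paste() with '.' as separator
same_labels <- identical(as.character(combo), paste(fruit, line, sep = '.'))
```

    For the group aesthetic, the unused levels don’t matter, since only combinations that appear in the data can form a line.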


    Why groups are useful

    Now, you may wonder why geom_line() is so complicated for this simple thing. The reason is probably best explained with an example. Let’s go back to our first data set with only the columns age_group and mean_value.

    mean_value_by_age_group |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        )
      )
    ## # A tibble: 7 × 2
    ##   age_group mean_value
    ##   <fct>          <dbl>
    ## 1 >70            1187.
    ## 2 <18             811.
    ## 3 60 - 70        2253.
    ## 4 18 - 30        1812.
    ## 5 50 - 60        2463.
    ## 6 40 - 50        2469.
    ## 7 30 - 40        2287.

    Imagine that you want to draw one single line that consists of two colors, for example one color for people under 18 or over 60 and another color for everyone in between.

    If the color aesthetic also determined how points are connected, this would be impossible. After all, we have to color the line at two spatially disconnected positions. Hence, another aesthetic is needed. And that’s why you have group to save the day.

    mean_value_by_age_group |> 
      mutate(
        age_group = factor(
          age_group,
          c('<18', '18 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '>70')
        )
      ) |> 
      ggplot(aes(x = age_group, y = mean_value)) +
      geom_line(aes(group = '', color = age_group %in% c('<18', '60 - 70',  '>70'))) +
      labs(color = 'Under 18 or over 60')
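    To convince yourself that this really is one line with two colors, you can again peek at the layer data. The following sketch uses a made-up three-row data frame (toy is an assumption for illustration) and counts groups and colors separately:

```r
library(ggplot2)

# Hypothetical toy data with an explicitly ordered factor on the x-axis
toy <- data.frame(
  age_group = factor(
    c('<18', '18 - 30', '>70'),
    levels = c('<18', '18 - 30', '>70')
  ),
  mean_value = c(800, 1800, 1200)
)

p <- ggplot(toy, aes(x = age_group, y = mean_value)) +
  geom_line(aes(group = '', color = age_group %in% c('<18', '>70')))

built <- layer_data(p)

n_groups <- length(unique(built$group))    # one group, so one connected line
n_colours <- length(unique(built$colour))  # two colors along that line
```

    Grouping and coloring are tracked in separate columns, which is what lets a single line change color along the way.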


    Conclusion

    That’s a wrap! I hope this helped you finally understand what the group aesthetic does.
