Demystifying geom_bar() – How ggplot2 Automatically Counts and Transforms Data

[This article was first published on coding-the-past, and kindly contributed to R-bloggers.]


ggplot2 is a powerful and well-known data visualization package for R. But do you know what gg stands for? It actually refers to the Grammar of Graphics, a conceptual framework for understanding and constructing graphs. The core idea behind the Grammar of Graphics is that a plot consists of multiple layers.


The most well-known layers are geometries — the geometric forms that represent data in a plot — and aesthetic mappings, which connect data to specific visual properties. A lesser-known but equally important layer is the statistical layer, which transforms the original data to enable specific types of plots. This may sound complex at first, but it’s actually quite intuitive. In this lesson, we will explore how geom_bar() applies a statistical transformation to make bar plots simpler and more straightforward.


1. How does geom_bar work by default?

To exemplify geom_bar’s default behavior, we will use a dataset about Westminster inquests conducted between 1760 and 1799. These inquests document investigations into deaths that occurred under sudden, unexplained, or suspicious circumstances. To learn more, please visit the project webpage London Lives 1690-1800: Crime, Poverty and Social Policy in the Metropolis.


The first step is to load the data using read_tsv(), a function from the readr package that reads tab-separated values. The verdict variable records the conclusion of the investigation, for example, that the death was a homicide or a suicide. To simplify our analysis, we merge ‘suicide (delirious)’, ‘suicide (felo de se)’, and ‘suicide (insane)’ into a single category: ‘suicide’. We also filter out observations where the verdict or gender is missing.



library(readr)
library(dplyr)
library(ggplot2)

df <- read_tsv("wa_coroners_inquests_v1-1.tsv")

df_prep <- df %>% 
  filter(verdict != "-") %>% 
  filter(gender %in% c("m", "f")) %>% 
  mutate(verdict = recode(verdict, "suicide (delirious)" = "suicide",
                          "suicide (felo de se)" = "suicide",
                          "suicide (insane)" = "suicide"))


Each row of df_prep contains data about the investigation of one death, including the date, gender, and verdict. We would like a first overview of the verdicts to see how many deaths were classified as homicide, suicide, accidental, etc. The default behavior of geom_bar makes it very easy to visualize this information:



theme_set(theme_bw()) # chooses a lighter ggplot2 theme: theme_bw()

ggplot(data = df_prep)+
    geom_bar(aes(x=verdict))


[Figure: geom_bar plot]


Why does this work if we only mapped a categorical variable to x? Where does ggplot2 get the count for each cause of death? Well, every geometry in ggplot2 has an associated default statistical transformation (stat) that tells ggplot2 whether to plot the raw input data as it is or to transform the dataset first and then plot the result. In the case of geom_bar, the default stat is "count". That means ggplot2 internally creates a second data frame with the values of verdict and their respective counts, as shown in the figure below.


[Figure: Statistical transformation in ggplot2]
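To make the transformation more concrete, here is a minimal sketch of roughly what stat = "count" computes, written with dplyr. The toy data frame and its values are made up for illustration and do not come from the inquests dataset.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the inquests dataset)
toy <- tibble(verdict = c("accidental", "suicide", "accidental",
                          "homicide", "accidental", "suicide"))

# Roughly what stat = "count" does: one row per category with its frequency
toy_counts <- toy %>%
  count(verdict, name = "count")

toy_counts
```

The resulting table has one row per verdict, which is the shape of data that the bars are actually drawn from.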


As you can see, ggplot2 does this work for you. But what if your data has already been aggregated? In that case, you need to explicitly set stat = "identity", as in geom_bar(aes(x = verdict, y = count), stat = "identity"). With stat = "identity", ggplot2 takes the input data as it is and does not perform any transformation. Note that both x and y must then be mapped.
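As an illustration of the pre-aggregated case, the sketch below plots a small table of counts with stat = "identity". The data frame verdict_counts and its numbers are hypothetical. It also shows geom_col(), which uses stat = "identity" by default and is therefore a shorter way to write the same plot.

```r
library(ggplot2)

# Hypothetical pre-counted data: one row per verdict (values are made up)
verdict_counts <- data.frame(
  verdict = c("accidental", "homicide", "suicide"),
  count   = c(30, 5, 12)
)

# stat = "identity" plots the y values exactly as supplied
p_identity <- ggplot(data = verdict_counts) +
  geom_bar(aes(x = verdict, y = count), stat = "identity")

# geom_col() uses stat = "identity" by default, so this is the same plot
p_col <- ggplot(data = verdict_counts) +
  geom_col(aes(x = verdict, y = count))

p_identity
```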


Tip: Run `layer_data(plot = last_plot(), i = 1L)` right after the plot command to inspect the data ggplot2 computed for you. It returns the transformed data of the last plot for layer i = 1L, that is, the first layer of the plot (geom_bar in this case).
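The tip above can be tried on a small example. This sketch uses made-up data and stores the plot in a variable (p, a name chosen here for illustration) so that layer_data() does not depend on which plot was drawn last.

```r
library(ggplot2)

# Toy data (hypothetical values)
toy <- data.frame(verdict = c("accidental", "suicide", "accidental",
                              "homicide", "accidental", "suicide"))

p <- ggplot(data = toy) +
  geom_bar(aes(x = verdict))

# Passing the plot object explicitly avoids relying on last_plot()
computed <- layer_data(p, i = 1L)

# x is the position of each category, count is the computed frequency
computed[, c("x", "count")]
```

Besides x and count, the returned data frame contains further columns that ggplot2 needs for drawing, such as the bar boundaries.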




2. How to reorder geom_bar?

One improvement we can make to our plot is to reorder the verdicts so that the most frequent one comes first. This can be done with the help of the forcats package. One of its functions, fct_infreq(), reorders a variable based on the frequency of its values (largest first).



library(forcats) # provides fct_infreq()

ggplot(data = df_prep)+
    geom_bar(aes(x=fct_infreq(verdict)))



[Figure: reorder geom_bar]
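To see what fct_infreq() does to the underlying variable, independent of any plot, the following sketch compares factor level orders on a made-up character vector. It also uses fct_rev(), another forcats function, as one possible way to put the least frequent category first.

```r
library(forcats)

# Toy vector (hypothetical values)
verdicts <- c("suicide", "accidental", "homicide",
              "accidental", "suicide", "accidental")

# Default factor levels are alphabetical
lv_default <- levels(factor(verdicts))

# fct_infreq() orders levels by descending frequency
lv_infreq <- levels(fct_infreq(verdicts))

# fct_rev() flips that order, so the least frequent level comes first
lv_reversed <- levels(fct_rev(fct_infreq(verdicts)))

lv_infreq
```

Because ggplot2 draws discrete axes in the order of the factor levels, changing the level order is enough to reorder the bars.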




3. Stacked and percent stacked geom_bar

Imagine now that you would like to investigate how the verdicts compare across genders, highlighting the cases involving female individuals. This can easily be achieved by mapping gender to the fill aesthetic. The result is that each bar is split into two stacked segments, one for male and one for female cases.


In the code below, we also make our plot more visually attractive by changing the fill colors, the legend title, and the legend labels. We adjust the axis labels as well.



ggplot(data = df_prep)+
    geom_bar(aes(x=fct_infreq(verdict), fill = gender))+
    scale_fill_manual(name = "", values = c("#f79326", "gray"), labels = c("Female", "Male"))+
    labs(x = "", y = "Number of Cases")


[Figure: stacked geom_bar]


The stacked bar chart above results from the default position = "stack". To compare the distribution of female and male cases within each cause of death (verdict), we can display proportions instead of absolute counts. This makes it easier to see in which verdict category females account for a larger share. To achieve this, set position = "fill" in geom_bar(), which rescales every bar to a total height of 1.



ggplot(data = df_prep)+
    geom_bar(aes(x=fct_infreq(verdict), fill = gender), position = "fill")+
    scale_fill_manual(name = "", values = c("#f79326", "gray"), labels = c("Female", "Male"))+
    labs(x = "", y = "Percentage")


[Figure: percent stacked geom_bar]
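With position = "fill", the y axis runs from 0 to 1. If you prefer tick labels such as 25% and 50%, one option is to format the axis with the scales package, which is installed together with ggplot2. The sketch below uses made-up data and is only one possible way to do this, not the approach used for the plots above.

```r
library(ggplot2)

# Toy data (hypothetical values)
toy <- data.frame(
  verdict = c("accidental", "accidental", "accidental", "accidental",
              "suicide", "suicide"),
  gender  = c("f", "m", "m", "m", "f", "m")
)

p_fill <- ggplot(data = toy) +
  geom_bar(aes(x = verdict, fill = gender), position = "fill") +
  # position = "fill" yields proportions in [0, 1]; show them as percentages
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "", y = "Share of cases")

# Height of each bar segment as computed by ggplot2
fill_data <- layer_data(p_fill, i = 1L)
segment_heights <- fill_data$ymax - fill_data$ymin

p_fill
```

The segment heights within each bar sum to 1, which is why the bars all have the same total height.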


Now it is clearer that, among all causes of death, homicides have the highest proportion of women. Moreover, the smallest percentage of female cases corresponds to accidental deaths.




4. Use stat_bin to group observations by date

Further examining the data, you want to study how the proportion of suicide cases among women has evolved over time. One way to do this is by filtering only suicide verdicts and visualizing the proportion of female suicide cases across time. Since we have data spanning multiple years, it is a good idea to group them into bins and count the cases within each period. This can be done using stat_bin(), which works similarly to geom_bar() but groups data into bins.


Since our dataset is in a tidy format, where each row represents a single case, we can count the number of rows that fall into each bin to determine how many cases occurred in each time interval. That’s why we set x to doc_date, the date of the investigation. Additionally, we can specify the number of bins through the bins parameter. In the code below, we set bins = 10. We also set color = "white" to create white borders around the bars. Apart from these modifications, the code remains the same as in the geom_bar() example above.



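# Keep only suicide verdicts. This step is described in the text but was not
# shown in the code; it assumes doc_date was parsed as a Date by read_tsv().
df_prep_2 <- df_prep %>% 
  filter(verdict == "suicide")

ggplot(data = df_prep_2)+
    stat_bin(aes(x=doc_date, fill = gender), 
             position = "fill", 
             bins = 10, 
             color = "white")+
    scale_fill_manual(name = "", values = c("#f79326", "gray"), labels = c("Female", "Male"))+
    labs(x = "", y = "Percentage")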


[Figure: percent stacked geom_bar for suicide cases]
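To check what stat_bin() computes before the bars are rescaled, you can inspect the binned counts with layer_data(). The sketch below uses evenly spaced, made-up dates rather than the inquests data, and leaves out position = "fill" so that the raw counts per bin are easier to read.

```r
library(ggplot2)

# Hypothetical dates spread evenly over the period (not real inquest data)
toy <- data.frame(
  doc_date = seq(as.Date("1760-01-01"), as.Date("1799-12-31"),
                 length.out = 200),
  gender   = rep(c("f", "m", "m", "m"), times = 50)
)

p_bins <- ggplot(data = toy) +
  stat_bin(aes(x = doc_date, fill = gender), bins = 10, color = "white")

# One row per bin and gender, with the number of cases in the count column
binned <- layer_data(p_bins, i = 1L)

# Every input row lands in exactly one bin
total_binned <- sum(binned$count)

head(binned[, c("group", "xmin", "xmax", "count")])
```

The xmin and xmax columns give the boundaries of each bin, so you can see which time interval every bar covers.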


The plot shows a slight decreasing trend in the proportion of female suicide cases between 1760 and 1800. It also highlights that, throughout the entire period, males accounted for at least 60% of suicide cases.


I would love to hear any feedback or suggestions for improving the plots above. Feel free to share your thoughts or ask any questions in the comments below! Happy coding!




Conclusions

  • geom_bar and stat_bin are powerful tools to depict frequencies of subgroups in your data;
  • The geom_bar stat and position parameters allow users to plot several kinds of bar plots, turning geom_bar into a versatile visualization tool.




I would like to thank June Choe for this brilliant explanation about stat layers in ggplot2. Also, thanks a lot to Sharon Howard for preparing this thought-provoking dataset and for making it available.
