New statistical geoms in {ggxmean}

[This article was first published on RStudio | Open source & professional software for data science teams on RStudio, and kindly contributed to R-bloggers.]

This is a guest post from Gina Reynolds with contributions from 3rd- and 4th-year West Point Math majors Morgan Brown and Madison McGovern. Gina works in data analytics and teaches statistics and probability at West Point. Her work focuses on tools for proximate comparison and translation in data analysis and visualization.

TL;DR

The ggxmean package introduces new geom_*s for fluid visual description of some basic statistical concepts. The ‘titular character’, geom_x_mean, draws a vertical line at the mean of x.

On the Path to ggxmean

A few years ago, I was sitting on the floor of a packed-out ballroom watching Thomas Lin Pedersen’s talk, ‘Extend your Ability to Extend ggplot2’.

‘I want to do that,’ I thought.

And I had a use case in mind: statistical summaries, especially those used to explain fundamental statistical concepts like covariance, standard deviation, and correlation.

You can visually walk through these concepts at a chalkboard, dissecting the equations for their computation. With ggplot2, you can, of course, get this done as well. I put together that walkthrough here:

So, math notation and visual representation builds of basic statistics! They co-evolve speaking to different learning styles. Plus DRY principles for coders and a walkthrough of calc w num vals, for numerophiles! #ggplot2 #xaringan #flipbookr #rstats https://t.co/JgWLxo94Ms pic.twitter.com/ol08lMGdtD

— Gina Reynolds (@EvaMaeRey) June 25, 2020

But to choreograph this, there was a lot of prep that I needed to do before starting to visualize. I had to calculate the means, standard deviations, etc., all before beginning to plot, and then feed those calculations into existing geom_* functions like geom_vline and geom_segment.
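To make that routine concrete, here is a minimal sketch of the precompute-then-plot pattern. It assumes the built-in cars data and is not the original walkthrough code; the names x_mean and x_sd are just illustrative.

```r
library(ggplot2)

# Step 1: compute the summaries by hand, before any plotting
x_mean <- mean(cars$speed)
x_sd   <- sd(cars$speed)

# Step 2: feed the precomputed values into existing geoms
p <- ggplot(cars) +
  aes(x = speed, y = dist) +
  geom_point() +
  geom_vline(xintercept = x_mean) +                      # mean of x
  geom_vline(xintercept = x_mean + c(-1, 1) * x_sd,      # one sd either side
             linetype = "dashed")
p
```

Every new dataset or grouping means redoing step 1 by hand, which is where the fragility comes from.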

This didn’t feel like the powerful declarative experience that you have a lot of the time using ggplot2. Compare that to the experience that you get with the boxplot. That goes something like this:
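The original example is not reproduced here, so the following is a minimal stand-in that assumes the built-in iris data rather than whatever dataset the post used:

```r
library(ggplot2)

# One declarative layer; ggplot2 computes the five-number summary itself
p <- ggplot(iris) +
  aes(x = Species, y = Sepal.Length) +
  geom_boxplot()
p

# The computed summaries can be inspected after the fact
layer_data(p)[, c("ymin", "lower", "middle", "upper", "ymax")]
```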

In this boxplot example, lots of computation happens in the background for us: min, max, 25%, 75%, median. And that is great. I understand the boxplot well; I don’t need to do those computations myself. I’m happy for ggplot2 to do that for me.

For the covariance/variance/correlation stats walkthroughs, I wanted to have the same declarative experience. I understand the mean well, and one standard deviation away from the mean, etc. I should be able to ask ggplot2 to do that computation for me: to compute the global mean (or a group-wise mean if I’m in the mood for that) and put a vertical line there.

My solution to choreographing the stats visualizations with ‘base ggplot2’ (without using the extension mechanisms) felt inelegant and fragile. It wasn’t very portable (not easy to move to other data – maybe data that my students or I might be more passionate about) or dynamic (I couldn’t easily do group-wise work instead of acting globally). It wasn’t much fun.

Thomas’ talk and the extension system seemed like the answer to bringing ggplot2’s fluid feel to these particular statistical stories.

Fast forward a few years. I consulted great materials on extending ggplot2: the ‘Extending ggplot2’ vignette, the ‘Extension’ chapter in the newest edition of the ggplot2 book, Thomas Lin Pedersen’s talk (again), the ggplot2 code on GitHub, and code from other extension packages in the ggplot2 extension gallery.

Using those resources, I managed to write the geom_x_mean() function and friends. And now I’m happy to introduce the ggxmean package!
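For readers curious about the mechanics, here is a rough sketch of how a mean-line layer can be built with ggplot2's extension system. It is an illustration under assumptions, not the package's actual source: StatXMeanSketch and geom_x_mean_sketch are hypothetical names, and the real geom_x_mean() may be implemented differently.

```r
library(ggplot2)

# Hypothetical stat: reduce each group's x values to a single mean
StatXMeanSketch <- ggproto(
  "StatXMeanSketch", Stat,
  required_aes = "x",
  # x and y are intentionally consumed; declaring them avoids a warning
  dropped_aes = c("x", "y"),
  compute_group = function(data, scales) {
    data.frame(xintercept = mean(data$x, na.rm = TRUE))
  }
)

# Hypothetical layer function: pair the stat with the existing vline geom
geom_x_mean_sketch <- function(mapping = NULL, data = NULL,
                               position = "identity", na.rm = FALSE,
                               show.legend = NA, inherit.aes = TRUE, ...) {
  layer(
    stat = StatXMeanSketch, geom = GeomVline,
    data = data, mapping = mapping, position = position,
    show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

# Faceting triggers a per-panel recomputation with no extra work
p <- ggplot(iris) +
  aes(x = Sepal.Length, y = Sepal.Width) +
  geom_point() +
  geom_x_mean_sketch() +
  facet_wrap(facets = vars(Species))
p
```

Because the mean is computed inside the stat, the same layer works unchanged on other data and under grouping or faceting.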

I’m excited about these functions because I think the syntax mirrors the chalkboard experience: naming concepts one at a time and easily depicting them.

Moreover, ggxmean allows you to do this visual storytelling beyond what you might do on a chalkboard: port the work routine to other datasets that your students find gripping, work with larger data sets (chalkboard work tends to be super small worked examples), and do group-wise computations!

Regarding this last point, in the plot that follows on the palmerpenguins data, ggplot2 instantly recomputes everything for us by species when we add the faceting declaration! ggplot2 is hard at work in the background, being its awesome self. [1]

library(tidyverse)
library(ggxmean)
palmerpenguins::penguins %>% 
  ggplot() +
  aes(x = bill_length_mm) +
  aes(y = flipper_length_mm) +
  geom_point() +
  ggxmean::geom_x_mean() +
  ggxmean::geom_y_mean() +
  ggxmean:::geom_xdiff() +
  ggxmean:::geom_ydiff() +
  ggxmean:::geom_x1sd(linetype = "dashed") +
  ggxmean:::geom_y1sd(linetype = "dashed") +
  ggxmean:::geom_diffsmultiplied() +
  ggxmean:::geom_xydiffsmean(alpha = 1) +
  ggxmean:::geom_rsq1() +
  ggxmean:::geom_corrlabel() +
  facet_wrap(facets = vars(species))

Way Leads Onto Way…

Another set of geoms that ggxmean offers targets another intro stats topic: ordinary least squares (OLS) regression. In stats classes across the world, teachers name various statistical concepts as they teach OLS. Again, instructors tend to visualize these with toy datasets on the classroom chalkboard; this is great! ggxmean attempts to isolate some of those concepts and package them into geom_* functions to mirror that chalkboard experience:

library(tidyverse)
library(ggxmean)
#library(transformr) #might help w/ animate

## basic example code
cars %>% 
  ggplot() +
  aes(x = speed,
      y = dist) +
  geom_point() + 
  ggxmean::geom_lm() +
  ggxmean::geom_lm_residuals(linetype = "dashed") +
  ggxmean::geom_lm_fitted(color = "goldenrod3", size = 3) +
  ggxmean::geom_lm_conf_int() +
  ggxmean::geom_lm_pred_int() +
  ggxmean::geom_lm_formula() +
  ggxmean::geom_lm_intercept(color = "red", size = 5) +
  ggxmean::geom_lm_intercept_label(size = 4, hjust = 0)
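As a point of comparison, the fitted line and residuals named above can also be computed by hand and drawn with stock geoms. This sketch assumes only base R and ggplot2; the column names dist_hat and residual are illustrative, and nothing here claims to match how ggxmean computes its layers internally.

```r
library(ggplot2)

# Fit the OLS model and attach fitted values and residuals to the data
fit <- lm(dist ~ speed, data = cars)
cars_fit <- cars
cars_fit$dist_hat <- unname(fitted(fit))     # points on the regression line
cars_fit$residual <- unname(residuals(fit))  # observed minus fitted

p <- ggplot(cars_fit) +
  aes(x = speed, y = dist) +
  geom_point() +
  geom_line(aes(y = dist_hat)) +                          # fitted line
  geom_segment(aes(xend = speed, yend = dist_hat),        # residual segments
               linetype = "dashed")
p
```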

Extending the Scope of ggxmean: Student Contributions

The work on OLS was a jumping-off point for the most recent additions to the ggxmean package. Morgan Brown and Madison McGovern, students at West Point, contributed to the package for independent studies in the fall AY2022 term. I’m incredibly excited to show you their work.

Morgan and Madison took up the question of data outliers. Here, we apply their work to famous toy datasets: Anscombe’s quartet and the datasauRus Dozen. With the functions I’d worked on, we can visualize the summary statistics (mean, sds, correlation) that are typically the subject of discussions of Anscombe’s quartet and the datasauRus Dozen. This is shown here:

# first some data munging
datasets::anscombe %>%
  pivot_longer(cols = 1:8) %>%
  mutate(group = paste("Anscombe", 
                       str_extract(name, "\\d"))) %>%
  mutate(var = str_extract(name, "\\w")) %>%
  select(-name) %>%
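  pivot_wider(names_from = var,
              values_from = value,
              values_fn = list) %>%   # keep every value per group as a list-column
  unnest(cols = c(x, y)) ->           # expand x and y back out, paired row by row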
tidy_anscombe

tidy_anscombe %>%
  ggplot() +
  aes(x = x, y = y) +
  geom_point() +
  aes(color = group) +
  facet_wrap(facets = vars(group)) +
  ggxmean::geom_x_mean() +
  ggxmean::geom_y_mean() +
  ggxmean:::geom_x1sd(linetype = "dashed") +
  ggxmean:::geom_y1sd(linetype = "dashed") +
  ggxmean::geom_lm() +
  ggxmean::geom_lm_formula() +
  ggxmean:::geom_corrlabel() + 
  guides(color = "none")

But Anscombe and datasauRus constellations are pretty special. And looking at statistics describing outlyingness also makes sense. Using Morgan and Madison’s functions on leverage and influence, we can easily highlight outlying observations!

In the following plot, Morgan’s function geom_text_leverage() calculates leverage for each observation:

tidy_anscombe %>%
  ggplot() +
  aes(x = x, y = y) +
  aes(color = group) +
  geom_point() +
  facet_wrap(facets = vars(group)) +
  ggxmean::geom_text_leverage(vjust = 1,   ## A function Morgan wrote for ggxmean!
                              check_overlap = TRUE) + 
  guides(color = "none")
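For reference, leverage in a simple linear fit can be computed directly with base R's hatvalues(). The sketch below assumes the fourth Anscombe set and does not claim to mirror the internals of geom_text_leverage().

```r
# Leverage is the diagonal of the hat matrix of the fitted model
fit4 <- lm(y4 ~ x4, data = datasets::anscombe)
leverage <- hatvalues(fit4)

round(leverage, 2)    # one value per observation
which.max(leverage)   # the observation with the most pull on the fit
```

In this set every x value is 8 except a single 19, so that one observation carries a leverage of 1 and the rest share 0.1 each.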

And in the datasauRus::datasaurus_dozen, Madison’s geom_point_high_cooks() highlights the 10% most influential observations:

datasauRus::datasaurus_dozen %>%
  ggplot() +
  aes(x = x, y = y) +
  geom_point() +
  ggxmean::geom_point_high_cooks( ## A function Madison wrote for ggxmean!
    color = "goldenrod",
    alpha = .5,
    size = 5) + 
  facet_wrap(facets = "dataset")
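The same idea can be sketched without the package, using base R's cooks.distance(). This assumes the built-in cars data and a plain 90th-percentile cutoff; whether geom_point_high_cooks() picks its top 10% in exactly this way is an assumption.

```r
library(ggplot2)

# Cook's distance for every observation in a simple linear fit
fit <- lm(dist ~ speed, data = cars)
cars_cooks <- cars
cars_cooks$cooks <- unname(cooks.distance(fit))

# Flag observations above the 90th percentile of Cook's distance
cars_cooks$high <- cars_cooks$cooks > quantile(cars_cooks$cooks, 0.9)

p <- ggplot(cars_cooks) +
  aes(x = speed, y = dist) +
  geom_point() +
  geom_point(data = subset(cars_cooks, high),   # overlay the flagged points
             color = "goldenrod", alpha = .5, size = 5)
p
```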

Using ggxmean

In my day-to-day analytic work, I’m glad to have the ggxmean functions ready to go. The function I use most is, not surprisingly, geom_x_mean() for marking the global and group-wise means! In the classroom, of course, the ggxmean functions are fun to apply to a variety of datasets used in class after a good, old-fashioned chalkboard walkthrough.

The package is not yet on CRAN, so to give it a spin yourself, use:

remotes::install_github("EvaMaeRey/ggxmean")

We’re open to your feedback and contributions on code, computation, and conventions (function names, arguments, etc.)!


  1. Some of these functions aren’t exported because I’m not yet confident about their names, among other considerations. Consider weighing in on the issues at https://github.com/EvaMaeRey/ggxmean.
