Scaling Density Plots

[This article was first published on R - datawookie, and kindly contributed to R-bloggers.]

I’m a density plot devotee. And with geom_density() from {ggplot2}, these plots are effortless to produce. However, sometimes the results of geom_density() are not exactly what I’m after. Here’s how I tweak them to give me precisely what I need.

The Data

We’ll use a slightly modified version of the penguins data from the {palmerpenguins} package. The data have been filtered to reduce the number of records for Chinstrap penguins by 50% and the number of records for male penguins (all species) by 75%. The distribution of samples across the species and sex dimensions is now skewed, with male and Chinstrap penguins being relatively scarce.

I have also included data for the recently discovered (and possibly apocryphal) Sparkle penguin species (believed to have been named by a precocious 6 year old with a passion for shiny things and unicorns).

# A tibble: 8 × 3
  species   sex    count
  <fct>     <fct>  <int>
1 Adelie    female    73
2 Adelie    male      18
3 Chinstrap female    17
4 Chinstrap male       4
5 Gentoo    female    58
6 Gentoo    male      15
7 Sparkle   female    80
8 Sparkle   male      10

The total sample count is 275, of which 47 are male and 228 are female.
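
The filtering code itself isn't shown, so what follows is only a sketch of one way to build data with the same shape. It assumes the {dplyr} and {palmerpenguins} packages. The seed, the object names (thinned, sparkle_sketch, penguins_sketch) and the use of runif() for the Sparkle flipper lengths are all assumptions for illustration, not the original code. The particular records that survive the thinning will differ from those behind the table above, although the group counts should agree.

```r
library(dplyr)
library(palmerpenguins)

set.seed(42)  # arbitrary; only makes this sketch reproducible

# Keep 50% of Chinstrap records and 25% of male records (all species).
thinned <- penguins |>
  filter(!is.na(sex)) |>
  select(species, sex, flipper_length_mm) |>
  mutate(
    keep_frac = ifelse(species == "Chinstrap", 0.5, 1) *
      ifelse(sex == "male", 0.25, 1)
  ) |>
  group_by(species, sex) |>
  # sample(n()) is a random permutation of 1..n within each group.
  filter(sample(n()) <= floor(n() * keep_frac)) |>
  ungroup() |>
  select(-keep_frac)

# Hypothetical Sparkle penguins: flipper lengths uniform on [180, 230] mm.
sparkle_sketch <- tibble(
  species = "Sparkle",
  sex = rep(c("female", "male"), times = c(80, 10)),
  flipper_length_mm = runif(90, min = 180, max = 230)
)

penguins_sketch <- bind_rows(thinned, sparkle_sketch) |>
  mutate(species = factor(species), sex = factor(sex))

# Tabulate samples by species and sex.
count(penguins_sketch, species, sex, name = "count")
```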

Sparkle Penguins

Let’s start by focusing our attention on those Sparkle penguins. The data consist of 10 male and 80 female Sparkle penguins. Let’s generate a density plot of flipper length using geom_density().

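ggplot(sparkle) +
  geom_density(aes(x = flipper_length_mm, fill = sex), alpha = 0.5) +
  # Dashed lines mark the bounds of the uniform distribution.
  geom_vline(xintercept = c(180, 230), linetype = "dashed")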

One of the unique (and remarkable!) characteristics of this species is that the length of their flippers is uniformly distributed between 180 and 230 mm. These bounds are indicated by the vertical dashed lines. The above plot is completely consistent with this: the flipper length density is the same (or at least very similar!) for the two sexes. The difference between the curves for male and female is an artifact of the kernel density estimator used by geom_density(). Since there are more observations of female Sparkle penguins, the default smoothing bandwidth is typically narrower for them, so their estimated distribution is sharper (closer to square). The area under both curves is 1, which means that each curve can be interpreted as a probability density function (PDF).
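
To see the unit area and the effect of sample size outside of a plot, here is a small sketch with simulated data. It uses base R's density() directly rather than geom_density(), which as far as I know relies on the same estimator and defaults but evaluates the curve only over the plotted x range. The simulated samples, the seed and the area() helper are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed for this illustration

# Simulated flipper lengths: uniform on [180, 230], as for the Sparkle penguins.
female <- runif(80, min = 180, max = 230)
male   <- runif(10, min = 180, max = 230)

# density() defaults to a Gaussian kernel with the "nrd0" bandwidth rule.
d_female <- density(female)
d_male   <- density(male)

# Trapezoidal rule for the area under an estimated curve.
area <- function(d) sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

c(female = area(d_female), male = area(d_male))  # both close to 1
c(female = d_female$bw, male = d_male$bw)        # bandwidth chosen for each sample
```

The "nrd0" rule scales the bandwidth by n^(-1/5), so a larger sample generally gets a narrower kernel and a curve with steeper edges.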

But what if we want to actually plot the density of observations (in penguins per mm)? To do this we need to add in a y aesthetic and use the after_stat() function to delay the mapping.

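ggplot(sparkle) +
  geom_density(aes(x = flipper_length_mm, y = after_stat(count), fill = sex), alpha = 0.5) +
  # The rug shows the individual observations in each facet.
  geom_rug(aes(x = flipper_length_mm)) +
  facet_grid(sex ~ .)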

The shape of the curves remains the same, but now the area under the curves reflects the number of samples for each sex and the height of the curve represents the density of penguins in the sample (in penguins per mm). I’ve split the plot into two facets and overlaid a rug onto each to show the actual distribution of the samples.
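
The relationship between the two scalings can be checked on the values that {ggplot2} computes. The sketch below uses layer_data() to pull out the computed variables and confirms that count is just density multiplied by the number of samples in each group. The toy data frame stands in for the Sparkle data and, like the seed and object names, is an assumption for illustration.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed for this illustration

# Stand-in for the Sparkle data: 80 females and 10 males, uniform flipper lengths.
toy <- data.frame(
  sex = rep(c("female", "male"), times = c(80, 10)),
  flipper_length_mm = runif(90, min = 180, max = 230)
)

p <- ggplot(toy) +
  geom_density(aes(x = flipper_length_mm, y = after_stat(count), fill = sex), alpha = 0.5)

# One row per evaluation point, with both density and count available.
computed <- layer_data(p)

# Groups are numbered in the sorted order of sex: female = 1, male = 2.
n_per_group <- as.vector(table(toy$sex))[computed$group]

all.equal(computed$count, computed$density * n_per_group)
```

Since the same curve is simply multiplied by a constant within each group, the shape is unchanged and only the vertical scale differs.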

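So now we have two different views of the data arising from geom_density(): the default view, in which the area under each curve is 1 and the curve can be read as a PDF, and the count view, in which the area under each curve reflects the number of samples and the height is a density in penguins per mm.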

All the Penguins

Let’s broaden our scope and include all of the penguins. First let’s take a look at the vanilla output from geom_density(). This shows us the distribution of flipper length across all species, broken down by sex. Each curve gives the appropriate PDF. If we wanted to generate samples of flipper length with the appropriate distribution, then these are the curves that we would want.

ggplot(penguins) +
  geom_density(aes(x = flipper_length_mm, fill = sex), alpha = 0.5) +
  facet_grid(sex ~ .)

If, however, we provide a delayed count as the y aesthetic then we get the count density of penguin samples in the data. These curves tell us more about the actual sampled data than they do about the underlying distributions.

ggplot(penguins) +
  geom_density(aes(x = flipper_length_mm, y = after_stat(count), fill = sex), alpha = 0.5)

Penguins on the Ridges

The {ggridges} package includes geoms which provide a complementary view to geom_density() and work particularly well when you need to break the data down into a number of categories. The same two views can be produced here too.

ggplot(penguins) +
  geom_density_ridges(
    aes(x = flipper_length_mm, y = species, fill = sex),
    scale = 1.5,
    alpha = 0.5
  )

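Because geom_density_ridges() uses the y aesthetic to determine the ridge offset, we use the height aesthetic to specify the delayed count. We also set stat = "density" so that the curves are computed by {ggplot2}’s own density stat, which provides count.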

ggplot(penguins) +
  geom_density_ridges(
    aes(x = flipper_length_mm, y = species, fill = sex, height = after_stat(count)),
    stat = "density",
    scale = 1.5,
    alpha = 0.5
  )

Something similar could be achieved directly with {ggplot2} by using facets, but I think that ridgeline plots really are 🚀.
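
For comparison, here is a rough sketch of the faceted alternative, with one facet row per species taking the place of the ridge offset. It uses the unmodified penguins data from {palmerpenguins} (minus records with unknown sex) rather than the skewed data described above, and the object names peng and p_facets are assumptions for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Unmodified data, minus records with unknown sex.
peng <- subset(penguins, !is.na(sex))

p_facets <- ggplot(peng) +
  geom_density(
    aes(x = flipper_length_mm, y = after_stat(count), fill = sex),
    alpha = 0.5
  ) +
  # One row per species stands in for the ridge offset.
  facet_grid(species ~ .)

p_facets
```

Unlike a ridgeline plot, the facets cannot overlap, so this layout takes more vertical space for the same number of categories.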
