Site icon R-bloggers

Box Plot Alternatives: Beeswarm and Violin Plots

[This article was first published on R on datascienceblog.net: R for Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Box plots are great as they do not only indicate the median value but also show the variation of the measurements in terms of the 1st and 3rd quartiles. There are, however, also plots that provide a bit of additional information. Here, we take a closer look at potential alternatives to the box plot: the beeswarm and the violin plot.

The beeswarm plot

An implementation of the beeswarm plot is available via the beeswarm package. In principle, the beeswarm plot resembles a one-dimensional scatter plot because it displays individual measurements as points. The difference, however, is that the beeswarm plot applies a logic ensuring that plotted points are close to each other and do not overlap.

So, when should you use a beeswarm plot? Since the points in a beeswarm plot should not overlap, this type of plot is only suitable for a relatively small number of measurements. They are also suitable if measurements exhibit different groups because individual points can be colored accordingly.

Using the beeswarm plot in R

We will use the warpbreaks data set to exemplify the use of the plot.

data(warpbreaks)
library(beeswarm)
# option 1: color by group
beeswarm(breaks ~ wool + tension, data = warpbreaks,
    main = "Beeswarm of breaks versus wool and tension",
    col = rep(c("red", "blue"), 3))

# option 2: color individual points
beeswarm(breaks ~ wool, data = warpbreaks,
    main = "Breaks versus wool",
    pwcol = as.numeric(tension), pch = 16)
legend("topright", legend = levels(warpbreaks$tension),
       title = "Tension", pch = 16, col = 1:3)

The beeswarm plot can also be combined with a box plot in the following way:

boxplot(breaks ~ wool, data = warpbreaks, 
        main = "Boxplot with beeswarm",
    # do not duplicate outliers:
    outline = FALSE)
beeswarm(breaks ~ wool, data = warpbreaks,
    main = "Beeswarm of breaks versus wool", add = TRUE,
    pwcol = as.numeric(tension), pch = 16)
legend("topright", legend = levels(warpbreaks$tension),
       title = "Tension", pch = 16, col = 1:3)

The violin plot

The idea of a violin plot is to combine a box plot with a density plot. Since it relies on density estimation, the plot only makes sense if a sufficient number of data are available for obtaining reliable estimates. Otherwise, the estimated densities may indicate trends that are not really in the data.

Creating a violin plot in R

To demonstrate how a violin plot is different from a box plot, compare the following two representations:

library(ggplot2)
require(gridExtra)
p.violin <- ggplot(warpbreaks, aes(x = wool, y = breaks)) +
    # add horizontal lines at Q1, Q2, and Q3
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
    ggtitle("Violin plot")
p.box <- ggplot(warpbreaks, aes(x = wool, y = breaks)) +
    geom_boxplot() + ggtitle ("Box plot")
grid.arrange(p.violin, p.box, ncol = 2)

In this case, we see the limitation of the violin plot for small sample sizes (hint: the limitation is not that the plot does not seem to show violins but vases). The violin for wool A stretches up to the outliers at a value of 65 indicating. Since the width is similar at values 40 and 60, one could think that there are many such measurements. The box plot, on the other hand, reveals that there are indeed only two measurements with a value greater than 60.

To leave a comment for the author, please follow the link and comment on their blog: R on datascienceblog.net: R for Data Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.