
What is a sunflower plot?

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers.]

A sunflower plot is a type of scatterplot that tries to reduce overplotting. When multiple observations share the same (x, y) values, a sunflower plot draws just one point at that location, with short line segments (or “petals”) radiating from it to indicate how many observations are really there.

It’s best to see this via an example. Here is a plot of carb vs. gear from the mtcars dataset:

plot(mtcars$gear, mtcars$carb,
     main = "Plot of carb vs. gear")

From the plot it looks like there are only 11 data points. However, if we check the Environment tab in RStudio (or run nrow(mtcars)), we see that there are actually 32 observations in the dataset: some of the observations simply have the same (gear, carb) values and are drawn on top of one another.
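The gap between what the plot shows and what the data contains can be checked directly. The snippet below is a minimal sketch; the variable names n_obs and n_distinct are only illustrative.

```r
# Total number of observations in the dataset
n_obs <- nrow(mtcars)

# Number of distinct (gear, carb) combinations, i.e. the number of
# visibly separate points in the scatterplot
n_distinct <- nrow(unique(mtcars[, c("gear", "carb")]))

c(observations = n_obs, distinct_points = n_distinct)
```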

Let’s see how a sunflower plot deals with this overplotting. Base R comes with a function, sunflowerplot (in the graphics package), that does just this:

sunflowerplot(mtcars$gear, mtcars$carb,
              main = "Plot of carb vs. gear")

This tells us, for example, that there are 3 observations with (gear, carb) = (3, 1).
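The petal counts can be cross-checked against a plain contingency table. This is a small sketch using table(); the name counts is just for illustration.

```r
# Cross-tabulate gear against carb: each cell is the number of
# observations at that (gear, carb) location
counts <- table(gear = mtcars$gear, carb = mtcars$carb)
counts

# Number of observations with gear = 3 and carb = 1
counts["3", "1"]
```

Cells with a count of zero correspond to locations where nothing is drawn, and cells with a count of one correspond to plain points without petals.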

We can change the color of the “petals” by specifying seg.col:

sunflowerplot(mtcars$gear, mtcars$carb,
              seg.col = "blue",
              main = "Plot of carb vs. gear")

By default, the first petal always points up. To rotate each sunflower by a random angle instead, specify rotate = TRUE (set a seed first if the plot needs to be reproducible):

set.seed(1)
sunflowerplot(mtcars$gear, mtcars$carb,
              rotate = TRUE,
              main = "Plot of carb vs. gear")

When given (x, y) values, sunflowerplot counts the number of times each (x, y) value appears to determine the number of petals it needs to draw. It is possible to override this behavior by passing a number argument, which gives the count for each point directly, as the next code snippet shows:

# No random rotation is involved here, so no seed is needed
sunflowerplot(1:10, 1:10, number = 1:10,
              main = "n observations at (n, n)")
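The counting step described above can also be done by hand with xyTable() from grDevices, and the result passed back through number. This is a sketch rather than a description of sunflowerplot's internals; digits = 6 is set explicitly on the assumption that it matches sunflowerplot's default rounding.

```r
# Count how many observations share each (gear, carb) location.
# digits controls how many significant digits are compared when
# deciding whether two points coincide.
xy <- xyTable(mtcars$gear, mtcars$carb, digits = 6)

# One entry per distinct location; the counts add up to nrow(mtcars)
length(xy$number)
sum(xy$number)

# Passing the counts explicitly should give the same picture as
# calling sunflowerplot(mtcars$gear, mtcars$carb)
sunflowerplot(xy$x, xy$y, number = xy$number,
              main = "Plot of carb vs. gear (explicit counts)")
```

Supplying number yourself is useful when the data already arrives in aggregated form, with one row per location and a count column.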

Sunflower plots aren’t the only way to reduce overplotting. Another common technique is jittering, where a small amount of random noise is added to each point so that coincident points separate. The drawback is that the points are no longer plotted at the exact location of the data. The code below shows how you can do this in base R:

set.seed(1)
plot(jitter(mtcars$gear), jitter(mtcars$carb),
     main = "Plot of carb vs. gear")
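How far jittering moves each point can be controlled through jitter's amount argument, which adds uniform noise in the range [-amount, amount]. The sketch below uses amount = 0.1, an arbitrary value chosen for illustration, and measures the largest displacement.

```r
set.seed(1)

# Add uniform noise of at most 0.1 in either direction
gear_j <- jitter(mtcars$gear, amount = 0.1)
carb_j <- jitter(mtcars$carb, amount = 0.1)

# Largest horizontal shift introduced by the jitter
max(abs(gear_j - mtcars$gear))

plot(gear_j, carb_j,
     xlab = "gear (jittered)", ylab = "carb (jittered)",
     main = "Plot of carb vs. gear")
```

A smaller amount keeps points closer to their true values but separates them less, so the choice is a trade-off between fidelity and legibility.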

Sunflower plots will not solve all your overplotting issues. Here is an example (on the diamonds dataset from ggplot2) where a sunflower plot does a horrendous job:

library(ggplot2)
data(diamonds)
sunflowerplot(diamonds$carat, diamonds$price,
              main = "Plot of price vs. carat")
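One way to get a feel for why the sunflower plot struggles here is to count the distinct (carat, price) locations. This sketch assumes ggplot2 is installed (it is only needed for the diamonds data) and uses xyTable() with digits = 6; no specific output is claimed.

```r
library(ggplot2)  # only used for the diamonds dataset

# Tabulate how many observations fall on each distinct location
xy <- xyTable(diamonds$carat, diamonds$price, digits = 6)

# Number of distinct locations, each of which gets its own glyph
length(xy$number)

# Share of locations holding a single observation (drawn without petals)
mean(xy$number == 1)

# Largest number of observations at any one location
max(xy$number)
```

With this many distinct locations packed into one plot, the glyphs themselves overlap, which petals cannot fix.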

A better solution here would be to make the points semi-transparent, so that regions with many overlapping points appear darker. The code snippet below shows how this can be done in base R:

plot(diamonds$carat, diamonds$price,
     col = rgb(0, 0, 0, alpha = 0.05),
     main = "Plot of price vs. carat")
