
Diamonds and Faceting are a Data Scientist’s best Friends

[This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers.]

In the last post of this series, we took a first look at strategies for the effective visualization and exploration of data patterns within large data sets. Namely, we examined ways to overcome overplotting, with a focus on a two-dimensional feature space defined by two continuous features. However, we often want to visualize the distribution of data across several subgroups, for example subgroups defined by the categories of one feature or even of multiple features. As noted in the discussion of overplotting, mapping subgroups onto aesthetics such as color can get quite confusing, especially for larger numbers of subgroups with overlapping distributions. Therefore, in this blog post we are going to explore a different strategy: faceting.

To do so, we are going to use ggplot2’s diamonds data set, which comprises information on the price and other features of almost 54,000 diamonds.

In this post we are taking a look at facet_grid and facet_wrap. The basic functionality of both methods is the generation of small multiples of a basic plot for data subsets, defined by the categories of one feature or the unique combinations of categories of several features. The resulting visualizations make it easy to identify similarities or differences in the patterns of the subsets.

rm(list = ls())

library(ggplot2)
library(dplyr)

# Pooling and relabeling some categories for sake of clarity
df_diamonds <- diamonds %>% 
  mutate(color = ifelse(color == "D" | color == "E" | color == "F", 
                        "Colorless", 
                        "Yellowish"),
         clarity = ifelse(clarity == "I1" | clarity == "SI2" | clarity == "SI1", 
                          "Included", 
                          "Nearly Flawless"))

# Generating the base plot
plot_base <- ggplot(data = df_diamonds) +
  geom_histogram(aes(x = price),
                 color = "#A7256A",  
                 fill = "#A7256A",  
                 alpha = 0.7) +
  theme_minimal() +
  labs(x = "Price in $") 

Content and layout of panels

Of course, the most defining parameters of faceted plots are the subsets considered, which for both facet_grid and facet_wrap can be defined via formula notation.

Basically, facet_grid creates the plot version of a contingency table: a two-dimensional grid of plots. The features mapped onto the rows (left of the ~) and the columns (right of the ~) are specified in the facets argument. Rows as well as columns of the grid can also be defined by the combinations of multiple features, which is indicated by joining all features to be crossed with a +. If either rows or columns are not specified, a . is used as a placeholder.

# facet_grid: cut in rows
# aligned horizontal scales facilitate comparisons of feature on x-axis
plot_base +
  facet_grid(cut ~ .) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-grid-rows.png", width = 11, height = 5)

# facet_grid: cut in columns
# aligned vertical scales facilitate comparisons of feature on y-axis
plot_base +
  facet_grid(facets = . ~ cut) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-grid-colums.png", width = 11, height = 5)

# facet_grid: cut in columns, color in rows
plot_base +
  facet_grid(facets = color ~ cut) +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-grid-colums-and-rows.png", width = 11, height = 5)

# facet_grid: all combinations of cut and clarity in columns, color in rows
plot_base +
  facet_grid(facets = color ~ cut + clarity)+
  ggtitle("Price of Diamonds by Cut, Color and Clarity") +
  scale_x_continuous(breaks = c(5000, 15000)) # fewer x axis breaks
ggsave("facet-grid-colums-and-xrows.png", width = 11, height = 5)
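
The same grids can also be specified without a formula. As a hedged sketch: ggplot2 versions from 3.0.0 onward document the rows and cols arguments of facet_grid, used together with vars(), as the preferred alternative to facets. The example below assumes such a version and uses the unmodified diamonds data. The object names (p, p_formula, p_vars) are illustrative only, and the panel layout is read from the object returned by ggplot_build(), an internal structure that may differ between versions.

```r
library(ggplot2)

# Base histogram on the raw diamonds data; bins is set explicitly
# only to silence the default-bins message
p <- ggplot(diamonds) +
  geom_histogram(aes(x = price), bins = 30)

# Formula notation as used in this post: rows ~ columns
p_formula <- p + facet_grid(color ~ cut)

# Equivalent specification via rows/cols and vars()
# (assumption: ggplot2 >= 3.0.0)
p_vars <- p + facet_grid(rows = vars(color), cols = vars(cut))

# Both specifications should yield the same grid of panels
layout_formula <- ggplot_build(p_formula)$layout$layout
layout_vars    <- ggplot_build(p_vars)$layout$layout

nrow(layout_formula)  # number of panels: 7 colors x 5 cuts
```

Either notation can be used for the examples in this post; the formula version is kept here because it mirrors the contingency-table analogy.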

Just as in contingency tables, marginal total plots combining all data within a given row or column can be added via the margins argument. If margins is set to TRUE, all marginal total plots are enabled. Alternatively, margins can be set to a character vector to enable margin plots only for the specified variables.

# facet_grid: cut in columns, color in rows, margins for cut
plot_base +
  facet_grid(facets = color ~ cut,
             margins = "cut") +
  ggtitle("Price of Diamonds by Cut and Color") +
  scale_x_continuous(breaks = c(5000, 15000)) # fewer x axis breaks
ggsave("facet-grid-one-margin.png", width = 11, height = 5)

# facet_grid: cut in columns, color in rows, all margins
plot_base +
  facet_grid(facets = color ~ cut,
             margins = TRUE) +
  ggtitle("Price of Diamonds by Cut and Color") +
  scale_x_continuous(breaks = c(5000, 15000)) # fewer x axis breaks
ggsave("facet-grid-all-margins.png", width = 11, height = 5)
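
To see what the margins argument adds, the panel layout can be inspected directly. This is a minimal sketch on the unmodified diamonds data. It assumes that the layout table returned by ggplot_build() carries the faceting variables as columns and labels marginal panels "(all)", which is an implementation detail that may change between ggplot2 versions. The object names are illustrative.

```r
library(ggplot2)

p <- ggplot(diamonds) +
  geom_histogram(aes(x = price), bins = 30)

# Margin only for cut: one extra column pooling all cuts
layout_cut <- ggplot_build(
  p + facet_grid(color ~ cut, margins = "cut")
)$layout$layout

# All margins: one extra column and one extra row
layout_all <- ggplot_build(
  p + facet_grid(color ~ cut, margins = TRUE)
)$layout$layout

# Marginal panels show up as an additional "(all)" category
unique(as.character(layout_cut$cut))
unique(as.character(layout_all$color))
```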

Unlike facet_grid, facet_wrap generates a one-dimensional sequence of multiples, which is merely arranged two-dimensionally. To accentuate the intrinsic one-dimensionality of the plot sequence, the features that define the subsets are conventionally combined by + and placed after the ~ in the formula passed to the facets argument.

# facet_wrap: one variable 
plot_base +
  facet_wrap(facets =  ~ cut) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-wrap-one-var.png", width = 11, height = 5)

# facet_wrap: multiple variables 
plot_base +
  facet_wrap(facets = ~ cut + color) +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-multiple-vars.png", width = 11, height = 5)

To arrange the panels most efficiently, the panel layout is kept as close to a square as possible by default. However, the nrow and ncol arguments allow you to specify the number of rows and columns of the layout.

# facet_wrap: multiple variables, nrow defined 
plot_base +
  facet_wrap(facets = ~ cut + color,
             nrow = 2) +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-nrow.png", width = 11, height = 5)

# facet_wrap: multiple variables, ncol defined 
plot_base +
  facet_wrap(facets = ~ cut + color,
             ncol = 3) +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-ncol.png", width = 11, height = 5)
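
The effect of nrow and ncol on the grid dimensions can be checked without drawing anything. The sketch below uses wrap_dims(), a helper exported by ggplot2 that returns the number of rows and columns for a given number of panels. The variable names are illustrative, and the panel count of 10 corresponds to the 5 cuts and 2 pooled color groups used above.

```r
library(ggplot2)

# Number of panels for ~ cut + color with the pooled color groups
n_panels <- 10

# wrap_dims() returns c(rows, columns); the missing dimension is
# filled in so that all panels fit
dims_nrow <- wrap_dims(n_panels, nrow = 2)  # 2 rows, 5 columns
dims_ncol <- wrap_dims(n_panels, ncol = 3)  # 4 rows, 3 columns

dims_nrow
dims_ncol
```

With ncol = 3 the last row is only partly filled, since 10 panels do not divide evenly into 3 columns.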

Further, the overall order of the panels can be defined via the dir argument. If the argument is set to v, panels are arranged column by column, starting from the top of the leftmost column. If the argument is set to h, panels are arranged row by row, starting on the left-hand side of the first row.

# facet_wrap: dir v 
plot_base +
  facet_wrap(facets = ~ cut ,
             dir = "v") +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-wrap-v.png", width = 11, height = 5)

# facet_wrap: dir h 
plot_base +
  facet_wrap(facets = ~ cut ,
             dir = "h") +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-wrap-h.png", width = 11, height = 5)
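
The difference between the two directions can be made explicit by looking up where the second category ends up. This is a hedged sketch on the unmodified diamonds data; it relies on the ROW and COL columns of the layout table returned by ggplot_build(), which is an internal structure and may change between ggplot2 versions. The object names are illustrative.

```r
library(ggplot2)

p <- ggplot(diamonds) +
  geom_histogram(aes(x = price), bins = 30)

layout_v <- ggplot_build(p + facet_wrap(~ cut, dir = "v"))$layout$layout
layout_h <- ggplot_build(p + facet_wrap(~ cut, dir = "h"))$layout$layout

# Position of the second category of cut ("Good")
pos_v <- layout_v[layout_v$cut == "Good", c("ROW", "COL")]
pos_h <- layout_h[layout_h$cut == "Good", c("ROW", "COL")]

pos_v  # vertical: below the first panel, in the first column
pos_h  # horizontal: right of the first panel, in the first row
```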

It is well beyond the scope of this post to discuss all arguments of facet_grid and facet_wrap exhaustively, but we briefly take a look at some more “cosmetic” parameters that concern the position of panel labels and the layout of the panels themselves.

Within facet_grid, the position of panel labels can be controlled via the argument switch. By default, column labels are displayed on top and row labels on the right-hand side. When switch is set to x, y, or both, the column labels are displayed at the bottom, the row labels on the left, or both, respectively.

# facet_grid: switch x
plot_base +
  facet_grid(facets = color ~ cut,
             switch = "x") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-grid-switch-x.png", width = 11, height = 5)

# facet_grid: switch y
plot_base +
  facet_grid(facets = color ~ cut,
             switch = "y") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-grid-switch-y.png", width = 11, height = 5)

For facet_wrap this can be achieved by setting the argument strip.position to top, bottom, left or right.

# facet_wrap: strip.position left
plot_base +
  facet_wrap(facets = ~ cut + color,
             strip.position = "left") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-strpos-left.png", width = 11, height = 5)

# facet_wrap: strip.position bottom
plot_base +
  facet_wrap(facets = ~ cut + color,
             strip.position = "bottom") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-strpos-bottom.png", width = 11, height = 5)

Finally, the argument as.table defines the layout of the multiples. For as.table = TRUE, the panels pertaining to the highest values or highest ranked categories of the categorizing features are positioned at the bottom right (as in a table). For as.table = FALSE, the panels with the highest ranked categories are positioned at the top right (as in a plot).

# facet_grid: as.table
plot_base +
  facet_grid(facets = cut ~ .,
             as.table = TRUE) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-grid-astable.png", width = 11, height = 5)

# facet_grid: not as.table
plot_base +
  facet_grid(facets = cut ~ .,
             as.table = FALSE) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-grid-nottable.png", width = 11, height = 5)

# facet_wrap: as.table
plot_base +
  facet_wrap(facets = ~ cut,
             as.table = TRUE) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-wrap-astable.png", width = 11, height = 5)

# facet_wrap: not as.table
plot_base +
  facet_wrap(facets = ~ cut,
             as.table = FALSE) +
  ggtitle("Price of Diamonds by Cut")
ggsave("facet-wrap-nottable.png", width = 11, height = 5)

Manipulating the scales of panels

Apart from the definition of the contrasted subsets, probably the most important characteristic of faceted plots is the scales of the multiples. By default, the scales of all panels are identical. But depending on the data at hand and the comparison to be made, it might be more insightful to allow some or all scales to vary between panels, thereby accentuating (smaller) particularities of the subsets considered.

The scales argument of facet_grid and facet_wrap, in combination with the options free, free_x or free_y, allows all scales, only the x scales or only the y scales, respectively, to vary between panels. However, within facet_grid all plots within a row must share the same y scale and all plots within a column must share the same x scale, since they share the corresponding axes.

# facet_wrap: free x scale
plot_base +
  facet_wrap(facets = ~ cut + color,
             scales =  "free_x") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-free-x.png", width = 11, height = 5)

# facet_wrap: free y scale
plot_base +
  facet_wrap(facets = ~ cut + color,
             scales =  "free_y") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-free-y.png", width = 11, height = 5)

# facet_wrap: free x and y scale
plot_base +
  facet_wrap(facets = ~ cut + color,
             scales =  "free") +
  ggtitle("Price of Diamonds by Cut and Color")
ggsave("facet-wrap-free.png", width = 11, height = 5)

# facet_grid: free scales
plot_base +
  facet_grid(facets = color ~ cut,
             margins = TRUE,
             scales = "free") +
  ggtitle("Price of Diamonds by Cut and Color") +
  scale_x_continuous(breaks = c(5000, 15000)) # fewer x axis breaks
ggsave("facet-grid-scale-free.png", width = 11, height = 5)
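
The constraint on free scales in facet_grid can be verified from the panel layout. The following sketch uses the unmodified diamonds data and assumes that the layout table returned by ggplot_build() contains the columns SCALE_X and SCALE_Y, which index the scale each panel uses. This is an internal detail of ggplot2 and may change between versions. The object names are illustrative.

```r
library(ggplot2)

p <- ggplot(diamonds) +
  geom_histogram(aes(x = price), bins = 30)

layout_wrap <- ggplot_build(
  p + facet_wrap(~ cut, scales = "free")
)$layout$layout

layout_grid <- ggplot_build(
  p + facet_grid(color ~ cut, scales = "free")
)$layout$layout

# facet_wrap: every panel gets its own x scale
n_x_scales_wrap <- length(unique(layout_wrap$SCALE_X))

# facet_grid: one x scale per column and one y scale per row
x_by_column <- all(layout_grid$SCALE_X == layout_grid$COL)
y_by_row    <- all(layout_grid$SCALE_Y == layout_grid$ROW)

n_x_scales_wrap
x_by_column
y_by_row
```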

While the functionality of the scales argument is constrained within facet_grid, the function offers an additional argument: space. When set to free, the width of each column and the height of each row vary in proportion to the range of the scale in the respective position.

# facet_grid: free space and free scale
plot_base +
  facet_grid(facets = color ~ cut,
             margins = TRUE,
             scales = "free",
             space = "free") +
  ggtitle("Price of Diamonds by Cut and Color") +
  scale_x_continuous(breaks = c(5000, 15000)) # fewer x axis breaks
ggsave("facet-grid-scale-space-free.png", width = 11, height = 5)

Faceting can be a powerful tool to facilitate the comparison of patterns within subsets of one's data. Especially since ggplot2 makes faceting so convenient, one should always keep this option in mind.

About the author

Lea Waniek

Lea is a member of the data science team and also provides support in the area of statistics.

The post Diamonds and Faceting are a Data Scientist's best Friends first appeared on STATWORX.