Scatterplot matrices (pair plots) with cdata and ggplot2

Nina Zumel

3 years ago

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].

In my previous post, I showed how to use the cdata package along with ggplot2's faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata to produce a ggplot2 version of a scatterplot matrix, or pairs plot?

A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base R graphics, you would use the pairs() function. Here is the base version of the pairs plot of the iris dataset:

pairs(iris[1:4], 
      main = "Anderson's Iris Data -- 3 species",
      pch = 21, 
      bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])

There are other ways to do this, too:

# not run

library(ggplot2)
library(GGally)
ggpairs(iris, columns=1:4, aes(color=Species)) + 
  ggtitle("Anderson's Iris Data -- 3 species")

library(lattice)
splom(iris[1:4], 
      groups=iris$Species, 
      main="Anderson's Iris Data -- 3 species")

But I wanted to see if cdata was up to the task. So here we go.

First, load the packages:

library(ggplot2)
library(cdata)

To create the pairs plot in ggplot2, I need to reshape the data appropriately. For cdata, I need to specify what shape I want the data to be in, using a control table. See the last post for how the control table works. For this task, creating the control table is slightly more involved.

Here, I specify the variables I want to plot.

meas_vars <- colnames(iris)[1:4]

The function expand.grid() returns a data frame of all combinations of its arguments; in this case, I want all pairs of variables.

# the data.frame() call strips the attributes from
# the frame returned by expand.grid()
controlTable <- data.frame(expand.grid(meas_vars, meas_vars, 
                                       stringsAsFactors = FALSE))
# rename the columns
colnames(controlTable) <- c("x", "y")

# add the key column
controlTable <- cbind(
  data.frame(pair_key = paste(controlTable[[1]], controlTable[[2]]),
             stringsAsFactors = FALSE),
  controlTable)

controlTable

##                     pair_key            x            y
## 1  Sepal.Length Sepal.Length Sepal.Length Sepal.Length
## 2   Sepal.Width Sepal.Length  Sepal.Width Sepal.Length
## 3  Petal.Length Sepal.Length Petal.Length Sepal.Length
## 4   Petal.Width Sepal.Length  Petal.Width Sepal.Length
## 5   Sepal.Length Sepal.Width Sepal.Length  Sepal.Width
## 6    Sepal.Width Sepal.Width  Sepal.Width  Sepal.Width
## 7   Petal.Length Sepal.Width Petal.Length  Sepal.Width
## 8    Petal.Width Sepal.Width  Petal.Width  Sepal.Width
## 9  Sepal.Length Petal.Length Sepal.Length Petal.Length
## 10  Sepal.Width Petal.Length  Sepal.Width Petal.Length
## 11 Petal.Length Petal.Length Petal.Length Petal.Length
## 12  Petal.Width Petal.Length  Petal.Width Petal.Length
## 13  Sepal.Length Petal.Width Sepal.Length  Petal.Width
## 14   Sepal.Width Petal.Width  Sepal.Width  Petal.Width
## 15  Petal.Length Petal.Width Petal.Length  Petal.Width
## 16   Petal.Width Petal.Width  Petal.Width  Petal.Width

The control table specifies that for every row of iris, sixteen new rows get produced, one for each possible pair of variables. The column pair_key will be the key column in the new data frame; there’s one key level for every possible pair of variables. The columns x and y will be the value columns in the new data frame — these will be the columns that we plot.
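As a quick sanity check on the control table, here is a small sketch in base R. The checks and the names n_pairs, keys_unique and cells_ok are my own illustration, not part of cdata: they confirm that there is one row per ordered pair of variables, that the keys are unique, and that every cell names a real column of iris.

```r
meas_vars <- colnames(iris)[1:4]

# rebuild the control table as above
controlTable <- data.frame(expand.grid(meas_vars, meas_vars,
                                       stringsAsFactors = FALSE))
colnames(controlTable) <- c("x", "y")
controlTable <- cbind(
  data.frame(pair_key = paste(controlTable$x, controlTable$y),
             stringsAsFactors = FALSE),
  controlTable)

# one row per ordered pair of variables: 4 * 4 = 16
n_pairs <- nrow(controlTable)

# every key should identify exactly one (x, y) pair
keys_unique <- !any(duplicated(controlTable$pair_key))

# the value cells must all name real columns of iris
cells_ok <- all(unlist(controlTable[c("x", "y")]) %in% colnames(iris))
```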

Now I can create the new data frame, using rowrecs_to_blocks(). I’ll also carry along the Species column to color the points in the plot.

iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = "Species")

head(iris_aug)

##   Species                  pair_key   x   y
## 1  setosa Sepal.Length Sepal.Length 5.1 5.1
## 2  setosa  Sepal.Width Sepal.Length 3.5 5.1
## 3  setosa Petal.Length Sepal.Length 1.4 5.1
## 4  setosa  Petal.Width Sepal.Length 0.2 5.1
## 5  setosa  Sepal.Length Sepal.Width 5.1 3.5
## 6  setosa   Sepal.Width Sepal.Width 3.5 3.5

Note that the data is now sixteen times larger (2,400 rows instead of iris's 150), which I admit is perverse.
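To make the reshaping concrete, here is a rough base-R sketch of what the control table asks for. It is an illustration only, not how cdata implements rowrecs_to_blocks(). The helper name reshape_by_pairs and the result name iris_long are hypothetical, and the rows come out grouped by pair rather than by original record, so the row order differs from the cdata result.

```r
# hypothetical helper, for illustration only:
# for each row of the control table, pull out the named
# x and y columns and tag them with that row's key
reshape_by_pairs <- function(d, controlTable, columnsToCopy) {
  blocks <- lapply(seq_len(nrow(controlTable)), function(i) {
    data.frame(
      d[, columnsToCopy, drop = FALSE],
      pair_key = controlTable$pair_key[[i]],
      x = d[[controlTable$x[[i]]]],
      y = d[[controlTable$y[[i]]]],
      stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, blocks)
  rownames(out) <- NULL
  out
}

meas_vars <- colnames(iris)[1:4]
controlTable <- expand.grid(x = meas_vars, y = meas_vars,
                            stringsAsFactors = FALSE)
controlTable$pair_key <- paste(controlTable$x, controlTable$y)

iris_long <- reshape_by_pairs(iris, controlTable, "Species")

# sixteen blocks of nrow(iris) rows each
dim(iris_long)
```

Each of the sixteen blocks is a full copy of the rows of iris, which is where the sixteen-fold growth comes from.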

If I didn’t care about how the individual subplots were arranged, I’d be done: I’d plot y vs x, and facet_wrap on pair_key. But I want the subplots arranged in a grid. To do this I use facet_grid, which will require two key columns:

splt <- strsplit(iris_aug$pair_key, split = " ", fixed = TRUE)
iris_aug$xv <- vapply(splt, function(si) si[[1]], character(1))
iris_aug$yv <- vapply(splt, function(si) si[[2]], character(1))
head(iris_aug)

##   Species                  pair_key   x   y           xv           yv
## 1  setosa Sepal.Length Sepal.Length 5.1 5.1 Sepal.Length Sepal.Length
## 2  setosa  Sepal.Width Sepal.Length 3.5 5.1  Sepal.Width Sepal.Length
## 3  setosa Petal.Length Sepal.Length 1.4 5.1 Petal.Length Sepal.Length
## 4  setosa  Petal.Width Sepal.Length 0.2 5.1  Petal.Width Sepal.Length
## 5  setosa  Sepal.Length Sepal.Width 5.1 3.5 Sepal.Length  Sepal.Width
## 6  setosa   Sepal.Width Sepal.Width 3.5 3.5  Sepal.Width  Sepal.Width
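Splitting on a space works here because none of the iris column names contain one. A sketch of an alternative that avoids parsing the key at all, assuming the control table is still available, is to look the variable names up by key with match(). The short keys vector below is a stand-in for iris_aug$pair_key, so that the example runs on its own.

```r
meas_vars <- colnames(iris)[1:4]
controlTable <- expand.grid(x = meas_vars, y = meas_vars,
                            stringsAsFactors = FALSE)
controlTable$pair_key <- paste(controlTable$x, controlTable$y)

# stand-in for iris_aug$pair_key
keys <- c("Sepal.Width Sepal.Length", "Petal.Width Petal.Length")

# find each key's row in the control table, then read off x and y
idx <- match(keys, controlTable$pair_key)
xv <- controlTable$x[idx]
yv <- controlTable$y[idx]
```

The lookup gives the same xv and yv as the strsplit() approach for these keys, and would keep working if a column name happened to contain the separator.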

And now I can produce the graph, using facet_grid.

# reorder the key columns to be the same order
# as the base version above
iris_aug$xv <- factor(as.character(iris_aug$xv),
                      levels = meas_vars)
iris_aug$yv <- factor(as.character(iris_aug$yv),
                      levels = meas_vars)


ggplot(iris_aug, aes(x=x, y=y)) +
  geom_point(aes(color=Species, shape=Species)) + 
  facet_grid(yv~xv, labeller = label_both, scales = "free") +
  ggtitle("Anderson's Iris Data -- 3 species") +
  scale_color_brewer(palette = "Dark2") +
  ylab(NULL) + 
  xlab(NULL)

This pair plot has x = y plots on the diagonals instead of the names of the variables, but you can confirm that it is otherwise the same as the pair plot produced by pairs().

Of course, calling pairs() (or ggpairs(), or splom()) is a lot easier than all this, but now I’ve proven to myself that cdata with ggplot2 can do the job. This version does have a few advantages. It comes with a legend by default, which is nice. And it’s not obvious how to change the color palette in ggpairs() — I prefer the Brewer Dark2 palette, myself.

Luckily, this code is straightforward to wrap as a function, so if you like the cdata version, I’ve now added the PairPlot() function to WVPlots. Now it’s a one-liner, too.

library(WVPlots) 

PairPlot(iris, 
         colnames(iris)[1:4], 
         "Anderson's Iris Data -- 3 species", 
         group_var = "Species")
