Scatterplot matrices (pair plots) with cdata and ggplot2
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In my previous post, I showed how to use cdata
package along with ggplot2
‘s faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata
to produce a ggplot2
version of a scatterplot matrix, or pairs plot?
A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base plot, you would use the pairs()
function. Here is the base version of the pairs plot of the iris
dataset:
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])
There are other ways to do this, too:
# not run library(ggplot2) library(GGally) ggpairs(iris, columns=1:4, aes(color=Species)) + ggtitle("Anderson's Iris Data -- 3 species") library(lattice) splom(iris[1:4], groups=iris$Species, main="Anderson's Iris Data -- 3 species")
But I wanted to see if cdata
was up to the task. So here we go….
First, load the packages:
library(ggplot2) library(cdata)
To create the pairs plot in ggplot2
, I need to reshape the data appropriately. For cdata
, I need to specify what shape I want the data to be in, using a control table. See the last post for how the control table works. For this task, creating the control table is slightly more involved.
Here, I specify the variables I want to plot.
meas_vars <- colnames(iris)[1:4]
The function expand_grid()
returns a data frame of all combinations of its arguments; in this case, I want all pairs of variables.
# the data.frame() call strips the attributes from # the frame returned by expand.grid() controlTable <- data.frame(expand.grid(meas_vars, meas_vars, stringsAsFactors = FALSE)) # rename the columns colnames(controlTable) <- c("x", "y") # add the key column controlTable <- cbind( data.frame(pair_key = paste(controlTable[[1]], controlTable[[2]]), stringsAsFactors = FALSE), controlTable) controlTable
## pair_key x y ## 1 Sepal.Length Sepal.Length Sepal.Length Sepal.Length ## 2 Sepal.Width Sepal.Length Sepal.Width Sepal.Length ## 3 Petal.Length Sepal.Length Petal.Length Sepal.Length ## 4 Petal.Width Sepal.Length Petal.Width Sepal.Length ## 5 Sepal.Length Sepal.Width Sepal.Length Sepal.Width ## 6 Sepal.Width Sepal.Width Sepal.Width Sepal.Width ## 7 Petal.Length Sepal.Width Petal.Length Sepal.Width ## 8 Petal.Width Sepal.Width Petal.Width Sepal.Width ## 9 Sepal.Length Petal.Length Sepal.Length Petal.Length ## 10 Sepal.Width Petal.Length Sepal.Width Petal.Length ## 11 Petal.Length Petal.Length Petal.Length Petal.Length ## 12 Petal.Width Petal.Length Petal.Width Petal.Length ## 13 Sepal.Length Petal.Width Sepal.Length Petal.Width ## 14 Sepal.Width Petal.Width Sepal.Width Petal.Width ## 15 Petal.Length Petal.Width Petal.Length Petal.Width ## 16 Petal.Width Petal.Width Petal.Width Petal.Width
The control table specifies that for every row of iris
, sixteen new rows get produced, one for each possible pair of variables. The column pair_key
will be the key column in the new data frame; there’s one key level for every possible pair of variables. The columns x
and y
will be the value columns in the new data frame — these will be the columns that we plot.
Now I can create the new data frame, using rowrecs_to_blocks()
. I’ll also carry along the Species
column to color the points in the plot.
iris_aug = rowrecs_to_blocks( iris, controlTable, columnsToCopy = "Species") head(iris_aug)
## Species pair_key x y ## 1 setosa Sepal.Length Sepal.Length 5.1 5.1 ## 2 setosa Sepal.Width Sepal.Length 3.5 5.1 ## 3 setosa Petal.Length Sepal.Length 1.4 5.1 ## 4 setosa Petal.Width Sepal.Length 0.2 5.1 ## 5 setosa Sepal.Length Sepal.Width 5.1 3.5 ## 6 setosa Sepal.Width Sepal.Width 3.5 3.5
Note that the data is now sixteen times larger, which I admit is perverse.
If I didn’t care about how the individual subplots were arranged, I’d be done: I’d plot y
vs x
, and facet_wrap
on pair_key
. But I want the subplots arranged in a grid. To do this I use facet_grid
, which will require two key columns:
splt <- strsplit(iris_aug$pair_key, split = " ", fixed = TRUE) iris_aug$xv <- vapply(splt, function(si) si[[1]], character(1)) iris_aug$yv <- vapply(splt, function(si) si[[2]], character(1)) head(iris_aug)
## Species pair_key x y xv yv ## 1 setosa Sepal.Length Sepal.Length 5.1 5.1 Sepal.Length Sepal.Length ## 2 setosa Sepal.Width Sepal.Length 3.5 5.1 Sepal.Width Sepal.Length ## 3 setosa Petal.Length Sepal.Length 1.4 5.1 Petal.Length Sepal.Length ## 4 setosa Petal.Width Sepal.Length 0.2 5.1 Petal.Width Sepal.Length ## 5 setosa Sepal.Length Sepal.Width 5.1 3.5 Sepal.Length Sepal.Width ## 6 setosa Sepal.Width Sepal.Width 3.5 3.5 Sepal.Width Sepal.Width
And now I can produce the graph, using facet_grid
.
# reorder the key columns to be the same order # as the base version above iris_aug$xv <- factor(as.character(iris_aug$xv), meas_vars) iris_aug$yv <- factor(as.character(iris_aug$yv), meas_vars) ggplot(iris_aug, aes(x=x, y=y)) + geom_point(aes(color=Species, shape=Species)) + facet_grid(yv~xv, labeller = label_both, scale = "free") + ggtitle("Anderson's Iris Data -- 3 species") + scale_color_brewer(palette = "Dark2") + ylab(NULL) + xlab(NULL)
This pair plot has x = y
plots on the diagonals instead of the names of the variables, but you can confirm that it is otherwise the same as the pair plot produced by pairs()
.
Of course, calling pairs()
(or ggpairs()
, or splom()
) is a lot easier than all this, but now I’ve proven to myself that cdata
with ggplot2
can do the job. This version does have a few advantages. It comes with a legend by default, which is nice. And it’s not obvious how to change the color palette in ggpairs()
— I prefer the Brewer Dark2 palette, myself.
Luckily, this code is straightforward to wrap as a function, so if you like the cdata
version, I’ve now added the PairPlot()
function to WVPlots
. Now it’s a one-liner, too.
library(WVPlots) PairPlot(iris, colnames(iris)[1:4], "Anderson's Iris Data -- 3 species", group_var = "Species")
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.