Conditioning and Grouping with Lattice Graphics
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Conditioning and grouping are two important concepts in graphing that allow us to rapidly refine our understanding of data under consideration. Conditioning, in particular, allows us to view relationships across “panels” with common scales. Each panel contains a plot whose data is “conditional” upon records drawn from the category that supports that particular panel (an example will help make this clear). Grouping allows us to view interactions within a given panel by making distinctions using different plot characters, size, or color. These ideas have been discussed in Cleveland’s book “Visualizing Data”. The Lattice, (aka “Trellis”), graphics package attempts to implement the concepts and values indicated in Cleveland’s book. These ideas can help us find interesting patterns in our data.
However, it is important to point out that all graphics packages available within R, (Base, Lattice, ggplot2, Grid), have their respective strengths and good work can be accomplished using any of them although for the newcomer Base graphics is a reasonable starting point. There is an abundance of information on Base graphics both in printed literature and via Google results so ample support is available. As a newcomer to Base graphics it doesn’t take long before you are creating graphs like that in Figure 1, which is an approximation of a conditioned plot. Strictly speaking it does not adhere to the principles of conditioned plots but its useful to see how close we can get. We have a three “panel” plot of MPG vs Weight from the mtcars data frame which is built-in to R. Each panel contains a scatter plot between MPG vs wt for each cylinder group, (4, 6, or 8). The paneling allows us to conveniently compare the MPG vs Wt relationship across groups.
par(mfrow=c(1,3)) cyls <- split(mtcars, mtcars$cyl) for (ii in 1:length(cyls)) { tmpdf <- cyls[[ii]] sname <- names(cyls)[ii] plot(tmpdf$wt, tmpdf$mpg, main = paste("MPG vs Wt",sname,"Cyl"), ylim = c(0,40), xlab = "Wt / 1,000", ylab = "MPG", pch=19, col="blue") grid() }
We see that the higher weights correspond to records with lower a MPG figure. This idea is confirmed by the separate panel plots wherein we see the lower MPG for increasing cylinder count, which is is quite likely correlated with the weight of the automobile. (It is in fact with a correlation coefficient of 0.78). In any case the data in mtcars isn’t terribly difficult to understand, (though that’s not really the point here – we are simply trying to come to terms with how to plot data). We split up the data frame and looped through the resulting list and plotted MPG vs Wt for each subgroup. It is not a bad approach and is one that lends itself well to programming. On the other hand let’s check out how this kind of thing might be done using lattice graphics, which was designed in large part to make conditioned plots really easy. Let’s put it to the test.
library(lattice) xyplot(mpg~wt | factor(cyl), data=mtcars, pch=19, main="MPG vs Wt", xlab="Wt/1,000", ylab="MPG",layout=c(3,1),type=c("p","g"))
What’s Your Condition ?
This isn’t bad for one line (two if you include the library) statement. Let’s break down what is happening in Figure 2. The “mpg~wt” part of the call to xyplot should be familiar as a formula where mpg represents the y-axis and wt represents the x-axis. The formula interface is commonly used in other R procedures such as lm as well as aggregation functions so if this is new to you then it is important to learn about it.
The part of the call to xyplot that newcomers might not understand is the vertical bar character. The presence of this character indicates that what follows is a “conditioning” variable which is typically a category/factor. It also let’s us know that the resulting plot will contain a number of panels the number of which will correspond to the number of unique values assumed by the conditioning variable(s). In the case of mtcars$cyl this will be three since the unique values assumed by cylinder are 4,6,8. However, we really don’t have to know the number of values in advance. We can just plot it and view the defaults after which we can go in and add the “layout” argument to refine the visual presentation.
unique(mtcars$cyl) [1] 6 4 8
But it gets better in that we can use more than one conditioning character at a time. We can look at a combination of factors in effect looking at the MPG vs Wt relationship as “cells” of a table. Sound complicated ? Not really. Check this out. We are asking xyplot to present MPG vs wt for each combination of Transmission and Cylinder type. This results in a 2 x 3 grid structure wherein the panels contain a scatterplot.
xyplot(mpg~wt | factor(cyl) + factor(am,labels=c("A","M")), data=mtcars, main="MPG vs Wt", xlab="Wt/1,000", ylab="MPG",pch=19,type=c("p","g"))
Notice how lattice takes care of the “paneling” for us and supplies labels based on the given factor combinations. So we have a row for each Transmission type, (automatic or manual) each of which contains three columns/panels corresponding to a cylinder group of 4,6, or 8. This plot makes it trivial to determine what combination gets the best gas mileage assuming that is what we are looking for. The 4 Cylinder cars with Manual transmissions get the best gas mileage. Note that we could also further influence the layout and look of the graph and make adjustments to the “strip” labels in each panel but since this is an introductory article we’ll focus on general capabilities for now.
There is Safety in Groups !
Are there other ways that we enhance the graph to reveal more meaning ? In fact there is. In particular we could employ the idea of “grouping”. However, before apply grouping to the above example, (Figure 3), let’s look at a stand-a-lone case to make sure you understand the concept. Suppose we want to look at the MPG vs wt graph within a single panel though we wish to see points colored according to the Transmission group they occupy. This is easy to accomplish with lattice by using the “groups” argument.
xyplot(mpg~wt,data=mtcars,groups=factor(am,labels=c("A","M")), pch=20,auto.key=list(columns=2),type=c("p","g"))
So that was pretty easy wasn’t it ? Instead of viewing the interaction across panels we can see it within a single panel by using the different colors to partition the data into groups. Now let’s say that we wanted the points to reflect the cylinder group they occupy except instead of using color we’ll use point size to indicate the relationship. Here we use a vector from 3 down to 1 in conjunction with the cex argument to reflect the size of the points coming from, respectively, cylinder groups 4, 6, and 8. Points coming from cylinder group 4 are going to be larger than points coming from groups 6 and 8. The MPG is better with 4 cylinder autos and we want that reflected in terms of larger size.
Now if we turn our attention back to the situation wherein we used two factors to condition the MPG vs Wt plots we should be able to combine grouping with conditioning to see if there is anything lurking in the data that we might have missed previously. So what I’ll do here is plot MPG vs Wt as conditioned by Cylinders while grouping by Transmission.
xyplot(mpg ~ wt | factor(cyl), data=mtcars, pch=16,layout=c(3,1), groups = factor(am,labels=c("Auto","Manual")), type=c("p","g"),auto.key = TRUE)
Do we see anything interesting here ? Perhaps. If you have a 4 cylinder car and a manual transmission then you get the best gas mileage. This isn’t anything new. However, if we look at the cars in the 8 cylinder panel we see that cars with a manual transmission, (there are only two of them), don’t seem to perform as well relative to MPG as they did in the 4 cylinder group. In the 6 cylinder panel we see that auto and manual transmissions are closer in MPG than in the 4 cylinder group. So we might form the hypothesis that having a manual transmission is related to significantly better gas mileage only if you have a 4 cylinder car. Or maybe we just need to find more 8 and 6 cylinder cars with manual transmissions to see if the pattern holds. It is important to note that these cars were surveyed in 1974 so our findings would probably change were we to use updated information. Well this is all cool stuff but for now we’ll stick with the graphics details instead of further analysis.
Out of curiosity what would this example look like in Base graphics ? Here is one possibility. I’ll let you make the decision if this is easier than using lattice and if the result is as aesthetically pleasing. Base graphics are amazingly flexible but require more of a programming approach in this case than does lattice although lattice panel functions, (something I will explore in another Blog post), can also require a “programmer’s mentality”.
par(mfrow=c(1,3)) mysplits = split(mtcars,mtcars$cyl) maxmpg = max(mtcars$mpg) for (ii in 1:length(mysplits)) { tmpdf <- mysplits[[ii]] auto <- tmpdf[tmpdf$am == 0,] man <- tmpdf[tmpdf$am == 1,] plot(tmpdf$wt, tmpdf$mpg,type="n", main=paste(names(mysplits[ii])," Cylinders"), ylim=c(0,maxmpg), xlab="wt",ylab="MPG") points(auto$wt,auto$mpg,col="blue",pch=19) points(man$wt,man$mpg,col="pink",pch=19) grid() legend("topright", inset=0.05, c("manual","auto"), pch = 19, col=c("pink","blue")) }
Well I haven’t been completely forthright here. Base graphics does in fact have a function that is dedicated to conditioned plotting although it doesn’t generalize to other plot types. It does, however, handle X/Y plots quite well. I include it here because it does conform to the principles of conditioned plotting but if you are going to embrace the formal approach for conditioning then you might as well use lattice since it does generalize across various plot types.
coplot(mpg~wt|factor(cyl)+factor(am),data=mtcars)
Other Lattice Functions
Okay, you might be wondering if the arguments to xyplot will “carry over” to other lattice plotting functions. The answer is “yes” although they must make sense for the given chart type. For example with histograms or boxplots we are looking at the plot of a single continuous quantity but we can still “condition” that continuous quantity using factors from the dataset. In this next example we look at another internal dataset to R called “ChickWeight”. We’ll present a histogram of the weights of 50 chickens, measured over time, as conditioned by the type of diet they were provided. There are 4 diet types hence we will see 4 panels. All of the considerations important to lattice graphics are intact – common axes, optimal use of space, shared margins, and panels to indicate conditions.
histogram(~weight|factor(Diet),data=ChickWeight, main="Weight by Diet Type",xlab="Weight in grams")
So checkout this example. We create a boxplot of chicken weights conditioned on the passage of time since birth which is measured in days (0,2,…,21).
bwplot(~weight|factor(Time),data=ChickWeight,col="blue", main="Weight by Days Since Birth",xlab="Weight in grams")
Conclusions
So we’ve really only scratched the surface here of what can be done with conditioning and grouping although I feel that this is a reasonable introduction. As I’ve mentioned previously, Base graphics has lots of flexibility and if you become fluent with the high and low level functions offered as part of that environment you can accomplish a lot. Admittedly, if you do not have a programming background then writing looping structures to process data can be a bit intimidating but its worth it to learn. The examples I presented with lattice are to show how convenient it is to do conditioning and grouping but then again if you want to do more customization then that will involve learning a bit more about lattice. So at some point you find yourself in the same position that you would be with Base graphics. It all depends on how detailed and custom you wish your plots to be. The ggplot2 package, which I will discuss in another Blog post, can also handle these activities quite nicely though the jargon is a bit different.
Independently of what graphics package you choose to employ, it is important to realize that flexibility does not always imply ease of use. Like any software package there is absolutely no substitute for experience and getting your hands dirty. R is no exception.
Filed under: Graphics Tagged: lattice graphics, R Programming
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.