
Alluvial diagrams

[This article was first published on Brokering Closure » R, and kindly contributed to R-bloggers.]

A parallel coordinates plot is one of the tools for visualizing multivariate data. Every observation in a dataset is represented by a polyline that crosses a set of parallel axes corresponding to the variables in the dataset. You can create such plots in R using the function parcoord in package MASS. For example, we can create such a plot for the built-in dataset mtcars:

library(MASS)
library(colorRamps)
 
data(mtcars)
k <- blue2red(100)
x <- cut( mtcars$mpg, 100)
 
op <- par(mar=c(3, rep(.1, 3)))
parcoord(mtcars, col=k[as.numeric(x)])
par(op)

This produces the plot below. The lines are colored using a blue-to-red color ramp according to the miles-per-gallon variable.
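The color assignment is the only non-obvious step above: cut() splits the range of mpg into 100 equal-width bins, and the bin number is used as an index into the 100-color ramp. Here is a minimal base-R sketch of just that step. Note that colorRampPalette is my stand-in for colorRamps::blue2red, so the exact shades will differ from the plot.

```r
# Sketch of the binning-to-color step using base R only.
# Assumption: colorRampPalette() replaces colorRamps::blue2red().
data(mtcars)
n_bins <- 100
pal <- colorRampPalette(c("blue", "red"))(n_bins)  # 100 colors from blue to red
bin <- cut(mtcars$mpg, n_bins)                     # factor with 100 interval levels
idx <- as.numeric(bin)                             # bin number, 1..100
line_col <- pal[idx]                               # one color per car

# Cars sorted by mpg, with the bin and color each one receives
tab <- data.frame(mpg = mtcars$mpg, bin = idx, col = line_col)
head(tab[order(tab$mpg), ], 3)
```

The cars with the lowest mpg fall into the first bin (blue end) and the ones with the highest mpg into the last bin (red end).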

What to do if some of the variables are categorical? One approach is to use polylines with different widths. Another approach is to add some random noise (jitter) to the values. The Titanic data is a cross-classification of Titanic passengers according to class, gender, age, and survival status (survived or not). Consequently, all variables are categorical. Let’s try the jittering approach. After converting the cross-classification (an R table) to a data frame we “blow it up” by repeating observations according to their frequency in the table.

library(RColorBrewer)   # provides brewer.pal

data(Titanic)
# convert to data frame of numeric variables
titdf <- as.data.frame(lapply(as.data.frame(Titanic), as.numeric))
# repeat obs. according to their frequency
titdf2 <- titdf[ rep(1:nrow(titdf), titdf$Freq) , ]
# new columns with jittered values
titdf2[,6:9] <- lapply(titdf2[,1:4], jitter)
# colors according to survival status, with some transparency
k <- adjustcolor(brewer.pal(3, "Set1")[titdf2$Survived], alpha.f=.2)
op <- par(mar=c(3, 1, 1, 1))
parcoord(titdf2[,6:9], col=k)
par(op)

This produces the following (red lines are for passengers who did not survive):

It is not so easy to read, is it? Did the majority of 1st class passengers (bottom category on the leftmost axis) survive or not? Definitely most of the women from that class did, but in aggregate?
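The aggregate question can of course be answered numerically, which is a useful check on whatever a plot seems to suggest. A small sketch using only base R (the object names tit and by_class are mine):

```r
# Count passengers by class and survival, summing over sex and age
data(Titanic)
tit <- as.data.frame(Titanic)
by_class <- xtabs(Freq ~ Class + Survived, data = tit)
by_class

# Row-wise shares: fraction of each class that did / did not survive
round(prop.table(by_class, margin = 1), 2)
```

The point of the rest of this post is to get a plot from which such comparisons can be read directly.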

At this point it would be nice, instead of drawing a bunch of lines, to draw segments for different groups of passengers. Later I learned that such a plot exists and even has a name: the alluvial diagram. Alluvial diagrams seem to be related to Sankey diagrams blogged about on R-bloggers recently, e.g. here. What is more, I was not alone in thinking about how to create such a thing with R, see for example here. Later I found that what I need is a “parallel set” plot, as it was called, and implemented, on CrossValidated here. That looks terrific to me; nevertheless, I would still prefer to:

And so I wrote a prototype function alluvial (tadaaa!), now in the package alluvial on GitHub. I relied strongly on code by Aaron from his answer on CrossValidated (hat tip).

See the following examples of using alluvial on Titanic data:

First, using just two variables, Class and Survived, and with stripes being simple polygons.

This was produced with the code below.

# load packages and prepare data
library(alluvial)
tit <- as.data.frame(Titanic)
 
# only two variables: class and survival status
tit2d <- aggregate( Freq ~ Class + Survived, data=tit, sum)
 
alluvial( tit2d[,1:2], freq=tit2d$Freq, xw=0.0, alpha=0.8,
         gap.width=0.1, col= "steelblue", border="white",
         layer = tit2d$Survived != "Yes" )

The function accepts data as a (collection of) vectors or as data frames. The xw argument specifies the position of the knots of the xspline relative to the axes. If positive, the knot is further away from the axis, which will make the stripes go horizontal longer before turning towards the other axis. The gap.width argument specifies the distance between categories on the axes.
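To get a feel for xw and gap.width it helps to draw the same data twice with different settings. The sketch below assumes the alluvial package is installed (the code is skipped otherwise); the particular values 0.2 and 0.05 are arbitrary choices for illustration.

```r
# Same two-variable data, drawn with two different settings of xw and gap.width
tit <- as.data.frame(Titanic)
tit2d <- aggregate(Freq ~ Class + Survived, data = tit, sum)

if (requireNamespace("alluvial", quietly = TRUE)) {
  library(alluvial)
  # knots on the axes: stripes turn immediately
  alluvial(tit2d[, 1:2], freq = tit2d$Freq, xw = 0, gap.width = 0.1)
  # knots moved away from the axes, smaller gaps between categories
  alluvial(tit2d[, 1:2], freq = tit2d$Freq, xw = 0.2, gap.width = 0.05)
}
```

With the larger xw the stripes should stay horizontal for longer next to each axis before bending towards the other one.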

Another example shows the whole Titanic data. Red stripes are for those who did not survive.

Now it's possible to see that, e.g.:

The plot was produced with:

alluvial( tit[,1:4], freq=tit$Freq, border=NA,
         hide = tit$Freq < quantile(tit$Freq, .50),
         col=ifelse( tit$Survived == "No", "red", "gray") )

In this variant the stripes have no borders, color transparency is at 0.5, and for the purpose of the example the plot shows only the “thickest” 50% of the stripes (argument hide).
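The hide argument is just a logical vector with one element per row of the data, so it can be inspected before plotting. A small base-R sketch (the names cutoff and hide are mine) that shows how many stripes the condition above drops and what share of passengers remains visible:

```r
# Inspect the logical mask passed as `hide` in the call above
tit <- as.data.frame(Titanic)
cutoff <- quantile(tit$Freq, 0.50)    # median frequency over the 32 combinations
hide <- tit$Freq < cutoff             # TRUE for stripes that will not be drawn

sum(hide)                             # number of hidden stripes
sum(tit$Freq[!hide]) / sum(tit$Freq)  # share of passengers still shown
```

Since the hidden stripes are the thinnest ones, they account for far fewer passengers than their number might suggest.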

As compared to the parallel set solution mentioned earlier, the main differences are:

If you have suggestions or ideas for extensions/modifications, let me know on Github!

Stay tuned for more examples from panel data.

