Row-wise summary curves in faceted ggplot2 figures

brobar

9 years ago

[This article was first published on The Geokook. » R, and kindly contributed to R-bloggers.]

I really enjoy reading the Junk Charts blog. A recent post made me wonder how easy it would be to add summary curves for small-multiple type plots, assuming the “small multiples” to summarize were the X component of a ggplot2::facet_grid(Y ~ X) layer. In other words, how could I plot the same summary curve across each row of the faceted plot?

First we need some data. I have been working on a spectrum estimation tool with Robert Parker, and ran some benchmark tests of the core function against a function with similar functionality, namely spec.mtm in the multitaper package.

The benchmarking was done using the rbenchmark package. In short, I generate an auto-regressive simulation using arima.sim(list(order = c(1,1,0), ar = 0.9), n), and then benchmark the functions for incremental increases in n (the length of the simulated set); here is the resulting information as an R-data file. (I’m not showing the code used to produce the data, but if you’re curious I’ll happily provide it.)
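Since the benchmarking code is not shown, here is a rough sketch of what such a loop might look like. It is not the code used for this post: it relies on base R's system.time instead of rbenchmark, and two stats::spec.pgram variants stand in for the functions that were actually compared. The names methods, bench_one, and allbench are hypothetical.

```r
## Hypothetical stand-in for the benchmark loop (NOT the original code)
set.seed(1234)

## placeholder methods from 'stats'; the post compared other functions
methods <- list(
  raw      = function(x) stats::spec.pgram(x, plot = FALSE),
  smoothed = function(x) stats::spec.pgram(x, spans = c(3, 5), plot = FALSE)
)

bench_one <- function(n, replications = 5) {
  ## simulated series with the model given in the text
  x <- arima.sim(list(order = c(1, 1, 0), ar = 0.9), n)
  rows <- lapply(names(methods), function(nm) {
    f <- methods[[nm]]
    ## total time over all replications of one method
    tm <- system.time(for (i in seq_len(replications)) f(x))
    data.frame(test = nm, num_terms = n,
               user.self = tm[["user.self"]],
               sys.self  = tm[["sys.self"]],
               elapsed   = tm[["elapsed"]])
  })
  do.call(rbind, rows)
}

## one row per (method, series length)
allbench <- do.call(rbind, lapply(c(500, 1000, 2000), bench_one))
```

The result has the same kind of layout as the data used below: a test column, a num_terms column, and one column per timing variable.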

With a bit of thought (and trial and error for me), I found Hadley’s reshape2 and plyr packages made it straightforward to calculate the group statistics (note some prior steps are skipped for brevity, but the full code is linked at the end):

## attach plyr (for '.' and summarise) and ggplot2 (for mean_cl_normal, which requires Hmisc to be installed)
library(plyr)
library(ggplot2)

## reduce data.frame with melt
allbench.df.mlt <- reshape2::melt(allbench.df.drp, id.vars=c("test","num_terms"))

## calculate the summary information to be plotted:
## 'value' can be anything; here we take the mean, element [1,1] of ggplot2::mean_cl_normal,
## which wraps Hmisc::smean.cl.normal and also returns confidence limits based on the t-distribution
## 'summary' is not important for plotting -- it's just a name (labelled "medians" even though the statistic is a mean)
tmpd <- plyr::ddply(allbench.df.mlt, .(variable, num_terms), summarise, summary="medians", value=mean_cl_normal(value)[1,1])

## create copies of 'tmpd' for each test, and map them to one data.frame
tests <- unique(allbench.df$test)
allmeds <- plyr::ldply(lapply(X=tests, FUN=function(x,df=tmpd){df$test <- x; return(df)}))
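For readers who want to try the summarise-then-replicate idea without the benchmark data, the same two steps can be sketched in base R. This swaps aggregate and merge in for ddply and ldply, and the data frame toy holds simulated values rather than real timings; toy, pooled, and allsumm are hypothetical names.

```r
## Toy stand-in for the melted benchmark data (values are simulated)
set.seed(42)
toy <- expand.grid(test = c("methodA", "methodB", "methodC"),
                   num_terms = c(100, 1000, 10000),
                   variable = c("user.self", "elapsed"),
                   stringsAsFactors = FALSE)
toy$value <- runif(nrow(toy))

## one summary value per (variable, num_terms), pooled over all tests
pooled <- aggregate(value ~ variable + num_terms, data = toy, FUN = mean)

## replicate the pooled summary once per test;
## by = NULL makes merge() return the Cartesian product
tests <- data.frame(test = unique(toy$test), stringsAsFactors = FALSE)
allsumm <- merge(tests, pooled, by = NULL)
```

Every test now carries an identical copy of the pooled curve, which is what lets the same line appear in each facet column.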

Here’s the final result, after adding a ggplot2::geom_line layer with the allmeds data frame to the faceted plot:
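The layering step described above can be sketched with simulated data. This is only an illustration of the technique, not the code from the gist: the data frame bench, its values, and the plot styling are made up, and aggregate and merge stand in for the plyr calls.

```r
library(ggplot2)

## simulated stand-in for the benchmark data (not the real timings)
set.seed(42)
bench <- expand.grid(test = c("methodA", "methodB", "methodC"),
                     num_terms = c(100, 1000, 10000),
                     variable = c("user.self", "elapsed"),
                     replicate = 1:5)
bench$value <- runif(nrow(bench))

## pooled mean per row facet and x value, then one copy per column facet
pooled <- aggregate(value ~ variable + num_terms, data = bench, FUN = mean)
allmeds <- merge(data.frame(test = unique(bench$test)), pooled, by = NULL)

p <- ggplot(bench, aes(x = num_terms, y = value)) +
  geom_point(alpha = 0.4) +
  ## the summary layer gets its own data; 'test' puts a copy in each column
  geom_line(data = allmeds, colour = "red") +
  scale_x_log10() +
  facet_grid(variable ~ test)
```

Calling print(p) draws the figure: within each row, the red line is the same in all three columns, while the points differ by method.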

This type of visualization makes it easier to spot differences among subsets of data. Here, the lines help compare the benchmark information across methods (the facet columns). Of course, the stability of benchmark data depends on the number of replications, but we can see that the general shapes of the user.self and elapsed times are consistent across the three methods, and that the rlpSpec methods consume less sys.self time with increasing series length. Most surprising to me is the convergence of relative times with increasing series length. When the number of terms exceeds approximately 5000, the methods have roughly equal performance; below this threshold the spec.mtm method can be 2 to 3 times faster, which should not be too surprising given that it calls Fortran source code.

I assume there is a slick way to do this with ggplot2::stat_summary, but I was scratching my head trying to figure it out. Any insight into a better or easier way to do this is especially welcome!
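One possibly simpler route, offered as a sketch rather than a tested replacement for the gist: ggplot2 repeats a layer in every panel along a faceting dimension when that layer's data lacks the corresponding faceting variable. A summary data frame without the test column should therefore appear in every column on its own, with no copying step. As far as I can tell, stat_summary by itself does not help here, because it computes its statistic separately within each panel. The data below are simulated, and bench, pooled, and p2 are hypothetical names.

```r
library(ggplot2)

## simulated stand-in for the benchmark data (not the real timings)
set.seed(42)
bench <- expand.grid(test = c("methodA", "methodB", "methodC"),
                     num_terms = c(100, 1000, 10000),
                     variable = c("user.self", "elapsed"),
                     replicate = 1:5)
bench$value <- runif(nrow(bench))

## pooled summary WITHOUT the column-facet variable 'test'
pooled <- aggregate(value ~ variable + num_terms, data = bench, FUN = mean)

p2 <- ggplot(bench, aes(x = num_terms, y = value)) +
  geom_point(alpha = 0.4) +
  ## 'pooled' has 'variable' but no 'test', so it is repeated across columns
  geom_line(data = pooled, colour = "red") +
  scale_x_log10() +
  facet_grid(variable ~ test)
```

The trade-off is that the repetition is implicit: it depends on the summary data frame not having a test column, so the explicit copies used above may be easier to reason about later.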

Here is the code to produce the figure, as a gist. If you have any trouble accessing the data, please let me know.
