Fix Overplotting with Colored Contour Lines
[This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I saw this plot in the supplement of a recent paper comparing microarray results to RNA-seq results. Nothing earth-shattering in the paper – you’ve probably seen a similar comparison many times before – but I liked how they solved the overplotting problem using heat-colored contour lines to indicate density. I asked how to reproduce this figure using R on Stack Exchange, and my question was quickly answered by Christophe Lalanne.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Here’s the R code to generate the data and all the figures here.
Here’s the problem: there are 50,000 points in this plot causing extreme overplotting. (This is a simple multivariate normal distribution, but if the distribution were more complex, overplotting might obscure a relationship in the data that you didn’t know about).
I liked the solution they used in the paper referenced above. Contour lines were placed throughout the data indicating the density of the data in that region. Further, the contour lines were “heat” colored from blue to red, indicating increasing data density. Optionally, you can add vertical and horizontal lines that intersect the means, and a legend that includes the absolute correlation coefficient between the two variables.
There are many other ways to solve an overplotting problem – reducing the size of the points, making points transparent, using hex-binning.
Using a single pixel for each data point:
Using hexagonal binning to display density (hexbin package):
Finally, using semi-transparency (10% opacity; easiest using the ggplot2 package):
Edit July 7, 2012 – From Pete’s comment below, the smoothScatter() function in the build in graphics package produces a smoothed color density representation of the scatterplot, obtained through a kernel density estimate. You can change the colors using the colramp option, and change how many outliers are plotted with the nrpoints option. Here, 100 outliers are plotted as single black pixels – outliers here being points in the areas of lowest regional density.
How do you deal with overplotting when you have many points?
To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.