In this post I present a function that helps to label outlier observations when plotting a boxplot using R.
An outlier is an observation that is numerically distant from the rest of the data. When reviewing a boxplot, an outlier is defined as a data point that is located outside the fences (“whiskers”) of the boxplot (e.g., more than 1.5 times the interquartile range above the upper quartile or below the lower quartile).
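To make the rule concrete, here is a minimal sketch of how the fences can be computed by hand. It assumes the default multiplier of 1.5 and uses the hinges returned by fivenum(), which is what boxplot.stats() works with. The variable names are only illustrative.

```r
set.seed(482)
y <- rnorm(100)

# Lower and upper hinges (roughly the first and third quartiles)
hinges <- fivenum(y)[c(2, 4)]
iqr <- diff(hinges)

# Fences: 1.5 times the interquartile range beyond each hinge
lower_fence <- hinges[1] - 1.5 * iqr
upper_fence <- hinges[2] + 1.5 * iqr

# Indices of the observations that fall outside the fences
outlier_idx <- which(y < lower_fence | y > upper_fence)

# The hand-computed outliers and the ones reported by boxplot.stats()
y[outlier_idx]
boxplot.stats(y)$out
```

Keeping the indices (rather than only the values) is what makes it possible to attach a label to each outlier later on.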
Identifying these points in R is very simple when dealing with only one boxplot and a few outliers. That can easily be done using the “identify” function in R. For example, running the code below will plot a boxplot of a hundred observations sampled from a normal distribution, and will then enable you to pick the outlier point and have its label (in this case, its index number) plotted beside the point:
```r
set.seed(482)
y <- rnorm(100)
boxplot(y)
identify(rep(1, length(y)), y, labels = seq_along(y))
```
However, this solution is not scalable when dealing with:
- Many outliers
- Overlapping data-points, and
- Multiple boxplots in the same graphic window
For such cases I recently wrote the function “boxplot.with.outlier.label” (which you can download from here). This function operates in a similar way to “boxplot” (formula), with the added option of defining “label_name”. When outliers are present, the function will then proceed to mark all of them using the label_name variable. This function can handle interaction terms and will also try to space the labels so that they won’t overlap (my thanks go to Greg Snow for his function “spread.labs” from the {TeachingDemos} package, and for his helpful comments on the R-help mailing list).
Here is some example code you can try out for yourself:
```r
# Load the function
source("http://www.r-statistics.com/wp-content/uploads/2011/01/boxplot-with-outlier-label-r.txt")

# sample some points and labels for us:
set.seed(492)
y <- rnorm(2000)
x1 <- sample(letters[1:2], 2000, replace = TRUE)
x2 <- sample(letters[1:2], 2000, replace = TRUE)
lab_y <- sample(letters[1:4], 2000, replace = TRUE)

# plot a boxplot with interactions:
boxplot.with.outlier.label(y ~ x2 * x1, lab_y)
```
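For a rough idea of what labeling outliers across several boxes involves, here is a minimal base-R sketch. It is not the code of boxplot.with.outlier.label. It relies only on the list that boxplot() returns (its “out” and “group” components) and assumes the y values are unique, so that each outlier can be matched back to its label. It does not try to stop the labels from overlapping.

```r
set.seed(492)
y <- rnorm(2000)
x1 <- sample(letters[1:2], 2000, replace = TRUE)
lab_y <- sample(letters[1:4], 2000, replace = TRUE)

# boxplot() invisibly returns the statistics it plotted:
# b$out holds the outlier values, b$group the index of the box each one belongs to
b <- boxplot(y ~ x1)

# Look up the label of each outlier (assumes no duplicated y values)
out_labels <- lab_y[match(b$out, y)]

# Write each label just to the right of its point
if (length(b$out) > 0) {
  text(x = b$group, y = b$out, labels = out_labels, pos = 4, cex = 0.8)
}
```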
You can also try running the following code to see how it handles simpler cases:
```r
# plot a boxplot without interactions:
boxplot.with.outlier.label(y ~ x1, lab_y, ylim = c(-5, 5))
# plot a boxplot of y only
boxplot.with.outlier.label(y, lab_y, ylim = c(-5, 5))
# here the labels will overlap (because I turned spread_text off)
boxplot.with.outlier.label(y, lab_y, spread_text = FALSE)
```
Here is the output of the last example, showing how the plot looks when we allow for the text to overlap.
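To illustrate the idea behind spreading labels apart, here is a minimal sketch of a greedy, one-pass approach. It is not the algorithm used by “spread.labs”. The helper name spread_positions and its min_gap argument are hypothetical and serve only to show the principle: sort the label positions, then push each one up until it is at least min_gap above the previous one.

```r
# Hypothetical helper (not part of {TeachingDemos}): greedy one-pass spreading
spread_positions <- function(x, min_gap) {
  ord <- order(x)
  sorted <- x[ord]
  # Walk upwards, moving each position just far enough from the one below it
  for (i in seq_along(sorted)[-1]) {
    sorted[i] <- max(sorted[i], sorted[i - 1] + min_gap)
  }
  # Return the adjusted positions in the original order
  result <- numeric(length(x))
  result[ord] <- sorted
  result
}

# Three labels crowded near 3 and one isolated label near -2.9
label_y <- c(3.0, 3.05, 3.1, -2.9)
spread_y <- spread_positions(label_y, min_gap = 0.2)
spread_y
```

The adjusted positions would be used for the text only, with a short segment drawn from each label back to its actual point.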
Regarding package dependencies: note that this function requires you to first install the packages {TeachingDemos} (by Greg Snow) and {plyr} (by Hadley Wickham).
Updates:
- 19.04.2011 – I’ve added support for the boxplot “names” and “at” parameters.
- 31.10.2011 – I’ve fixed a reported bug (my thanks go to Josh O’Brien for the heads up). There is now also support for two arguments that make it easy to change the distance of the labels/segments from the outliers.
You are very much invited to leave your comments if you find a bug, think of ways to improve the function, or simply enjoyed it and would like to share it with me.