Site icon R-bloggers

Relearn boxplot and label the outliers

[This article was first published on StaTEAstics., and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< !DOCTYPE html>< !-- saved from url=(0014)about:internet --> < !-- Styles for R syntax highlighter --> < !-- R syntax highlighter --> < !-- MathJax scripts --> Despite the fact that box plot is used almost every where and taught at undergraduate statistic classes, I recently had to re-learn the box plot in order to know how to label the outliers.

This stackoverflow post was where I found how the outliers and whiskers of the Tukey box plots are defined in R and ggplot2:
In ggplot2, what do the end of the boxplot lines represent?

and this post on how to label the outliers using base graphics.
How to label all the outliers in a boxplot

Since the use of ggplot2 is required for this task, I have written some basic hack code to label the outliers for ggplot2.


Here are the codes:
## Install the FAOSTAT package to obtain the data

if(!is.element("FAOSTAT", .packages()))
    install.packages("FAOSTAT")
library(FAOSTAT)


## Download data on Cassava production
cp.lst = getFAOtoSYB(name = "cassava_production", domainCode = "QC",
    itemCode = 125, elementCode = 5510)


## Use the country level data, and take only data for 2011 and remove the NA's
cp.df = cp.lst$entity[!is.na(cp.lst$entity$cassava_production) &
                      cp.lst$entity$Year == 2011, ]

## Merge with the country profile to obtain the country names for labelling
ccp.df = merge(cp.df, FAOcountryProfile[, c("FAOST_CODE", "ABBR_FAO_NAME")],
    all.x = TRUE)


## Merge with the regional pofile to obtain the UNSD M49 macro region
## composition for multiple boxplot.
rcp.df = merge(ccp.df, FAOregionProfile[, c("FAOST_CODE", "UNSD_MACRO_REG")],
    all.x = TRUE)


## Compute the quantile
qrcp.df = ddply(.data = rcp.df, .variables = .(UNSD_MACRO_REG), transform,
    lQntl = quantile(cassava_production, probs = 0.25, na.rm = TRUE),
    uQntl = quantile(cassava_production, probs = 0.75, na.rm = TRUE))


## Compute the lower and upper bound which defines the outlier
brcp.df = ddply(.data = qrcp.df, .variables = .(UNSD_MACRO_REG), transform,
    lBound = lQntl - 1.5 * (uQntl - lQntl),
    uBound = uQntl + 1.5 * (uQntl - lQntl))



## Remove the country names if it is within the bounds
with(brcp.df, {
    brcp.df[cassava_production <= uBound &
            cassava_production >= lBound, "ABBR_FAO_NAME"] <<- NA
})

## Plot the data
set.seed(587)
ggplot(data = brcp.df, aes(x = UNSD_MACRO_REG, y = cassava_production)) +
    geom_boxplot(outlier.colour = NA) +
    geom_text(aes(label = ABBR_FAO_NAME), size = 2,
              position = position_jitter(width = 0.1)) +
    labs(x = NULL, y = NULL, title = "Production of Cassava by region")


Here is the final product, to avoid over-plotting of texts I have used position_jitter. Which is not an elegant solution but I just can not find any algorithm that works well in general.




To leave a comment for the author, please follow the link and comment on their blog: StaTEAstics..

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.