How to Label Outliers in Boxplots in ggplot2?

[This article was first published on Data Science Tutorials, and kindly contributed to R-bloggers.]

The post How to Label Outliers in Boxplots in ggplot2? appeared first on Data Science Tutorials

This article walks through, step by step, how to label outliers in ggplot2 boxplots.

Step 1: Construct the data frame.

First, create a data frame that records the points scored by 60 basketball players across three teams, with 20 players per team.

Set a seed to make this example reproducible.

set.seed(123)

Now create the data frame:

df <- data.frame(team=rep(c('A', 'B', 'C'), each=20),
                 player=rep(LETTERS[1:20], times=3),
                 points=round(rnorm(n=60, mean=30, sd=10), 2))

Let’s view the first six rows of the data frame. The exact values depend on the seed, so your output may differ from the listing below.

head(df)
   team player points
1    A      A  37.84
2    A      B  42.60
3    A      C  40.96
4    A      D   5.78
5    A      E  37.65
6    A      F  24.98

Step 2: Define a Function to Identify Outliers

ggplot2 treats an observation as an outlier if it meets either of the following two criteria:

The observation falls below the first quartile (Q1) by more than 1.5 times the interquartile range.

The observation exceeds the third quartile (Q3) by more than 1.5 times the interquartile range.

We can write the following R function to flag any observation that meets either criterion as an outlier.

# Returns TRUE for values more than 1.5 * IQR below Q1 or above Q3
findoutlier <- function(x) {
  return(x < quantile(x, .25) - 1.5*IQR(x) | x > quantile(x, .75) + 1.5*IQR(x))
}
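
To make the rule concrete, here is a small worked example in base R. It is only a sketch: the vector x and the names lower_fence and upper_fence are hypothetical, chosen for illustration, and are not part of the article's data.

```r
# Same rule as the findoutlier() function defined in this step
findoutlier <- function(x) {
  x < quantile(x, .25) - 1.5 * IQR(x) | x > quantile(x, .75) + 1.5 * IQR(x)
}

# Hypothetical toy scores, not taken from the article's data frame
x <- c(10, 12, 11, 13, 12, 11, 40)

q1  <- quantile(x, .25)   # first quartile: 11
q3  <- quantile(x, .75)   # third quartile: 12.5
iqr <- IQR(x)             # interquartile range: 1.5

# Anything outside these two fences counts as an outlier
lower_fence <- q1 - 1.5 * iqr   # 8.75
upper_fence <- q3 + 1.5 * iqr   # 14.75

# Only the value 40 lies outside the fences
x[findoutlier(x)]
```

Every value except 40 sits between the two fences, so the function returns TRUE only for the last element.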

Step 3: Label Outliers in Boxplots in ggplot2

Next, load the packages needed to label outliers in ggplot2 boxplots:

library(ggplot2)
library(dplyr)

Then add a new column to the data frame that holds the points value for each outlier and NA for every other observation:

df <- df %>%
        group_by(team) %>%
        mutate(outlier = ifelse(findoutlier(points), points, NA))
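
Grouping by team matters because the quartiles and fences are then computed within each team rather than across all players. The base R sketch below illustrates this on a hypothetical toy data frame (demo is an invented name and its values are made up); it uses ave() as a stand-in for the group_by() and mutate() step, not as the article's method.

```r
findoutlier <- function(x) {
  x < quantile(x, .25) - 1.5 * IQR(x) | x > quantile(x, .75) + 1.5 * IQR(x)
}

# Hypothetical toy data: two teams with very different scoring levels
demo <- data.frame(
  team   = rep(c("A", "B"), each = 7),
  points = c(10, 12, 11, 13, 12, 11, 40,
             30, 32, 31, 33, 32, 31, 5)
)

# Ungrouped: the fences come from all 14 values, so nothing is flagged
pooled <- findoutlier(demo$points)
sum(pooled)

# Grouped: ave() applies the function separately within each team,
# keeping the points value for outliers and NA otherwise
demo$outlier <- ave(demo$points, demo$team,
                    FUN = function(p) ifelse(findoutlier(p), p, NA))

# Rows flagged as outliers within their own team
demo[!is.na(demo$outlier), ]
```

In this toy data the pooled check flags no rows, while the per-team check flags 40 on team A and 5 on team B, which is the behavior a boxplot drawn per team would show.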

Now create a box plot of points by team and label the outliers:

ggplot(df, aes(x=team, y=points)) +
  geom_boxplot() +
  geom_text(aes(label=outlier), na.rm=TRUE, hjust=-.5)
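
As a variation on this plot, the label layer can be given only the outlier rows through its own data argument, which avoids a label column padded with NA. This is a minimal sketch rather than the article's method: the demo data frame and the is_outlier column are hypothetical names, and the toy values are invented for illustration.

```r
library(ggplot2)

findoutlier <- function(x) {
  x < quantile(x, .25) - 1.5 * IQR(x) | x > quantile(x, .75) + 1.5 * IQR(x)
}

# Hypothetical toy data with one outlier per team
demo <- data.frame(
  team   = rep(c("A", "B"), each = 7),
  points = c(10, 12, 11, 13, 12, 11, 40,
             30, 32, 31, 33, 32, 31, 5)
)

# Logical flag per row, computed within each team
# (ave() returns numbers here, so convert 0/1 back to TRUE/FALSE)
demo$is_outlier <- as.logical(ave(demo$points, demo$team, FUN = findoutlier))

# The text layer receives only the flagged rows
p <- ggplot(demo, aes(x = team, y = points)) +
  geom_boxplot() +
  geom_text(data = subset(demo, is_outlier), aes(label = points), hjust = -0.5)

p
```

Because the text layer only sees the flagged rows, na.rm = TRUE is not needed in this version.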

Note that we can also label these outliers using a different variable.

For instance, to label the outliers by player name instead, replace points with player as the label value in the mutate() call.

library(ggplot2)
library(dplyr)
df <- df %>%
        group_by(team) %>%
        mutate(outlier = ifelse(findoutlier(points), player, NA))

Then create a box plot of points by team and label the outliers:

ggplot(df, aes(x=team, y=points)) +
  geom_boxplot() +
  geom_text(aes(label=outlier), na.rm=TRUE, hjust=-.5)

The outlier on team A is now labeled N and the outlier on team B is now labeled D, since these are the names of the players whose point totals are outliers.
