In this tutorial I will discuss how to detect outliers in a multivariate dataset without using the response variable. I will first cover outlier detection through threshold setting, and then show how to use the Mahalanobis distance instead.
The Problem
Outliers are data points that do not match the general character of the dataset. Often they are extreme values that fall outside of the “normal” range; one way of dealing with such values is to take out the highest and the lowest values of a variable. This can work quite well, but does not take into account variable combinations.
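As a minimal sketch of the "take out the highest and lowest values" approach (using simulated data, since the real dataset is linked further below), a variable can be trimmed at its 1st and 99th percentiles:

```r
# Trim the extreme 1% tails of a single variable (illustrative sketch)
set.seed(42)
df <- data.frame(height = rnorm(1000, mean = 172, sd = 8))

# Find the 1st and 99th percentile cutoffs
limits <- quantile(df$height, probs = c(0.01, 0.99))

# Keep only the rows inside the cutoffs
trimmed <- df[df$height >= limits[1] & df$height <= limits[2], , drop = FALSE]

nrow(trimmed)  # roughly 980 rows remain
```

Note that this looks at one variable at a time, which is exactly the limitation discussed above: it cannot catch unusual *combinations* of values.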
What is the problem with having outliers in the data? Sometimes there is no problem at all; in fact, outliers can help you understand special characteristics of the data. In other cases, outliers are simply mistakes in the data (i.e. noise); if you do not identify them, your predictive model will make less accurate predictions.
A Simple Example
The Height-Weight Dataset
To illustrate, I will use a sample dataset containing height and weight data of male adults. Below is the scatter plot showing height vs weight. The data points appear normally distributed, and some extreme values are visible that are not part of the "cloud" of points around the center.
Remove Outliers with Feature Thresholds
As mentioned already, one way to deal with outliers is to set minimum and maximum thresholds to mark outliers. In this case, after visual inspection, I set the following limits (for example purpose only – no science involved!):
- height outliers: above 187 cm or below 160 cm
- weight outliers: above 72 kg or below 41 kg
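Flagging points against these limits can be sketched as follows (the toy data frame and column names here are illustrative stand-ins for the real height/weight dataset):

```r
# Toy height (cm) / weight (kg) data standing in for the real dataset
set.seed(1)
df <- data.frame(
  height = rnorm(500, mean = 172, sd = 7),
  weight = rnorm(500, mean = 57, sd = 7)
)

# Flag a point as an outlier if either feature falls outside its limits
df$outlier_threshold <- ifelse(
  df$height > 187 | df$height < 160 | df$weight > 72 | df$weight < 41,
  "Yes", "No"
)

table(df$outlier_threshold)
```

Each feature is checked independently, which is why unusual combinations (tall but very light, for example) can slip through.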
Notice that the thresholds do detect many outliers – however, most of the detected points lie closer to the regression line than the outliers that were missed.
Why should the "missed" data points be considered outliers in the first place? From a model perspective, they lie far from the regression line, which means they will cause larger errors.
But it is more interesting to interpret these data points directly. Being a tall person does not make someone an exception – but being tall and having a very low weight does! One of the marked data points represents a person who is approx. 1.85 m (about 6 ft 1 in) tall, with a body weight of only 45 kg (99 pounds). Clearly this person is seriously underweight, and yet they slipped through the detection thresholds.
Use Mahalanobis Distance
The Mahalanobis distance is a measure of the distance between a point P and a distribution D, as explained here. I will not go into details as there are many related articles that explain more about it. I will only implement it and show how it detects outliers. The complete source code in R can be found on my GitHub page.
library(ggplot2)

# Calculate Mahalanobis Distance with height and weight distributions
m_dist <- mahalanobis(df[, 1:2], colMeans(df[, 1:2]), cov(df[, 1:2]))
df$m_dist <- round(m_dist, 2)

# Mahalanobis Outliers - Threshold set to 12
df$outlier_maha <- "No"
df$outlier_maha[df$m_dist > 12] <- "Yes"

# Scatterplot with Maha Outliers
ggplot(df, aes(x = weight, y = height, color = outlier_maha)) +
  geom_point(size = 5, alpha = 0.6) +
  labs(title = "Weight vs Height",
       subtitle = "Outlier Detection in weight vs height data - Using Mahalanobis Distances",
       caption = "Source: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights") +
  ylab("Height in cm") +
  xlab("Weight in kg") +
  scale_y_continuous(breaks = seq(160, 200, 5)) +
  scale_x_continuous(breaks = seq(35, 80, 5))
The scatterplot shows that all previously missed outliers were detected this time, and plenty of single feature “extreme” values were not declared as outliers.
Some data points that were also flagged by the threshold method were still detected – the ones at the very far ends of the scale. While these might not have a large effect on the regression model, they are still outliers: very tall or very short people, even at a normal weight, are rare and therefore fall into this category.
In this dataset I was using the response variable to detect outliers – this is usually not the case. In the next example I will use a dataset with more variables and try to detect outliers by using only the available predictor variables.
A Multivariate Example
The Housing Dataset
I will use a simplified version of the housing dataset, provided by Kaggle. The original data contains 80 predictor variables, but for the purpose of this tutorial I will reduce it to 4 predictor variables and the response variable:
- Response Variable: SalePrice
- Predictor Variables: GrLivArea, GarageYrBlt, LotArea, LotFrontage
library(dplyr)

# Housing Dataset
df <- read.csv("train.csv")

# Select only 5 features - SalePrice is the response variable
df <- df %>%
  select(SalePrice, GrLivArea, GarageYrBlt, LotArea, LotFrontage)

head(df)
#   SalePrice GrLivArea GarageYrBlt LotArea LotFrontage
# 1    208500      1710        2003    8450          65
# 2    181500      1262        1976    9600          80
# 3    223500      1786        2001   11250          68
# 4    140000      1717        1998    9550          60
# 5    250000      2198        2000   14260          84
# 6    143000      1362        1993   14115          85
When plotting the data and adding the linear model regression line, it shows how strongly a few outliers distort the model.
We will try to remove these outliers using the Mahalanobis distance, without including the response variable.
Calculating Mahalanobis Distance
By setting a threshold to the Mahalanobis Distance values calculated below, I am creating a binary outlier variable.
# Calculate Mahalanobis with predictor variables
df2 <- df[, -1]  # Remove SalePrice variable
m_dist <- mahalanobis(df2, colMeans(df2), cov(df2))
df$MD <- round(m_dist, 1)

# Binary Outlier Variable
df$outlier <- "No"
df$outlier[df$MD > 20] <- "Yes"  # Threshold set to 20
After marking outliers, we can see that the detection rate is quite good. Most major outliers were detected.
Rebuilding the Model
When removing the detected outliers and drawing a new regression line, the result is much better than before, though far from perfect.
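The refitting step can be sketched as follows. The simulated data frame and its `outlier` column are illustrative stand-ins for the housing data and the Mahalanobis flags created above:

```r
# Refit a linear model after dropping the flagged outliers (illustrative sketch;
# the "outlier" column stands in for the Mahalanobis flags from the previous step)
set.seed(7)
df <- data.frame(GrLivArea = runif(200, 400, 3000))
df$SalePrice <- 50 * df$GrLivArea + rnorm(200, sd = 20000)
df$outlier <- "No"
df$outlier[sample(200, 5)] <- "Yes"  # pretend these rows were flagged

# Keep only the rows not flagged as outliers
clean <- subset(df, outlier == "No")

# Fit the model on all data and on the cleaned data
fit_all   <- lm(SalePrice ~ GrLivArea, data = df)
fit_clean <- lm(SalePrice ~ GrLivArea, data = clean)

coef(fit_all)
coef(fit_clean)  # slope and intercept estimated without the flagged points
```

Comparing the two coefficient sets shows how much the flagged points were pulling the regression line.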
Some more experimenting with the detection threshold might help. Additional things you can try to improve results are:
- Experiment with different predictor variables
- Try single-feature thresholds – in this dataset, they might have worked quite well!
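For experimenting with the detection threshold, one common rule of thumb (not from the original post, but a frequently used convention) is to take the cutoff from a chi-square quantile: for roughly multivariate-normal data, squared Mahalanobis distances follow a chi-square distribution with degrees of freedom equal to the number of variables.

```r
# Chi-square-based cutoff for squared Mahalanobis distances
# (a common convention; assumes the data are roughly multivariate normal)
p <- 4                           # number of predictor variables used above
cutoff <- qchisq(0.975, df = p)  # 97.5% quantile of chi-square with p df
round(cutoff, 2)                 # about 11.14
```

Points whose distance exceeds this cutoff would then be flagged, which gives a principled starting value instead of a hand-picked one like 20.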
Conclusion
Dealing with outliers is a critical part of model development. Both discussed methods – single-feature thresholds and the Mahalanobis distance – provide good tools to detect point outliers. There are many others – so read up on them to make your predictive models more accurate.
Further References and Links
Wikipedia Article about Outliers
Wikipedia Article about Mahalanobis Distance
Eureka Statistics Article
Source Code and Data on my GitHub Page
Housing Data from Kaggle