In this tutorial I will discuss how to detect outliers in a multivariate dataset without using the response variable. I will first cover outlier detection through threshold setting, and then show how to use the Mahalanobis distance instead.
The Problem
Outliers are data points that do not match the general character of the dataset. Often they are extreme values that fall outside of the “normal” range; one way of dealing with such values is to take out the highest and the lowest values of a variable. This can work quite well, but does not take into account variable combinations.
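As a minimal sketch of the "take out the highest and lowest values" approach (using simulated data, since the real dataset is linked further below), a variable can be trimmed at its 1st and 99th percentiles:

```r
# Trim the extreme 1% tails of a single variable (illustrative sketch)
set.seed(42)
df <- data.frame(height = rnorm(1000, mean = 172, sd = 8))

# Find the 1st and 99th percentile cutoffs
limits <- quantile(df$height, probs = c(0.01, 0.99))

# Keep only the rows inside the cutoffs
trimmed <- df[df$height >= limits[1] & df$height <= limits[2], , drop = FALSE]

nrow(trimmed)  # roughly 980 rows remain
```

Note that this looks at one variable at a time, which is exactly the limitation discussed above: it cannot catch unusual *combinations* of values.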
What is the problem with having outliers in the data? Sometimes there is no problem at all; in fact, outliers can help you understand special characteristics of the data. In other cases, outliers are simply mistakes in the data (i.e. noise); if you do not identify them, your predictive model will make less accurate predictions.
A Simple Example
The Height-Weight Dataset
To illustrate, I will use a sample dataset containing height and weight data of male adults. Below is the scatter plot showing height vs weight. The data points appear normally distributed, and some extreme values are visible that are not part of the "cloud" of points around the center.
Remove Outliers with Feature Thresholds
As mentioned already, one way to deal with outliers is to set minimum and maximum thresholds to mark outliers. In this case, after visual inspection, I set the following limits (for example purpose only – no science involved!):
- height outliers: above 187 cm or below 160 cm
- weight outliers: above 72 kg or below 41 kg
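Flagging points against these limits can be sketched as follows (the toy data frame and column names here are illustrative stand-ins for the real height/weight dataset):

```r
# Toy height (cm) / weight (kg) data standing in for the real dataset
set.seed(1)
df <- data.frame(
  height = rnorm(500, mean = 172, sd = 7),
  weight = rnorm(500, mean = 57, sd = 7)
)

# Flag a point as an outlier if either feature falls outside its limits
df$outlier_threshold <- ifelse(
  df$height > 187 | df$height < 160 | df$weight > 72 | df$weight < 41,
  "Yes", "No"
)

table(df$outlier_threshold)
```

Each feature is checked independently, which is why unusual combinations (tall but very light, for example) can slip through.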
Notice that the thresholds do detect many outliers – however, most of the detected points lie closer to the regression line than the outliers that were missed.
Why should the "missed" data points be considered outliers in the first place? From a model perspective, they lie far from the regression line, which means they will cause larger errors.
But it is more interesting to interpret these data points directly. Being a tall person does not make someone an exception – but being tall and having a very low weight does! One of the marked data points represents a person who is approx. 1.85 m (about 6 ft 1 in) tall, with a body weight of only 45 kg (99 pounds). Clearly this person is seriously underweight, and yet they slipped through the detection thresholds.
Use Mahalanobis Distance
The Mahalanobis distance is a measure of the distance between a point P and a distribution D, as explained here. I will not go into details as there are many related articles that explain more about it. I will only implement it and show how it detects outliers. The complete source code in R can be found on my GitHub page.
library(ggplot2)

# Calculate Mahalanobis Distance with height and weight distributions
m_dist <- mahalanobis(df[, 1:2], colMeans(df[, 1:2]), cov(df[, 1:2]))
df$m_dist <- round(m_dist, 2)

# Mahalanobis Outliers - Threshold set to 12
df$outlier_maha <- "No"
df$outlier_maha[df$m_dist > 12] <- "Yes"

# Scatterplot with Maha Outliers
ggplot(df, aes(x = weight, y = height, color = outlier_maha)) +
  geom_point(size = 5, alpha = 0.6) +
  labs(title = "Weight vs Height",
       subtitle = "Outlier Detection in weight vs height data - Using Mahalanobis Distances",
       caption = "Source: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights") +
  ylab("Height in cm") +
  xlab("Weight in kg") +
  scale_y_continuous(breaks = seq(160, 200, 5)) +
  scale_x_continuous(breaks = seq(35, 80, 5))
The scatterplot shows that all previously missed outliers were detected this time, and plenty of single feature “extreme” values were not declared as outliers.
Some data points that were also flagged by the threshold method were still detected – the ones at the very far ends of the scale. While these might not have a large effect on the regression model, they are still outliers: very tall or very short people, even at a normal weight, are rare and therefore fall into this category.
In this dataset I was using the response variable to detect outliers – this is usually not the case. In the next example I will use a dataset with more variables and try to detect outliers by using only the available predictor variables.
A Multivariate Example
The Housing Dataset
I will use a simplified version of the housing dataset, provided by Kaggle. The original data contains 80 predictor variables, but for the purpose of this tutorial I will reduce it to 4 predictor variables and the response variable:
- Response Variable: SalePrice
- Predictor Variables: GrLivArea, GarageYrBlt, LotArea, LotFrontage
library(dplyr)

# Housing Dataset
df <- read.csv("train.csv")

# Select only 5 features - SalePrice is the response variable
df <- df %>%
  select(SalePrice, GrLivArea, GarageYrBlt, LotArea, LotFrontage)

head(df)
#   SalePrice GrLivArea GarageYrBlt LotArea LotFrontage
# 1    208500      1710        2003    8450          65
# 2    181500      1262        1976    9600          80
# 3    223500      1786        2001   11250          68
# 4    140000      1717        1998    9550          60
# 5    250000      2198        2000   14260          84
# 6    143000      1362        1993   14115          85
When plotting the data and adding the linear model regression line, it shows how strongly a few outliers distort the model.
We will try to remove these outliers using the Mahalanobis distance, without including the response variable.
Calculating Mahalanobis Distance
By setting a threshold to the Mahalanobis Distance values calculated below, I am creating a binary outlier variable.
# Calculate Mahalanobis with predictor variables
df2 <- df[, -1]  # Remove SalePrice variable
m_dist <- mahalanobis(df2, colMeans(df2), cov(df2))
df$MD <- round(m_dist, 1)

# Binary Outlier Variable
df$outlier <- "No"
df$outlier[df$MD > 20] <- "Yes"  # Threshold set to 20
After marking outliers, we can see that the detection rate is quite good. Most major outliers were detected.
Rebuilding the Model
When removing the detected outliers and drawing a new regression line, the result is much better than before, though far from perfect.
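The refitting step can be sketched as follows. The simulated data frame and its `outlier` column are illustrative stand-ins for the housing data and the Mahalanobis flags created above:

```r
# Refit a linear model after dropping the flagged outliers (illustrative sketch;
# the "outlier" column stands in for the Mahalanobis flags from the previous step)
set.seed(7)
df <- data.frame(GrLivArea = runif(200, 400, 3000))
df$SalePrice <- 50 * df$GrLivArea + rnorm(200, sd = 20000)
df$outlier <- "No"
df$outlier[sample(200, 5)] <- "Yes"  # pretend these rows were flagged

# Keep only the rows not flagged as outliers
clean <- subset(df, outlier == "No")

# Fit the model on all data and on the cleaned data
fit_all   <- lm(SalePrice ~ GrLivArea, data = df)
fit_clean <- lm(SalePrice ~ GrLivArea, data = clean)

coef(fit_all)
coef(fit_clean)  # slope and intercept estimated without the flagged points
```

Comparing the two coefficient sets shows how much the flagged points were pulling the regression line.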
Some more experimenting with the detection threshold might help. Additional things you can try to improve results are:
- Experiment with different predictor variables
- Try single-feature thresholds – in this dataset, they might have worked quite well!
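For experimenting with the detection threshold, one common rule of thumb (not from the original post, but a frequently used convention) is to take the cutoff from a chi-square quantile: for roughly multivariate-normal data, squared Mahalanobis distances follow a chi-square distribution with degrees of freedom equal to the number of variables.

```r
# Chi-square-based cutoff for squared Mahalanobis distances
# (a common convention; assumes the data are roughly multivariate normal)
p <- 4                           # number of predictor variables used above
cutoff <- qchisq(0.975, df = p)  # 97.5% quantile of chi-square with p df
round(cutoff, 2)                 # about 11.14
```

Points whose distance exceeds this cutoff would then be flagged, which gives a principled starting value instead of a hand-picked one like 20.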
Conclusion
Dealing with outliers is a critical part of model development. Both discussed methods – single-feature thresholds and the Mahalanobis distance – provide good tools to detect point outliers. There are many others – so read up on them to make your predictive models more accurate.
Further References and Links
Wikipedia Article about Outliers
Wikipedia Article about Mahalanobis Distance
Eureka Statistics Article
Source Code and Data on my GitHub Page
Housing Data from Kaggle