
Absolute Deviation Around the Median

[This article was first published on Kevin Davenport » R, and kindly contributed to R-bloggers.]

The Median Absolute Deviation (MAD), or Absolute Deviation Around the Median as stated in the title, is a robust measure of dispersion. Robust statistics are statistics that perform well for data drawn from a wide range of probability distributions, especially non-normal ones. Unlike the standard mean/standard deviation combo, the median and the MAD are not sensitive to the presence of outliers. This robustness is well illustrated by the median’s breakdown point (Donoho & Huber, 1983). The interquartile range is also resistant to the influence of outliers, although the mean absolute deviation and the median absolute deviation are convenient in that they can be rescaled into values that approximate the standard deviation.

Essentially, the breakdown point of an estimator (median, mean, variance, etc.) is the proportion of arbitrarily small or large extreme values that must be introduced into a sample to make the estimator yield an arbitrarily bad result. The median’s breakdown point is 0.5, or half; the mean’s is 0, because a single extreme value is enough to ruin it. This means that the median only becomes “bad” once about half or more of the observations are infinite.

For example:
If you have the ordered set [2, 6, 6, 12, 17, 25, 32], the median is 12 and the mean is about 14.29. If you replace 32 with +∞, the median stays the same (12), but the mean becomes infinite.
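The example can be checked directly in R. This is a minimal sketch; the variable names `x` and `x_inf` are chosen here for illustration.

```r
# The example set from the text
x <- c(2, 6, 6, 12, 17, 25, 32)

median(x)  # 12
mean(x)    # 100 / 7, roughly 14.29

# Replace the largest value with +Inf to mimic an extreme outlier
x_inf <- replace(x, x == 32, Inf)

median(x_inf)  # still 12
mean(x_inf)    # Inf
```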

If we want to use the MAD as a consistent estimator of the standard deviation, we must multiply the median of the absolute deviations by a constant “b” (sometimes written “K”) that depends on the distribution (Leys et al. 2012): MAD = b * median(|x_i - median(x)|). b = 1.4826 when dealing with normally distributed data, but we’ll need to calculate a new “b” if a different underlying distribution is assumed: b = 1/Q(0.75), where Q(0.75) is the 0.75 quantile of that underlying distribution (scaled to unit standard deviation).
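For the normal case, the constant can be recovered from the standard normal quantile function. The sketch below also runs a quick simulation as a sanity check; the seed, sample size, and the true standard deviation of 2 are arbitrary choices made for this illustration.

```r
# b = 1 / Q(0.75) for the standard normal distribution
b_normal <- 1 / qnorm(0.75)
round(b_normal, 4)  # 1.4826

# Sanity check on simulated normal data with a known standard deviation
set.seed(1)
z <- rnorm(1e5, mean = 0, sd = 2)

# Scaled median of absolute deviations; should land close to 2
est_sd <- b_normal * median(abs(z - median(z)))
est_sd
```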

To calculate the MAD, we find the median of absolute deviations from the median. In other words, the MAD is the median of the absolute values of the residuals (deviations) from the data’s median.

Using the same set from earlier:

  1. Subtract the median from each value: [(2 - 12), (6 - 12), (6 - 12), (12 - 12), (17 - 12), (25 - 12), (32 - 12)] = [-10, -6, -6, 0, 5, 13, 20]
  2. Take the absolute value of each deviation: [10, 6, 6, 0, 5, 13, 20]
  3. Sort the absolute deviations: [0, 5, 6, 6, 10, 13, 20]
  4. Find the median of the sorted list: 6
  5. Multiply by b: 6 * 1.4826 = 8.8956
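The same steps can be written out in base R. This is a sketch that mirrors the list above; the intermediate variable names are illustrative only.

```r
x <- c(2, 6, 6, 12, 17, 25, 32)

med <- median(x)            # 12
dev <- x - med              # -10 -6 -6 0 5 13 20
abs_dev <- abs(dev)         # 10 6 6 0 5 13 20
raw_mad <- median(abs_dev)  # 6 (median() sorts internally)

b <- 1.4826                 # constant for normally distributed data
scaled_mad <- b * raw_mad   # 8.8956
```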

We now have our MAD (8.8956) to use with a predetermined threshold. Going back to our example set’s median of 12, we can use +/- 2, 2.5, or 3 MAD as the cutoff. For example, with 2 MAD:
12 + 2*8.8956 = 29.7912 as our upper threshold
12 - 2*8.8956 = -5.7912 as our lower threshold

Using this criterion, we can identify 32 as an outlier in our example set of [2, 6, 6, 12, 17, 25, 32].
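A small helper makes the thresholding rule reusable. `mad_outliers` is a hypothetical function written for this illustration, not part of base R; the cutoff multiplier `k` defaults to 2 to match the example, and `b` assumes normally distributed data.

```r
# Hypothetical helper: flag values more than k scaled MADs from the median
mad_outliers <- function(x, k = 2, b = 1.4826) {
  center <- median(x)
  spread <- b * median(abs(x - center))
  lower <- center - k * spread
  upper <- center + k * spread
  list(lower = lower, upper = upper,
       outliers = x[x < lower | x > upper])
}

x <- c(2, 6, 6, 12, 17, 25, 32)
res <- mad_outliers(x)
res$lower     # -5.7912
res$upper     # 29.7912
res$outliers  # 32
```

Raising `k` to 2.5 or 3 widens the band, so fewer points are flagged.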

In R:

mad(x, center = median(x), constant = 1.4826,
    na.rm = FALSE, low = FALSE, high = FALSE)
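Applied to the example set, the built-in `mad()` from the stats package reproduces the hand calculation. Setting `constant = 1` returns the unscaled median of the absolute deviations; the variable names below are illustrative.

```r
x <- c(2, 6, 6, 12, 17, 25, 32)

scaled <- mad(x)                # default constant = 1.4826, gives 8.8956
unscaled <- mad(x, constant = 1)  # raw median absolute deviation, gives 6
```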
