[This article was first published on ExploringDataBlog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Over the years, I have looked at a lot of data contaminated with outliers, the subject of Chapter 7 of Exploring Data in Engineering, the Sciences, and Medicine . That chapter adopts the definition of an outlier presented by Barnett and Lewis in their book Outliers in Statistical Data 2nd Edition , that outliers are “data points inconsistent with the majority of values in a dataset.” The practical importance of outliers lies in the fact that even a few of them can cause standard data characterization and analysis procedures to give highly misleading results. Robust procedures have been developed to address these problems, and their performance in the face of outliers can be dramatically better than that of standard methods. On the other hand, robust procedures can also fail, and when they do, they often fail spectacularly.
The above figure is a plot of successive measurements of the makeup flow rate from an industrial manufacturing process. This example is introduced in Chapter 2 of Exploring Data and is discussed in detail in Chapter 7; the dataset itself is available as the dataset makeup.csv from the companion website
for the book. Each point in this dataset corresponds to a measurement of the flow rate of a liquid component in a process with a recycle stream: most of this component is recovered from a downstream processing step and recycled back to the upstream step, but a small amount of this material is lost to evaporation and must be “made up” by a separate feed stream. The data points plotted above correspond to the flow rate of this makeup stream. The quantile function in R computes the Tukey five number summary of the data sequence (i.e., the minimum value, lower quartile, median, upper quartile, and maximum value). Applied to this sequence, the results are:
> quantile(MakeUpFlow)
0% 25% 50% 75% 100%
0.00086 370.43747 393.35861 404.29602 439.59091
>
Looking at the plot, it is clear that these observations partition naturally into two subsets: those values between 0 and about 200 – all falling between the minimum value and the lower quartile – and those values between about 300 and 440. In fact, 12.2% of the data values lie between 0 and 10, and 22.3% of the values are less than 200. Physically, these smaller-than-average flow rates correspond to shutdown episodes, where the manufacturing process is either not running, is in the process of being shut down, or is in the process of being started back up. If we compute the mean of the complete dataset, we obtain an average makeup flow rate of 315.46, corresponding to the dotted line in the above plot. It is clear that while this number represents the center of the overall data distribution, it is not representative of either the shutdown episodes or normal process operation, since it forms a perfect dividing line between these two data subsets. If we exclude the shutdown episodes – specifically, if we omit all values smaller than 200 – and compute the mean of what remains, we obtain 397.56, corresponding to the solid line in the plot above. It is clear that this number much better represents “typical process operation.”
This example illustrates the well-known outlier sensitivity of the mean. An outlier-resistant alternative is the median, defined as follows: if the number N of data values is odd, the median is the middle element in the ordered list we obtain by sorting these values from smallest to largest; if N is even, the median is the average of the two middle elements in this list. Applied to the complete makeup flow rate dataset, the median gives a “typical” flow rate of 393.36, fairly close to the average value computed without the shutdown episodes. Excluding these episodes changes the median value only a little, to 399.50, again quite close to the mean value after we remove these anomalous data points.
The sensitivity of the standard deviation to outliers is even worse than that of the mean. For the complete dataset, the computed standard deviation is 155.05, an order of magnitude larger than the value (15.22) computed from the dataset without the shutdown episodes. While it is not as well known as the median, an outlier-resistant alternative to the standard deviation is the MADM scale estimator, constructed as follows. First, compute the median as a reference value and compute the differences between every data point and this reference value. Then, take the absolute values of these differences, giving a set of non-negative numbers that tell us how far each point lies from the median reference value. Next, we compute the median of this sequence of absolute differences, which tells us how far a typical data point lies from the reference value. Because we have used the outlier-resistant median both to compute the reference value and to compute this “typical distance” number, the result is highly outlier-resistant. The only difficulty is that, for approximately normally distributed data with no outliers, this number is consistently smaller than the standard deviation. To obtain an alternative estimator of the standard deviation – i.e., a number that is approximately equal to the standard deviation when we compute it from a large, outlier-free collection of approximately Gaussian data – we need to multiply the number just defined by 1.4826. (A derivation of this result is given in Chapter 7 of Exploring Data.) This bias-corrected scale estimate is available as the built-in function mad in R, and applying it to the makeup flow rate dataset gives a scale estimate of 20.22, about the same magnitude as the standard deviation of the data sequence with the shutdown episodes removed. The MADM scale estimate is not completely immune to outliers, so when it is applied to the dataset without the shutdown episodes, it gives a somewhat smaller scale estimate of 13.40. Still, both of these estimates are much more consistent with the standard deviation of the normal operation data than the uncorrected standard deviation computed from the complete dataset.
The key point of this discussion has been to provide a simple, real-data illustration of both the considerable outlier sensitivity of the “standard” location and scale estimators (i.e., the mean and standard deviation) and the existence and performance of robust alternatives to both (i.e., the median and MADM scale estimator). As I noted at the beginning of this discussion, however, while robust estimators can work extremely well – as in this example – they can also fail spectacularly. In general, robust estimators are designed to protect us against anomalous observations that occur out in the tails of the distribution. In the case of “light-tailed” distributions like the uniform distribution, robust measures like the MADM scale estimator can exhibit surprising behavior. In particular, in the absence of outliers, we expect the standard deviation and the MADM scale estimate to give us roughly comparable results. If the standard deviation is much larger than the MADM scale estimate – as in the makeup flow rate example just discussed – this can usually be taken as an indication of either a naturally heavy-tailed data distribution (e.g., the Student’s t distribution with few degrees of freedom), or, as in this example, the presence of contaminating outliers. In the face of light-tailed data distributions, however, the MADM scale estimate can be substantially larger than the standard deviation, and we need to be aware of this fact if we use the MADM scale estimator as a measure of data spread.
More serious problems arise in the case of coarsely quantized data. In particular, it was once popular in engineering applications to record variables like temperature only to single digit accuracy (e.g., temperature to the nearest tenth of a degree). This practice was motivated both by limited memory available in early process monitoring computers, and by a recognition that this was roughly the limit of measurement accuracy, anyway. This kind of data truncation does degrade traditional characterizations like the standard deviation somewhat, but the consequences for the MADM scale estimate can be catastrophic. The plot above shows a sequence of 100 Gaussian data samples, with mean 0 and standard deviation 0.15, represented two ways. The line represents the original Gaussian data sample, while the solid circles correspond to these data values truncated to a single digit (e.g., the first value in the original data sequence is -0.11658569, which gets truncated to -0.1). Both the standard deviation and the MADM scale estimate are reasonably accurate when computed from the unmodified data sequence: the standard deviation is 0.137, while the MADM scale estimate is 0.133, both within about 10% of the correct value of 0.150. If we compute the standard deviation from the truncated data samples, the value declines to 0.103, reflecting the fact that by truncating the data from the 9 digits generated by the random number generator to 1, we are consistently reducing the magnitude of each number, so now our estimation error is more like 30%. Here, however, the MADM scale estimate fails completely, returning the value 0. The reason for this is that, on truncation, 55% of these data values are zero (note that any value between -0.09 and 0.09 will be truncated to zero, and these data values occur quite commonly in a zero-mean, normally distributed data sequence with standard deviation 0.15). Whenever more than 50% of the data values are equal, we don’t need to bother computing either the median or the MADM scale estimator. In particular, the median of this data sequence is simply equal to this majority value. Similarly, since more than 50% of the observed data values lie a distance of zero from the median, the median of the absolute deviation sequence on which the MADM scale estimate is based is necessarily zero. Thus, the MADM scale for any such sequence is zero, as in this example; this behavior is commonly called implosion.
This coarse quantization example indirectly illustrates one of the key distinctions between continuous and discrete variables: for a continuously distributed random variable, no two observations can have exactly the same value, but for discrete random variables, such ties are characteristic. The coarse quantization example represents something in between, being a discrete approximation of a continuous variable. The distinction is important because it is precisely these ties that were responsible for the complete failure of the MADM scale estimator in the example just described. There are other cases where the underlying data distribution has both continuous and discrete components, and these cases can also cause rank- and order-based data characterizations like the MADM scale estimator to behave very badly. The plot above shows rainfall measurements made every half hour for two months (a total of 732 data points), included in the dataset associated with Chapter 14 (“Half-hourly Precipitation and Streamflow, River Hirnant, Wales, U.K., November and December, 1972”) from the book Data:: A Collection of Problems from Many Fields for the Student and Research Worker (Springer Series in Statistics) , by D.F. Andrews and A.M. Herzberg. It was noted above that the built-in function quantile in R generates the Tukey five-number summary for a data sequence, but by specifying the prob parameter, it is possible to obtain finer-grained data characterizations. For example, the following command gives the deciles of this rainfall dataset:
>quantile(Rain,prob=seq(0,1,0.1))
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.040 0.146 0.440 2.800
>
It is clear from these results that more than 60% of these recorded rainfall values are identically zero, reflecting the fact that it does not rain most days. As with the coarse quantization example discussed above, the MADM scale estimator implodes for this example, returning the value zero. Similarly, the median value of these numbers is also zero, but the mean is not (specifically, it is 0.136), and the standard deviation is not (this value is 0.354).
Even though the mean and standard deviation give more reasonable-sounding results for the rainfall dataset than their robust counterparts do, it is not completely clear how to interpret these numbers. In particular, we might be interested in asking any or all of the following three types of question:
1. What is the typical rainfall for this region in a month? (or in a week, in the entire two-month period, in a day, etc.)
2. How frequently does it rain?
3. When it does rain, what is the average amount?
Characterizations like the mean and standard deviation may be used to address the first of these three questions, but not the other two. As a preliminary answer to the second question, we can note that rainfall occurs in 34.6% of the half-hourly intervals recorded here, but this requires a slightly more detailed look at the data than what the mean and standard deviation give us. Additional work would be required if we wanted an estimate of how probable it was to rain on any given day, since we would need to summarize the data at the level of days instead of half-hourly observations. Finally, to address questions of the third type, it is necessary to segment the dataset into a “rainy part” and a “dry part.” Note that this segmentation is analogous to that made for the makeup flow rate dataset, but the difference is that here we are interested in characterizing the minority part of the dataset (i.e., “the outliers”) instead of the majority part, which is what the robust estimators are good at. Special types of statistical models have been developed to deal with these situations, and I will discuss some of these in my next post. For now, I will conclude with two key observations.
The first is that of Collin Mallows (C.L. Mallows, “Robust methods – some examples of their use,” American Statistician, vol. 33, 1979, pages 179-184), which I quote at the end of Chapter 7 of Exploring Data:
“A simple and useful strategy is to perform one’s analysis both robustly and by standard methods and to compare the results. If the differences are minor, either set can be presented. If the differences are not, one must perforce consider why not, and the robust analysis is already at hand to guide the next steps.The importance of these considerations is enhanced when we are dealing with large amounts of data, since then examining all the data in detail is impractical, and we are forced to contemplate working with data that is, almost certainly, partially bad and with models that are almost certainly inadequate.”
The second observation is that Mallows advice is practical for almost any type of data analysis we want to consider, since robust methods exist to handle a very wide variety of different analysis tasks. An excellent introduction to many of these robust methods is given by Rand Wilcox in his book Introduction to Robust Estimation and Hypothesis Testing, Second Edition (Statistical Modeling and Decision Science) , which includes R procedures for implementing most of the methods he discusses.
To leave a comment for the author, please follow the link and comment on their blog: ExploringDataBlog.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.