A Kernel Density Approach to Outlier Detection
I describe a kernel density approach to outlier detection on small datasets. In particular, my motivating dataset is the set of prices for a given item that can be found online.
Introduction
Suppose you’re searching online for the cheapest place to buy a stats book you need for class.
You initially find the following prices:
$50 - Amazon
$55 - Barnes & Noble
$52 - Walmart
$50 - Borders
You’re about to buy it from Amazon when you whimsically decide to try searching again. This time you find an eBay listing for the book at the low price of $30.
Should you buy it from eBay? Or is the listing just a scam? What if, instead, the price distribution of the other sellers were more widely dispersed?
$50 - Amazon
$75 - Barnes & Noble
$60 - Walmart
$90 - Borders
In other words, given the price distribution of the other sellers, you’d like to know how likely it is that a price of $30 is legitimate.
I describe a solution to this problem based on kernel density estimation.
Other Approaches
Let’s consider some other approaches first. Suppose we have a set of prices $P$ containing a price $p$ that is strictly less than all the other prices in $P$.
A Normal Approach
One simple approach to determining how cautious we should be of the price $p$ is to calculate the mean $\mu$ and standard deviation $\sigma$ of $P$, and flag $p$ if it falls some number of standard deviations below the mean, i.e., if $p < \mu - k\sigma$ for some $k$.
However, this approach is flawed for several reasons. First, what if the prices in $P$ are all equal, say $50 each, so that $\sigma = 0$? Then any $p$ below $50 will be flagged, even $p = 49.99$, which we don’t want.
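As a minimal sketch in R (the three-standard-deviation threshold and the function name are my assumptions for illustration, not specifics from the original), the approach and its failure mode look like this:

```r
# Flag p if it falls more than k standard deviations below the mean.
flag_zscore <- function(p, prices, k = 3) {
  mu    <- mean(prices)
  sigma <- sd(prices)
  p < mu - k * sigma
}

flag_zscore(30, c(50, 55, 52, 50))     # TRUE: $30 is flagged
flag_zscore(49.99, c(50, 50, 50, 50))  # TRUE: sigma = 0, so even $49.99 is flagged
```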
A simple heuristic (say, flag $p$ only if it also falls at least some fixed dollar amount below $\mu$) can fix this problem, though. A more glaring problem is that the approach assumes prices follow a normal distribution, which they often do not. For example, bimodal distributions such as the following
$20 - Amazon
$25 - Barnes & Noble
$22 - Walmart
$21 - Borders
$45 - eBay
$40 - Books Inc.
$43 - Waldenbooks
often arise, in which sellers seem to fall into two sets, one with higher prices and one with lower. (For example, perhaps a paperback version of a book was just released, and the megastore booksellers have slashed the price of the hardback version accordingly, while the small-scale booksellers have yet to react.)
In general, it is unclear what distribution prices follow (they are not lognormally distributed either), so it seems that we must look at non-parametric approaches instead.
An IQR Approach
One non-parametric method (which we all learn in grade school) is the interquartile range method, in which we calculate the interquartile range of $P$ (the difference between the price falling at the 75th percentile of $P$ and the price falling at the 25th percentile of $P$) and view $p$ with caution if it falls below $Q_1 - k \cdot IQR$ for some positive constant $k$, where $Q_1$ is the 25th percentile.
Again, though, this method fails. For example, in the bimodal distribution above, the 75th percentile of the prices is 41.50 and the 25th percentile is 21.50, giving a rather large interquartile range of 20.00. Even with a conservative choice of $k = 1.5$, we flag outliers only if they fall below $21.50 - 1.5 \cdot 20.00 = -8.50$, which doesn’t seem correct.
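A quick check in R, using the conventional $k = 1.5$ (variable names are illustrative):

```r
prices <- c(20, 25, 22, 21, 45, 40, 43)
q      <- quantile(prices, c(0.25, 0.75))  # 21.5 and 41.5
iqr    <- unname(q[2] - q[1])              # 20
fence  <- unname(q[1]) - 1.5 * iqr         # -8.5
fence  # a price would have to be negative to be flagged
```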
We need, then, a non-parametric approach which also takes into account the dispersion and clusters of prices.
A Kernel Density Approach
Recall that the kernel density estimate of a price $p$, given a set of prices $P = \{p_1, \ldots, p_n\}$, is

$$\hat{f}(p) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{p - p_i}{h}\right),$$

where $K$ is some kernel function and $h$ is a bandwidth parameter.
In my tests, I used a Gaussian kernel with bandwidth $h$,

$$K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.$$
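A minimal sketch of this estimator in R, assuming a Silverman-style rule-of-thumb default for the bandwidth (the original bandwidth choice and the function name are my assumptions for illustration):

```r
# Kernel density estimate with a Gaussian kernel:
# (1 / (n * h)) * sum of K((p - p_i) / h) over all observed prices p_i.
kde <- function(p, prices,
                h = 1.06 * sd(prices) * length(prices)^(-1/5)) {
  mean(dnorm((p - prices) / h)) / h
}
```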
For example, given a bimodal set of original prices clustered around $50 and $90, the kernel density estimates over a range of new prices show two humps, one around 50 and one around 90, matching the two clusters in our original set of prices.
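A rough illustration using the kde() sketch above; the original price list was not preserved, so the values below are assumed to match the described clusters, and the bandwidth $h = 5$ is hand-picked for the plot:

```r
# Hypothetical bimodal price set, clustered around $50 and $90.
orig_prices <- c(48, 50, 52, 55, 88, 90, 92)
grid <- seq(20, 120, by = 0.5)
plot(grid, sapply(grid, kde, prices = orig_prices, h = 5), type = "l",
     xlab = "price", ylab = "kernel density estimate")
```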
Finally, to take care of cases where all prices were widely dispersed, I compared the kernel density estimate of $p$ to the average kernel density estimate over all prices in $P$,

$$\text{score}(p) = \frac{\hat{f}(p)}{\frac{1}{n} \sum_{i=1}^{n} \hat{f}(p_i)},$$

and flagged $p$ if its outlier score fell below some threshold.
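A minimal sketch of this score in R, building on the kde() function above (the function name is illustrative):

```r
# Density at p, relative to the average density at the observed prices.
outlier_score <- function(p, prices) {
  kde(p, prices) / mean(sapply(prices, kde, prices = prices))
}

outlier_score(30, c(50, 55, 52, 50))  # near 0: $30 looks very suspicious
outlier_score(30, c(50, 75, 60, 90))  # noticeably larger: plausible given the dispersion
```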
Adding Sophistication with Local Comparisons
While the attempt above accounts for price dispersion in our outlier score by dividing by the average kernel density estimate over all prices, a better approach (e.g., in the case of bimodal distributions) would be to take the average only over points near $p$:

$$\text{score}_{\text{local}}(p) = \frac{\hat{f}(p)}{\frac{1}{|N(p)|} \sum_{q \in N(p)} \hat{f}(q)},$$

where $N(p)$ is a set of prices near $p$.
For example, I experimented both with taking the $k$ nearest neighbors of $p$, and with performing a $k$-means clustering on the prices and taking the points in the same cluster as $p$, where I estimated the number of clusters $k$ using the gap statistic and prediction strength tests.
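A minimal sketch of the nearest-neighbor variant, again building on kde() above; the choice $k = 3$ is an assumption for illustration, not the value estimated by the gap statistic or prediction strength:

```r
# Compare p's density against the average density of its k nearest prices.
local_outlier_score <- function(p, prices, k = 3) {
  neighbors <- prices[order(abs(prices - p))[1:k]]
  kde(p, prices) / mean(sapply(neighbors, kde, prices = prices))
}
```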
Although this local method didn’t produce significantly better results than the version that ignores clusters in prices, perhaps due to the relative rarity of distinct price groups in small datasets, it should become more robust as the number of online retailers and prices per item grows.