I describe a kernel density approach to outlier detection on small datasets. In particular, the data I model is the set of prices for a given item that can be found online.
Introduction
Suppose you’re searching online for the cheapest place to buy a stats book you need for class.
You initially find the following prices:
$50 - Amazon
$55 - Barnes & Noble
$52 - Walmart
$50 - Borders
You’re about to buy it from Amazon when you whimsically decide to try searching again. This time you find an eBay listing for the book at the low price of $30.
Should you buy it from eBay? Or is the listing just a scam? What if, instead, the price distribution of the other sellers were more widely dispersed?
$50 - Amazon
$75 - Barnes & Noble
$60 - Walmart
$90 - Borders
In other words, given the price distribution of the other sellers, you’d like to know how likely it is that a price of $30 is legitimate.
I describe a solution to this problem based on kernel density estimation.
Other Approaches
Let’s consider some other approaches first. Suppose we have a set of prices $p_1, \ldots, p_n$ for an item, and a new price $p$ whose legitimacy we want to assess.
A Normal Approach
One simple approach to determining how cautious we should be of the price $p$ is to assume the prices follow a normal distribution, estimate the mean and standard deviation of $p_1, \ldots, p_n$, and flag $p$ if it lies too far from that mean.
However, this approach is flawed for several reasons. First, what if the prices in our set are only a handful? With three or four data points, estimates of the mean and standard deviation are noisy at best. A simple heuristic (say, flag $p$ if it falls more than two standard deviations from the mean) also ignores the shape of the data: bimodal price distributions like
$20 - Amazon
$25 - Barnes & Noble
$22 - Walmart
$21 - Borders
$45 - eBay
$40 - Books Inc.
$43 - Waldenbooks
often arise, in which sellers seem to fall into two sets, one with higher prices and one with lower. (For example, perhaps a paperback version of a book was just released, and the megastore booksellers have slashed the price of the hardback version accordingly, while the small-scale booksellers have yet to react.)
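To make this failure concrete, here is a minimal R sketch of the standard-deviation heuristic applied to these prices (the two-standard-deviation cutoff is just the illustrative choice from above):

```r
# Standard-deviation heuristic: flag a price if it lies more than
# k standard deviations from the mean (k = 2 here, as in the text).
prices <- c(20, 25, 22, 21, 45, 40, 43)

flag_normal <- function(p, prices, k = 2) {
  abs(p - mean(prices)) / sd(prices) > k
}

flag_normal(10, prices)
# FALSE -- a $10 listing, less than half the cheapest observed price,
# is not flagged because the gap between the two clusters inflates sd(prices).
```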
In general, it is unclear what distribution prices fall into (they do not have a lognormal distribution either), so it seems that we must look at non-parametric approaches instead.
An IQR Approach
One non-parametric method (which we all learn in grade school) is the interquartile method, in which we calculate the interquartile range of the prices (the difference between their 75th and 25th percentiles) and flag $p$ if it falls more than $k$ times this range below the 25th percentile or above the 75th percentile, for some multiplier $k$.
Again, though, this method fails. For example, in the bimodal distribution above, the 75th percentile of the prices is 41.50 and the 25th percentile is 21.50, giving a rather large interquartile range of 20.00. Even with a conservative choice of $k = 1.5$, a price would have to fall below $21.50 - 1.5 \times 20.00 = -8.50$ to be flagged, so no price, however low, would ever be marked as an outlier.
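A quick check of those numbers in R (using R's default quantile rule and the usual 1.5 × IQR fence):

```r
# Interquartile-range rule on the bimodal prices above.
prices <- c(20, 25, 22, 21, 45, 40, 43)

q1  <- quantile(prices, 0.25)   # 21.5
q3  <- quantile(prices, 0.75)   # 41.5
iqr <- q3 - q1                  # 20

lower_fence <- q1 - 1.5 * iqr   # -8.5
5 < lower_fence                 # FALSE -- even a $5 listing passes
```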
We need, then, a non-parametric approach which also takes into account the dispersion and clusters of prices.
A Kernel Density Approach
Recall that the kernel density estimate of a price $p$, given a set of prices $p_1, \ldots, p_n$, is

$$\hat{f}(p) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{p - p_i}{h}\right),$$

where $K$ is a kernel (a symmetric function that integrates to one) and $h > 0$ is a smoothing parameter called the bandwidth.
In my tests, I used a Gaussian kernel, $K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, with a fixed bandwidth $h$.
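As a sketch, this estimator is a couple of lines of R; the bandwidth h = 5 below is an assumed value for illustration, not the one used in the original tests:

```r
# Kernel density estimate at a price p, given observed prices,
# using a Gaussian kernel; h = 5 is an assumed bandwidth.
kde <- function(p, prices, h = 5) {
  mean(dnorm((p - prices) / h)) / h
}

prices <- c(50, 55, 52, 50)
kde(51, prices)   # ~0.073: 51 sits right among the observed prices
kde(30, prices)   # ~1.5e-05: 30 is far from every observed price
```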
For example, given a set of original prices forming two clusters, one around $50 and one around $90, the kernel density estimate evaluated across a range of new prices has two humps around 50 and 90, matching the two clusters in our original set of prices.
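One way to reproduce that picture is with R's built-in density() function; the cluster prices below are made up to match the description, and bw = 5 is again an assumed bandwidth:

```r
# Hypothetical original prices with clusters near $50 and $90
# (invented to match the description above).
original <- c(48, 50, 52, 55, 88, 90, 92)

# Gaussian kernel density estimate over a grid of candidate prices.
d <- density(original, bw = 5, kernel = "gaussian", from = 20, to = 120)
plot(d, main = "Kernel density estimate of prices")
# The curve shows two humps, one near 50 and one near 90.
```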
Finally, to take care of cases where all prices were widely dispersed, I compared the kernel density estimate of the new price $p$ against the average kernel density estimate of the original prices $p_1, \ldots, p_n$ themselves, and flagged $p$ as an outlier if this ratio fell below a chosen threshold.
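Here is a sketch of that score, reusing the kde() helper above; the 0.1 flagging threshold is an assumed value:

```r
# Outlier score: density at the new price p divided by the average
# density at the observed prices themselves (kde() is defined above).
outlier_score <- function(p, prices, h = 5) {
  avg_density <- mean(sapply(prices, kde, prices = prices, h = h))
  kde(p, prices, h) / avg_density
}

prices <- c(50, 55, 52, 50)
outlier_score(51, prices)         # ~1: 51 is about as plausible as the observed prices
outlier_score(30, prices)         # ~0.0002: 30 sits in a very low-density region
outlier_score(30, prices) < 0.1   # TRUE -- flag it (0.1 is an assumed threshold)
```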
Adding Sophistication with Local Comparisons
While the attempt above accounts for price dispersion in our outlier score by dividing against the average kernel density estimate of all prices, a better approach (e.g., in the case of bimodal distributions) would be to take the average only over points near $p$, so that $p$ is judged against the prices in its own cluster rather than against every price.
For example, I experimented with taking the average over only the $k$ prices nearest to $p$ (and with similar variations on how the neighborhood is chosen).
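As a sketch of that idea, here is a version that normalizes by the average density at only the k nearest observed prices (the nearest-neighbor rule and k = 3 are my assumptions, reusing kde() from above):

```r
# Local variant: compare the density at p against the average density
# at only the k observed prices closest to p (k = 3 is an assumed value).
local_outlier_score <- function(p, prices, h = 5, k = 3) {
  nearest <- prices[order(abs(prices - p))][1:k]
  local_avg <- mean(sapply(nearest, kde, prices = prices, h = h))
  kde(p, prices, h) / local_avg
}

bimodal <- c(20, 25, 22, 21, 45, 40, 43)
local_outlier_score(19, bimodal)   # ~0.9: 19 looks ordinary next to the low-price cluster
local_outlier_score(30, bimodal)   # ~0.4: 30 falls in the gap between the two clusters
```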
Although this method didn’t produce significantly better results than the method that ignored clusters in prices, perhaps due to the relative rarity of distinct price groups in small datasets, the method should be more robust as the number of online retailers and prices per item grows.