Locally adaptive k-nearest neighbour classification
Abstract
Binary classification is one of the cornerstones of modern data science, but, until recently, our understanding of classical methods such as the k-nn algorithm was limited to settings where feature vectors were compactly supported. Based on a new analysis of this classifier, we propose a variant with significantly lower risk for heavy-tailed distributions.
The k-nearest neighbour classifier
The basic classifier that we consider here was introduced by Fix and Hodges (1951), and is arguably the simplest and most intuitive nonparametric classifier. For some fixed value of k, we classify a test point X to the class that is most prevalent among the k points in our training data which lie closest to X.
In the simple example in Figure 1 where the black point is to be classified, with k=1 we assign to the green class, with k=2 we assign to the red class (ties being broken in favour of the red class), and with k=3 we assign to the green class.
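To make the voting rule concrete, here is a minimal base-R sketch of the k-nearest-neighbour vote. The toy data and the helper name knn_vote are ours, purely for illustration; in practice one would normally use knn from the class package.

```r
# Minimal k-nearest-neighbour vote in base R (illustration only).
knn_vote <- function(train_x, train_y, test_x, k) {
  # Euclidean distance from the test point to every training point
  d <- sqrt(rowSums(sweep(train_x, 2, test_x)^2))
  # Labels of the k closest training points
  nn_labels <- train_y[order(d)[seq_len(k)]]
  # Majority vote (which.max breaks ties in favour of the first label in the table)
  names(which.max(table(nn_labels)))
}

set.seed(1)
train_x <- matrix(rnorm(20), ncol = 2)       # ten toy points in R^2
train_y <- rep(c("red", "green"), each = 5)  # two classes
knn_vote(train_x, train_y, test_x = c(0, 0), k = 3)
```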
The choice of k will clearly have a huge impact on the performance of the classifier. It is often chosen by cross-validation: over many splits of the data set into training and validation sets, we choose the value of k that gives the most accurate predictions overall.
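One common way to do this in R is with knn.cv from the class package, which returns leave-one-out cross-validated predictions. The grid 1:20 and the helper choose_k below are illustrative choices, not prescribed by the paper.

```r
library(class)  # provides knn() and knn.cv()

# Pick k by leave-one-out cross-validation over a candidate grid
# (the grid 1:20 is just an example).
choose_k <- function(train_x, train_y, k_grid = 1:20) {
  cv_error <- sapply(k_grid, function(k) {
    pred <- knn.cv(train = train_x, cl = train_y, k = k)
    mean(pred != train_y)
  })
  k_grid[which.min(cv_error)]
}
```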
Our main finding is that it is often possible to achieve better performance when the value of k is allowed to depend on the location of the test point. Indeed, for some constant B (which may be chosen by cross-validation), when n is the sample size, $\bar{f}$ is the marginal density of the data points, and x is the test point, we find that a choice of k roughly equal to

$k(x) = B\{n\bar{f}(x)\}^{4/(d+4)}$   (1)

where d is the dimension of the feature vectors, leads to significantly lower risk for heavy-tailed distributions.
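A rough sketch of the resulting locally adaptive classifier is given below. It assumes that the marginal density $\bar{f}$ is supplied as a function fbar (in practice it would typically be estimated, for example by a kernel density estimate), and the helper name local_knn is ours.

```r
library(class)  # for knn()

# Locally adaptive k-nn: k depends on the test point through the marginal
# density fbar, which is assumed to be supplied as a function of x.
local_knn <- function(train_x, train_y, test_x, B, fbar) {
  n <- nrow(train_x)
  d <- ncol(train_x)
  apply(test_x, 1, function(x) {
    # k(x) = B * {n * fbar(x)}^{4/(d+4)}, rounded and kept within [1, n]
    k_x <- round(B * (n * fbar(x))^(4 / (d + 4)))
    k_x <- min(max(k_x, 1), n)
    as.character(knn(train = train_x, test = matrix(x, nrow = 1),
                     cl = train_y, k = k_x))
  })
}
```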
Theoretical results
Given full knowledge of the underlying data-generating mechanism, the optimal decision rule is the Bayes classifier, which assigns points to the class with the largest posterior probability. As even this optimal classifier makes mistakes, we typically evaluate the performance of a data-driven classification rule by comparing it to the Bayes classifier. Given a classification rule $C : \mathbb{R}^d \to \{0,1\}$, define its excess risk to be

$E_P(C) = \mathbb{P}_P\{C(X) \neq Y\} - \mathbb{P}_P\{C^{\mathrm{Bayes}}(X) \neq Y\},$

where $C^{\mathrm{Bayes}}$ denotes the Bayes classifier.
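Written out explicitly, the Bayes classifier assigns class 1 exactly when the posterior probability of class 1 is at least one half:

$C^{\mathrm{Bayes}}(x) = \mathbb{1}\{\mathbb{P}(Y = 1 \mid X = x) \ge 1/2\}.$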
Our results hold over classes $\mathcal{P}_{d,\rho}$ of distributions of $(X,Y)$ on $\mathbb{R}^d \times \{0,1\}$ satisfying certain regularity conditions, including that they have twice-differentiable densities and a bounded $\rho$th moment. Ignoring sub-polynomial factors, we find that the standard fixed-k nearest neighbour classifier, trained on a data set of size n, satisfies

$\sup_{P \in \mathcal{P}_{d,\rho}} E_P(C^{k\mathrm{nn}}_n) = O\Bigl(\frac{1}{k} + \bigl(\tfrac{k}{n}\bigr)^{\min(4/d,\,\rho/(\rho+d))}\Bigr).$
On the other hand, we find that when k is chosen according to (1) above, we have
$\sup_{P \in \mathcal{P}_{d,\rho}} E_P(C^{k_{\mathrm{O}}\mathrm{nn}}_n) = O\bigl(n^{-\min(\rho/(\rho+d),\,4/(4+d))}\bigr).$
Simulation
To illustrate the potential benefits of the local procedure, consider the following simulation. Following Cannings, Berrett and Samworth (2020), we take $n_1 = n_0 = 100$,

$P_1 = t_5 \times t_5 \quad \text{and} \quad P_0 = N(1,1) \times t_5,$

where $t_5$ denotes the Student t distribution with 5 degrees of freedom.
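This set-up can be reproduced with a few lines of R (the seed and the variable names are ours):

```r
set.seed(1)
n1 <- 100
n0 <- 100

# Class 1: both coordinates are independent t_5 draws.
x1 <- cbind(rt(n1, df = 5), rt(n1, df = 5))
# Class 0: first coordinate N(1, 1), second coordinate t_5.
x0 <- cbind(rnorm(n0, mean = 1, sd = 1), rt(n0, df = 5))

train_x <- rbind(x1, x0)
train_y <- factor(rep(c("1", "0"), times = c(n1, n0)))
```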
The data are displayed in the left-most plot of Figure 2, together with vertical lines indicating the action of the Bayes classifier, which selects class 0 when a data point lies between the two lines and selects class 1 otherwise. First, we run the standard k-nearest neighbour classifier on the data, with the value of k chosen by leave-one-out cross-validation from {1,…,20}. The middle plot of Figure 2 shows those points of the data set for which the standard classifier classifies differently to the Bayes classifier. Finally, we run our local classifier, where we assume that $\bar{f}$ is known, and where the value of B is chosen by leave-one-out cross-validation on a grid of size 20.
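Continuing the sketch above, and reusing the choose_k and local_knn helpers defined earlier, one way to count disagreements with the Bayes classifier is shown below. Note that B = 1 is only a placeholder for the cross-validated choice described in the text, the equal-prior form of the Bayes rule follows from taking the two classes to be equally likely, and the in-sample evaluation here is a simplification.

```r
library(class)

# Bayes classifier for this set-up, assuming equal class priors: the second
# coordinate has the same t_5 distribution under both classes, so the rule
# depends only on the first coordinate.
bayes_pred <- ifelse(dnorm(train_x[, 1], mean = 1) > dt(train_x[, 1], df = 5),
                     "0", "1")

# Standard k-nn, with k chosen by leave-one-out cross-validation over 1:20
best_k   <- choose_k(train_x, train_y, k_grid = 1:20)
knn_pred <- as.character(knn.cv(train = train_x, cl = train_y, k = best_k))

# Locally adaptive classifier, using the true marginal density fbar
# (a 50/50 mixture of the two class densities) and a placeholder B = 1.
fbar <- function(x) {
  0.5 * dt(x[1], df = 5) * dt(x[2], df = 5) +
    0.5 * dnorm(x[1], mean = 1) * dt(x[2], df = 5)
}
local_pred <- local_knn(train_x, train_y, train_x, B = 1, fbar = fbar)

# Disagreements with the Bayes classifier on the observed points
c(standard = sum(knn_pred != bayes_pred),
  local    = sum(local_pred != bayes_pred))
```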

Perhaps the most striking aspect of the results is that the local classifier agrees with the Bayes classifier much more often than the standard classifier, with only 9 disagreements compared to the 43 disagreements of the standard classifier. Looking a little closer, we can see that the remaining mistakes that the local classifier makes are concentrated around the Bayes decision boundaries. Standard theoretical analysis of classification problems reveals that such points are typically the hardest to classify. Finally, we see that many, though by no means all, of the points at which the standard classifier makes a mistake appear in low-density regions, for example towards the bottom of the plots.