Anyone who’s taken a machine learning class is familiar with the nearest neighbors classifier—a simple, intuitive, aptly named concept that makes decisions based on proximity via a well-defined metric. Nearest neighbors often gets labeled as a “naive” baseline method, because of its simplicity and straightforwardness. Indeed, what clearer way is there of classifying an object than simply assigning to it the label of its closest friend? Even on the benchmark recognition dataset MNIST, the Euclidean L2-distance \(k\)-Nearest Neighbors (\(k\)-NN) classifier posts an error rate of 5% —not bad for a naive algorithm.
Of course, the main drawback of nearest neighbors is the computational burden of computing distances between many pairs of multidimensional vectors. For classification datasets that have orders of magnitudes more images than the MNIST dataset, nearest neighbors is simply not a viable solution. However, this changes when we enter the domain of few-shot learning, which asks a model to classify an object having seen only minimal training examples.
In their paper titled Revisiting Local Descriptor based Image-to-Class Measure for Few-shot Learning, Wenbin Li et al. revisit the \(k\)-NN classifier on the N-way K-shot recognition task. Based on observations of the strengths and weaknesses of the Naive Bayes Nearest Neighbors (NBNN) classifier described in the 2008 paper In Defense of Nearest-Neighbor Based Image Classification by O. Boiman, E. Shechtman and M. Irani, Wenbin Li et al. create a nearest neighbors classifier mounted on top of a convolutional neural network (CNN) without any fully connected layers. Approximately, their proposed Deep Nearest Neighbor Neural Network (DN4) learns to extract deep local descriptors in the CNN-based deep embedding module, and a subsequent image-to-class module constructs a class feature space from these local descriptors over all training images in a given class. Given a set of support images and a query image, we obtain a local descriptor-based representation of the query image as well as a local descriptor space for each class. The \(k\)-Nearest Neighbors of each local descriptor of the query image are computed in each class descriptor space, and the cosine similarity between the query descriptor and each of these neighbors is computed. The image-to-class similarity is then defined as the sum of all these cosine similarities. If the deep embedding module computes \(m\) local descriptors, for example, then the image-to-class similarity would be the sum of \(mk\) descriptor-level similarities computed by the image-to-class module.
The model was tested on four datasets: (1) miniImageNet, (2) Stanford Dogs, (3) Stanford Cars, and (4) CUB-200. The latter three are fine-grained datasets, in which the class specifications are much subtler than in a broad classification dataset such as ImageNet or COCO. The authors also compared the performance of the DN4 network against a variety of other models, including NBNN, standard \(k\)-NN, meta-learners, and other deep neural networks. On a 5-way 1-shot task, the DN4 model achieves 51.24% accuracy, a 0.8% improvement in accuracy over the next best model. In the 5-way 5-shot case (the support set is consists of five images per class), the DN4 model achieves 71.02% accuracy, which is more than 5% higher than the next best model. This is comparable to the accuracies achieved by the highest-performing meta-learning models. On the fine-grained datasets, the DN4 generally outperforms all other metric-based models tested by significant margins.
What I find compelling about this paper is that the DN4 model, while built on top of a CNN, shows how powerful the simple \(k\)-NN framework can be when it is given the proper similarity metric. The idea that similarity can be measured simply by “closeness” appears not only in nearest neighbors, but in other powerful dimensionality reduction methods in data science, such as diffusion maps and spectral clustering. After all, recognition at the highest levels is an approximation game—we recognize something because it is familiar or similar, and we are naturally inclined to search for these similarities when presented with a new item. Neural networks are undoubtedly powerful and flexible learners, but the intuition behind the crucial fully-connected layer is not so clear as the intuition behind nearest neighbors. I haven’t read so much of the literature on this, but I wonder if there are fundamental parallels between weight-learning in fully-connected layers and classification using nearest neighbors.