Abstract
Let P be a set of n colored points. We develop efficient data structures that store P and can answer chromatic k-nearest neighbor (k-NN) queries. Such a query consists of a query point q and a number k, and asks for the color that appears most frequently among the k points in P closest to q. Answering such queries efficiently is the key to obtain fast k-NN classifiers. Our main aim is to obtain query times that are independent of k while using near-linear space.
We show that this is possible using a combination of two data structures. The first data structure allow us to compute a region containing exactly the k-nearest neighbors of a query point q, and the second data structure can then report the most frequent color in such a region. This leads to linear space data structures with query times of O(n^{1/2} log n) for points in ℝ¹, and with query times varying between O(n^{2/3} log^{2/3} n) and O(n^{5/6} polylog n), depending on the distance measure used, for points in ℝ². These results can be extended to work in higher dimensions as well. Since the query times are still fairly large we also consider approximations. If we are allowed to report a color that appears at least (1-ε)f^* times, where f^* is the frequency of the most frequent color, we obtain a query time of O(log n + log log_{1/(1-ε)} n) in ℝ¹ and expected query times ranging between Õ(n^{1/2} ε^{-3/2}) and Õ(n^{1/2} ε^{-5/2}) in ℝ² using near-linear space (ignoring polylogarithmic factors).
We show that this is possible using a combination of two data structures. The first data structure allow us to compute a region containing exactly the k-nearest neighbors of a query point q, and the second data structure can then report the most frequent color in such a region. This leads to linear space data structures with query times of O(n^{1/2} log n) for points in ℝ¹, and with query times varying between O(n^{2/3} log^{2/3} n) and O(n^{5/6} polylog n), depending on the distance measure used, for points in ℝ². These results can be extended to work in higher dimensions as well. Since the query times are still fairly large we also consider approximations. If we are allowed to report a color that appears at least (1-ε)f^* times, where f^* is the frequency of the most frequent color, we obtain a query time of O(log n + log log_{1/(1-ε)} n) in ℝ¹ and expected query times ranging between Õ(n^{1/2} ε^{-3/2}) and Õ(n^{1/2} ε^{-5/2}) in ℝ² using near-linear space (ignoring polylogarithmic factors).
Original language | English |
---|---|
Title of host publication | 30th Annual European Symposium on Algorithms (ESA 2022) |
Editors | Shiri Chechik, Gonzalo Navarro, Eva Rotenberg, Grzegorz Herman |
Publisher | Schloss Dagstuhl - Leibniz-Zentrum für Informatik |
Pages | 67:1-67:14 |
Number of pages | 14 |
ISBN (Electronic) | 978-3-95977-247-1 |
DOIs | |
Publication status | Published - 1 Sept 2022 |
Publication series
Name | Leibniz International Proceedings in Informatics (LIPIcs) |
---|---|
Publisher | Schloss Dagstuhl - Leibniz-Zentrum für Informatik |
Volume | 244 |
ISSN (Electronic) | 1868-8969 |
Keywords
- data structure
- nearest neighbor
- classification