Samenvatting
Let P be a set of n colored points. We develop efficient data structures that store P and can answer chromatic k-nearest neighbor (k-NN) queries. Such a query consists of a query point q and a number k, and asks for the color that appears most frequently among the k points in P closest to q. Answering such queries efficiently is the key to obtain fast k-NN classifiers. Our main aim is to obtain query times that are independent of k while using near-linear space.
We show that this is possible using a combination of two data structures. The first data structure allow us to compute a region containing exactly the k-nearest neighbors of a query point q, and the second data structure can then report the most frequent color in such a region. This leads to linear space data structures with query times of O(n^{1/2} log n) for points in ℝ¹, and with query times varying between O(n^{2/3} log^{2/3} n) and O(n^{5/6} polylog n), depending on the distance measure used, for points in ℝ². These results can be extended to work in higher dimensions as well. Since the query times are still fairly large we also consider approximations. If we are allowed to report a color that appears at least (1-ε)f^* times, where f^* is the frequency of the most frequent color, we obtain a query time of O(log n + log log_{1/(1-ε)} n) in ℝ¹ and expected query times ranging between Õ(n^{1/2} ε^{-3/2}) and Õ(n^{1/2} ε^{-5/2}) in ℝ² using near-linear space (ignoring polylogarithmic factors).
We show that this is possible using a combination of two data structures. The first data structure allow us to compute a region containing exactly the k-nearest neighbors of a query point q, and the second data structure can then report the most frequent color in such a region. This leads to linear space data structures with query times of O(n^{1/2} log n) for points in ℝ¹, and with query times varying between O(n^{2/3} log^{2/3} n) and O(n^{5/6} polylog n), depending on the distance measure used, for points in ℝ². These results can be extended to work in higher dimensions as well. Since the query times are still fairly large we also consider approximations. If we are allowed to report a color that appears at least (1-ε)f^* times, where f^* is the frequency of the most frequent color, we obtain a query time of O(log n + log log_{1/(1-ε)} n) in ℝ¹ and expected query times ranging between Õ(n^{1/2} ε^{-3/2}) and Õ(n^{1/2} ε^{-5/2}) in ℝ² using near-linear space (ignoring polylogarithmic factors).
Originele taal-2 | Engels |
---|---|
Titel | 30th Annual European Symposium on Algorithms (ESA 2022) |
Redacteuren | Shiri Chechik, Gonzalo Navarro, Eva Rotenberg, Grzegorz Herman |
Uitgeverij | Schloss Dagstuhl - Leibniz-Zentrum für Informatik |
Pagina's | 67:1-67:14 |
Aantal pagina's | 14 |
ISBN van elektronische versie | 978-3-95977-247-1 |
DOI's | |
Status | Gepubliceerd - 1 sep. 2022 |
Publicatie series
Naam | Leibniz International Proceedings in Informatics (LIPIcs) |
---|---|
Uitgeverij | Schloss Dagstuhl - Leibniz-Zentrum für Informatik |
Volume | 244 |
ISSN van elektronische versie | 1868-8969 |