Abstract
Most convex and nonconvex clustering algorithms come with one crucial parameter: the k in k-means. To this day, there is not one generally accepted way to accurately determine this parameter. Popular methods are simple yet theoretically unfounded, such as searching for an elbow in the curve of a given cost-measure. In contrast, statistically founded methods often make strict assumptions over the data distribution or come with their own optimization scheme for the clustering objective. This limits either the set of applicable datasets or clustering algorithms. In this paper, we strive to determine the number of clusters by answering a simple question: given two clusters, is it likely that they jointly stem from a single distribution? To this end, we propose a bound on the probability that two clusters originate from the distribution of the unified cluster, specified only by the sample mean and variance. Our method is applicable as a simple wrapper to the result of any clustering method minimizing the objective of k-means, which includes Gaussian mixtures and Spectral Clustering. We focus in our experimental evaluation on an application for nonconvex clustering and demonstrate the suitability of our theoretical results.
Original language | English |
---|---|
Title of host publication | Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2019, Proceedings |
Subtitle of host publication | European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I |
Editors | Ulf Brefeld, Elisa Fromont, Andreas Hotho, Arno Knobbe, Marloes Maathuis, Céline Robardet |
Place of Publication | Cham |
Publisher | Springer Nature |
Pages | 257-273 |
Number of pages | 17 |
ISBN (Electronic) | 978-3-030-46150-8 |
ISBN (Print) | 978-3-030-46149-2 |
DOIs | |
Publication status | Published - 2020 |
Event | 2019 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2019) - Wurzburg, Germany Duration: 16 Sept 2019 → 20 Sept 2019 Conference number: 19 http://ecmlpkdd2019.org/ |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 11906 LNAI |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 2019 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2019) |
---|---|
Abbreviated title | ECML PKDD 2019 |
Country/Territory | Germany |
City | Wurzburg |
Period | 16/09/19 → 20/09/19 |
Internet address |
Keywords
- Concentration inequalities
- Model selection
- Nonparametric statistics
- One-cluster clustering
- Spectral clustering
- k-means