k is the magic number: inferring the number of clusters through nonparametric concentration inequalities

Research output: Chapter in Book/Report/Conference proceedingConference contributionAcademicpeer-review

3 Citations (Scopus)
3 Downloads (Pure)

Abstract

Most convex and nonconvex clustering algorithms come with one crucial parameter: the k in k-means. To this day, there is not one generally accepted way to accurately determine this parameter. Popular methods are simple yet theoretically unfounded, such as searching for an elbow in the curve of a given cost-measure. In contrast, statistically founded methods often make strict assumptions over the data distribution or come with their own optimization scheme for the clustering objective. This limits either the set of applicable datasets or clustering algorithms. In this paper, we strive to determine the number of clusters by answering a simple question: given two clusters, is it likely that they jointly stem from a single distribution? To this end, we propose a bound on the probability that two clusters originate from the distribution of the unified cluster, specified only by the sample mean and variance. Our method is applicable as a simple wrapper to the result of any clustering method minimizing the objective of k-means, which includes Gaussian mixtures and Spectral Clustering. We focus in our experimental evaluation on an application for nonconvex clustering and demonstrate the suitability of our theoretical results.
Original languageEnglish
Title of host publicationMachine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2019, Proceedings
Subtitle of host publicationEuropean Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I
EditorsUlf Brefeld, Elisa Fromont, Andreas Hotho, Arno Knobbe, Marloes Maathuis, Céline Robardet
Place of PublicationCham
PublisherSpringer Nature
Pages257-273
Number of pages17
ISBN (Electronic)978-3-030-46150-8
ISBN (Print)978-3-030-46149-2
DOIs
Publication statusPublished - 2020
Event2019 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2019) - Wurzburg, Germany
Duration: 16 Sept 201920 Sept 2019
Conference number: 19
http://ecmlpkdd2019.org/

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume11906 LNAI
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

Conference2019 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2019)
Abbreviated titleECML PKDD 2019
Country/TerritoryGermany
CityWurzburg
Period16/09/1920/09/19
Internet address

Keywords

  • Concentration inequalities
  • Model selection
  • Nonparametric statistics
  • One-cluster clustering
  • Spectral clustering
  • k-means

Fingerprint

Dive into the research topics of 'k is the magic number: inferring the number of clusters through nonparametric concentration inequalities'. Together they form a unique fingerprint.

Cite this