κ-Center Clustering with Outliers in the MPC and Streaming Model

Onderzoeksoutput: Hoofdstuk in Boek/Rapport/CongresprocedureConferentiebijdrageAcademicpeer review

6 Citaten (Scopus)

Samenvatting

Given a point set P ⊆ X of size n in a metric space (X, dist) of doubling dimension d and two parameters k ∈ ℕ and z ∈ ℕ, the k-center problem with z outliers asks to return a set (Equation Presented)of k centers such that the maximum distance of all but z points of P to their nearest center in C∗ is minimized. An (ϵ, k, z)-coreset for this problem is a weighted point set P∗ such that an optimal solution for the k-center problem with z outliers on P∗ gives a (1 ± ϵ)-approximation for the k-center problem with z outliers on P. We study the construction of such coresets in the Massively Parallel Computing (MPC) model, and in the insertion-only as well as the fully dynamic streaming model. We obtain the following results, for any given 0 < ϵ ≤ 1: In all cases, the size of the computed coreset is O(k/ϵd + z).• In the MPC model the data are distributed over m machines. One is the coordinator machine, which will contain the final answer, the others are worker machines.We present a deterministic 2-round algorithm using (Equation Presented)machines, where the worker machines have (Equation Presented)local memory, and the coordinator has (Equation Presented local memory. The algorithm can handle point sets P that are distributed arbitrarily (possibly adversarially) over the machines. We also present a randomized algorithm that uses only a single round, under the assumption that the input set P is initially distributed randomly over the machines. Then we present a deterministic algorithm that obtains a trade-off between the number of rounds, R, and the storage per machine.In the streaming model we have a single machine with limited storage, and P is revealed in a streaming fashion.○ We present the first lower bound for the insertion-only streaming model, where the points arrive one by one and no points are deleted. We show that any deterministic algorithm that maintains an (ϵ, k, z)-coreset must use Ω(k/ϵd + z) space. We complement this by a deterministic streaming algorithm using O(k/ϵd + z) space, which is thus optimal. ○ For the fully dynamic data streams, where points can be inserted as well as deleted we give a randomized algorithm for point sets from a d-dimensional discrete Euclidean space [Δ]d, where Δ ∈ ℕ indicates the size of the universe from which the coordinates are taken. Our algorithm uses only O((k/ϵd + z)log4(kΔ/ϵδ)) space, and it is the first algorithm for this setting. We also present an Ω((k/ϵd)logΔ + z) lower bound for deterministic fully dynamic streaming algorithms. ○ For the sliding-window model, we show that any deterministic streaming algorithm that guarantees a (1 + ϵ)-approximation for the k-center problem with outliers in ℝd must use Ω((kz/ϵd) logσ) space, where σ is the ratio of the largest and smallest distance between any two points in the stream. This (negatively) answers a question posed by De Berg, Monemizadeh, and Zhong [1].

Originele taal-2Engels
TitelProceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023
UitgeverijInstitute of Electrical and Electronics Engineers
Pagina's853-863
Aantal pagina's11
ISBN van elektronische versie9798350337662
DOI's
StatusGepubliceerd - 2023
Evenement37th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023 - St. Petersburg, Verenigde Staten van Amerika
Duur: 15 mei 202319 mei 2023

Congres

Congres37th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023
Land/RegioVerenigde Staten van Amerika
StadSt. Petersburg
Periode15/05/2319/05/23

Bibliografische nota

Publisher Copyright:
© 2023 IEEE.

Vingerafdruk

Duik in de onderzoeksthema's van 'κ-Center Clustering with Outliers in the MPC and Streaming Model'. Samen vormen ze een unieke vingerafdruk.

Citeer dit