{"title":"MapReduce algorithms for robust center-based clustering in doubling metrics","authors":"Enrico Dandolo , Alessio Mazzetto , Andrea Pietracaprina , Geppino Pucci","doi":"10.1016/j.jpdc.2024.104966","DOIUrl":null,"url":null,"abstract":"<div><p>Clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the <span><math><mo>(</mo><mi>k</mi><mo>,</mo><mi>ℓ</mi><mo>)</mo></math></span>-clustering problem, where, given a pointset <em>P</em> from a metric space, one must determine a subset <em>S</em> of <em>k</em> centers minimizing the sum of the <em>ℓ</em>-th powers of the distances of points in <em>P</em> from their closest centers. This formulation covers the well-studied <em>k</em>-median (<span><math><mi>ℓ</mi><mo>=</mo><mn>1</mn></math></span>) and <em>k</em>-means (<span><math><mi>ℓ</mi><mo>=</mo><mn>2</mn></math></span>) clustering problems. A more general variant, introduced to deal with noisy pointsets, features a further parameter <em>z</em> and allows up to <em>z</em> points of <em>P</em> (outliers) to be disregarded when computing the sum. We present a distributed coreset-based 3-round approximation algorithm for the <span><math><mo>(</mo><mi>k</mi><mo>,</mo><mi>ℓ</mi><mo>)</mo></math></span>-clustering problem with <em>z</em> outliers, using MapReduce as a computational model. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension <em>D</em>. Remarkably, for <span><math><mi>D</mi><mo>=</mo><mi>O</mi><mrow><mo>(</mo><mn>1</mn><mo>)</mo></mrow></math></span>, our algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term <span><math><mi>O</mi><mo>(</mo><mi>γ</mi><mo>)</mo></math></span> away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where <em>γ</em> can be made arbitrarily small. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for metrics with constant doubling dimension.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.4000,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524001308/pdfft?md5=cb18e100c10527217dd5c5739d4b41d9&pid=1-s2.0-S0743731524001308-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Parallel and Distributed Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0743731524001308","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Abstract
Clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the (k, ℓ)-clustering problem, where, given a pointset P from a metric space, one must determine a subset S of k centers minimizing the sum of the ℓ-th powers of the distances of points in P from their closest centers. This formulation covers the well-studied k-median (ℓ = 1) and k-means (ℓ = 2) clustering problems. A more general variant, introduced to deal with noisy pointsets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the sum. We present a distributed coreset-based 3-round approximation algorithm for the (k, ℓ)-clustering problem with z outliers, using MapReduce as a computational model. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D. Remarkably, for D = O(1), our algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where γ can be made arbitrarily small. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for metrics with constant doubling dimension.
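To make the objective described in the abstract concrete, the following minimal Python sketch evaluates the (k, ℓ)-clustering cost with z outliers for a fixed candidate set of centers. It assumes Euclidean distances and brute-force nearest-center search purely for illustration; the function and variable names are not taken from the paper.

```python
import math

def clustering_cost_with_outliers(points, centers, ell, z):
    """Illustrative (k, ell)-clustering objective with z outliers:
    the sum of the ell-th powers of each point's distance to its closest
    center, after discarding the z points with the largest such distances."""
    # Distance from each point to its closest center (Euclidean for concreteness).
    dists = [min(math.dist(p, c) for c in centers) for p in points]
    # Drop the z largest distances (the outliers), then sum the ell-th powers.
    kept = sorted(dists)[: max(len(dists) - z, 0)]
    return sum(d ** ell for d in kept)

# Tiny usage example: k = 2 centers, k-means-style objective (ell = 2), one outlier allowed.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (10.1, 10.0), (100.0, 100.0)]
centers = [(0.5, 0.0), (10.05, 10.0)]
print(clustering_cost_with_outliers(points, centers, ell=2, z=1))
```

Setting ℓ = 1 recovers the k-median objective, ℓ = 2 the k-means objective, and z = 0 the standard (non-robust) variant; in the example above, the far-away point (100, 100) is treated as the single allowed outlier and does not contribute to the cost.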
Journal introduction:
This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing.
The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics, again covering the full range from the design to the use of such systems.