{"title":"MapReduce algorithms for robust center-based clustering in doubling metrics","authors":"Enrico Dandolo , Alessio Mazzetto , Andrea Pietracaprina , Geppino Pucci","doi":"10.1016/j.jpdc.2024.104966","DOIUrl":null,"url":null,"abstract":"<div><p>Clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the <span><math><mo>(</mo><mi>k</mi><mo>,</mo><mi>ℓ</mi><mo>)</mo></math></span>-clustering problem, where, given a pointset <em>P</em> from a metric space, one must determine a subset <em>S</em> of <em>k</em> centers minimizing the sum of the <em>ℓ</em>-th powers of the distances of points in <em>P</em> from their closest centers. This formulation covers the well-studied <em>k</em>-median (<span><math><mi>ℓ</mi><mo>=</mo><mn>1</mn></math></span>) and <em>k</em>-means (<span><math><mi>ℓ</mi><mo>=</mo><mn>2</mn></math></span>) clustering problems. A more general variant, introduced to deal with noisy pointsets, features a further parameter <em>z</em> and allows up to <em>z</em> points of <em>P</em> (outliers) to be disregarded when computing the sum. We present a distributed coreset-based 3-round approximation algorithm for the <span><math><mo>(</mo><mi>k</mi><mo>,</mo><mi>ℓ</mi><mo>)</mo></math></span>-clustering problem with <em>z</em> outliers, using MapReduce as a computational model. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension <em>D</em>. Remarkably, for <span><math><mi>D</mi><mo>=</mo><mi>O</mi><mrow><mo>(</mo><mn>1</mn><mo>)</mo></mrow></math></span>, our algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term <span><math><mi>O</mi><mo>(</mo><mi>γ</mi><mo>)</mo></math></span> away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where <em>γ</em> can be made arbitrarily small. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for metrics with constant doubling dimension.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.4000,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0743731524001308/pdfft?md5=cb18e100c10527217dd5c5739d4b41d9&pid=1-s2.0-S0743731524001308-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Parallel and Distributed Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0743731524001308","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Abstract
Clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the (k, ℓ)-clustering problem, where, given a pointset P from a metric space, one must determine a subset S of k centers minimizing the sum of the ℓ-th powers of the distances of points in P from their closest centers. This formulation covers the well-studied k-median (ℓ = 1) and k-means (ℓ = 2) clustering problems. A more general variant, introduced to deal with noisy pointsets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the sum. We present a distributed coreset-based 3-round approximation algorithm for the (k, ℓ)-clustering problem with z outliers, using MapReduce as a computational model. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D. Remarkably, for D = O(1), our algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where γ can be made arbitrarily small. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for metrics with constant doubling dimension.
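To make the objective described in the abstract concrete, the following minimal Python sketch evaluates the (k, ℓ)-clustering cost with z outliers for a fixed candidate set of centers. It assumes Euclidean distances and brute-force nearest-center search purely for illustration; the function and variable names are not taken from the paper.

```python
import math

def clustering_cost_with_outliers(points, centers, ell, z):
    """Illustrative (k, ell)-clustering objective with z outliers:
    the sum of the ell-th powers of each point's distance to its closest
    center, after discarding the z points with the largest such distances."""
    # Distance from each point to its closest center (Euclidean for concreteness).
    dists = [min(math.dist(p, c) for c in centers) for p in points]
    # Drop the z largest distances (the outliers), then sum the ell-th powers.
    kept = sorted(dists)[: max(len(dists) - z, 0)]
    return sum(d ** ell for d in kept)

# Tiny usage example: k = 2 centers, k-means-style objective (ell = 2), one outlier allowed.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (10.1, 10.0), (100.0, 100.0)]
centers = [(0.5, 0.0), (10.05, 10.0)]
print(clustering_cost_with_outliers(points, centers, ell=2, z=1))
```

Setting ℓ = 1 recovers the k-median objective, ℓ = 2 the k-means objective, and z = 0 the standard (non-robust) variant; in the example above, the far-away point (100, 100) is treated as the single allowed outlier and does not contribute to the cost.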
Journal introduction:
This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing.
The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics, again covering the full range from the design to the use of such systems.