MapReduce algorithms for robust center-based clustering in doubling metrics

Impact Factor: 3.4 · CAS Tier 3 (Computer Science) · JCR Q1, Computer Science, Theory & Methods

Journal: Journal of Parallel and Distributed Computing
DOI: 10.1016/j.jpdc.2024.104966
Publication date: 2024-08-02
Article URL: https://www.sciencedirect.com/science/article/pii/S0743731524001308
Citations: 0

Abstract

Clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the (k,ℓ)-clustering problem, where, given a pointset P from a metric space, one must determine a subset S of k centers minimizing the sum of the ℓ-th powers of the distances of points in P from their closest centers. This formulation covers the well-studied k-median (ℓ=1) and k-means (ℓ=2) clustering problems. A more general variant, introduced to deal with noisy pointsets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the sum. We present a distributed coreset-based 3-round approximation algorithm for the (k,ℓ)-clustering problem with z outliers, using MapReduce as a computational model. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D. Remarkably, for D=O(1), our algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where γ can be made arbitrarily small. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for metrics with constant doubling dimension.
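To make the objective concrete, the following is a minimal Python sketch (not taken from the paper) that evaluates the (k,ℓ)-clustering cost with z outliers for a given candidate center set. The function and variable names are illustrative, and Euclidean distance is used here as a stand-in for a general doubling metric.

```python
import math

def clustering_cost(points, centers, ell, z):
    """Cost of the (k, ell)-clustering objective with z outliers:
    sum of the ell-th powers of point-to-nearest-center distances,
    after discarding the z points with the largest such distances."""
    # Distance from each point to its closest center.
    dists = [min(math.dist(p, c) for c in centers) for p in points]
    # Ignore the z farthest points (the outliers).
    dists.sort()
    kept = dists[:len(dists) - z] if z > 0 else dists
    return sum(d ** ell for d in kept)

# Toy example: k = 2 centers, ell = 2 (k-means-style objective), z = 1 outlier.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (100.0, 100.0)]
centers = [(0.05, 0.1), (5.1, 5.0)]
print(clustering_cost(points, centers, ell=2, z=1))  # the far-away point is discarded
```

Setting z = 0 recovers the standard k-median (ell = 1) or k-means (ell = 2) cost; increasing z lets the objective ignore the noisiest points, as described in the abstract.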

Source Journal
Journal of Parallel and Distributed Computing
Category: Engineering & Technology - Computer Science: Theory & Methods
CiteScore: 10.30
Self-citation rate: 2.60%
Articles per year: 172
Review time: 12 months
Journal description: This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing. The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics, again covering the full range from the design to the use of the targeted systems.