Coresets for kernel clustering

IF 4.3 · CAS Tier 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)
Shaofeng H. -C. Jiang, Robert Krauthgamer, Jianing Lou, Yubo Zhang
DOI: 10.1007/s10994-024-06540-z
Journal: Machine Learning · Published: 2024-04-22
Citations: 0

Abstract

We devise coresets for kernel \(k\)-Means with a general kernel, and use them to obtain new, more efficient algorithms. Kernel \(k\)-Means has superior clustering capability compared to classical \(k\)-Means, particularly when clusters are non-linearly separable, but it also introduces significant computational challenges. We address this computational issue by constructing a coreset: a reduced dataset that accurately preserves the clustering costs. Our main result is a coreset for kernel \(k\)-Means that works for a general kernel and has size \(\mathrm{poly}(k\epsilon^{-1})\). Our new coreset both generalizes and greatly improves all previous results; moreover, it can be constructed in time near-linear in \(n\). This result immediately implies new algorithms for kernel \(k\)-Means, such as a \((1+\epsilon)\)-approximation in time near-linear in \(n\), and a streaming algorithm using space and update time \(\mathrm{poly}(k\epsilon^{-1}\log n)\). We validate our coreset on various datasets with different kernels. Our coreset performs consistently well, achieving small errors while using very few points. We show that our coresets can speed up kernel k-Means++ (the kernelized version of the widely used k-Means++ algorithm), and we further use this faster kernel k-Means++ for spectral clustering. In both applications, we achieve significant speedup and better asymptotic growth, while the error is comparable to baselines that do not use coresets.
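To make the coreset idea concrete, the sketch below evaluates the weighted kernel \(k\)-Means cost via the kernel trick and compares the full-data cost against the cost computed on a small weighted subset. This is a hypothetical illustration only: it uses an RBF kernel and plain uniform sampling with weights \(n/m\) as a stand-in for the paper's construction, which is more sophisticated and comes with provable \((1\pm\epsilon)\) guarantees.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def weighted_kernel_cost(X, w, labels, k, gamma=1.0):
    # Weighted kernel k-Means cost, computed entirely through the kernel trick:
    # for each cluster C, cost(C) = sum_{x in C} w_x K(x,x)
    #                               - (sum_{x,y in C} w_x w_y K(x,y)) / sum_{x in C} w_x,
    # which equals the weighted sum of squared distances to the cluster's
    # (implicit) feature-space centroid.
    K = rbf_kernel(X, X, gamma)
    cost = 0.0
    for c in range(k):
        idx = np.where(labels == c)[0]
        if idx.size == 0:
            continue
        wc = w[idx]
        Kc = K[np.ix_(idx, idx)]
        cost += wc @ np.diag(Kc) - (wc @ Kc @ wc) / wc.sum()
    return cost

rng = np.random.default_rng(0)
n, k = 500, 2
# Two well-separated Gaussian blobs with their ground-truth labels.
X = np.concatenate([rng.normal(0, 0.3, (n // 2, 2)),
                    rng.normal(2, 0.3, (n // 2, 2))])
labels = (np.arange(n) >= n // 2).astype(int)

# Full-data cost (unit weights) vs. the cost on a uniformly sampled
# weighted subset of m points, each carrying weight n/m.
full = weighted_kernel_cost(X, np.ones(n), labels, k)
m = 50
sample = rng.choice(n, size=m, replace=False)
approx = weighted_kernel_cost(X[sample], np.full(m, n / m), labels[sample], k)
print(full, approx)
```

Even this naive subsample tracks the full clustering cost on easy data; the point of the paper's coreset is that a \(\mathrm{poly}(k\epsilon^{-1})\)-size weighted set achieves such approximation for *every* candidate clustering and a general kernel, not just the one clustering evaluated here.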


Source journal: Machine Learning (Engineering/Technology – Computer Science: Artificial Intelligence)

CiteScore: 11.00
Self-citation rate: 2.70%
Articles per year: 162
Review time: 3 months
Journal description: Machine Learning serves as a global platform dedicated to computational approaches in learning. The journal reports substantial findings on diverse learning methods applied to various problems, offering support through empirical studies, theoretical analysis, or connections to psychological phenomena. It demonstrates the application of learning methods to solve significant problems and aims to enhance the conduct of machine learning research with a focus on verifiable and replicable evidence in published papers.