Coresets for kernel clustering
Shaofeng H.-C. Jiang, Robert Krauthgamer, Jianing Lou, Yubo Zhang
Machine Learning, published 2024-04-22. DOI: 10.1007/s10994-024-06540-z (https://doi.org/10.1007/s10994-024-06540-z)
Citations: 0
Abstract
We devise coresets for kernel \(k\)-Means with a general kernel, and use them to obtain new, more efficient algorithms. Kernel \(k\)-Means has superior clustering capability compared to classical \(k\)-Means, particularly when clusters are non-linearly separable, but it also introduces significant computational challenges. We address this computational issue by constructing a coreset, which is a reduced dataset that accurately preserves the clustering costs. Our main result is a coreset for kernel \(k\)-Means that works for a general kernel and has size \(\mathrm{poly}(k\epsilon^{-1})\). Our new coreset both generalizes and greatly improves all previous results; moreover, it can be constructed in time near-linear in \(n\). This result immediately implies new algorithms for kernel \(k\)-Means, such as a \((1+\epsilon)\)-approximation in time near-linear in \(n\), and a streaming algorithm using space and update time \(\mathrm{poly}(k\epsilon^{-1}\log n)\). We validate our coreset on various datasets with different kernels. Our coreset performs consistently well, achieving small errors while using very few points. We show that our coresets can speed up kernel \(\textsc{k-Means++}\) (the kernelized version of the widely used \(\textsc{k-Means++}\) algorithm), and we further use this faster kernel \(\textsc{k-Means++}\) for spectral clustering. In both applications, we achieve significant speedup and better asymptotic growth, while the error is comparable to baselines that do not use coresets.
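To make the phrase "a reduced dataset that accurately preserves the clustering costs" concrete, the following is a minimal Python sketch of how such a claim could be checked numerically. It is not the construction from the paper: a weighted uniform sample is used only as a stand-in summary, and a uniform sample carries no worst-case guarantee. The function names, the Gaussian (RBF) kernel, the value of `gamma`, and the synthetic data are all assumptions made for illustration. For simplicity the centers are restricted to images of input points, so the cost can be evaluated with kernel values alone.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A ** 2).sum(1)[:, None] - 2.0 * A @ B.T + (B ** 2).sum(1)[None, :]
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kernel_kmeans_cost(X, weights, centers, kernel):
    """Weighted kernel k-Means cost for centers phi(c_j), via the kernel trick.

    ||phi(x) - phi(c)||^2 = K(x, x) - 2 K(x, c) + K(c, c), so the feature
    map phi is never formed explicitly.
    """
    k_xx = np.diag(kernel(X, X))              # K(x_i, x_i)
    k_cc = np.diag(kernel(centers, centers))  # K(c_j, c_j)
    k_xc = kernel(X, centers)                 # K(x_i, c_j)
    d2 = k_xx[:, None] - 2.0 * k_xc + k_cc[None, :]
    # Each point pays its weight times the distance to its nearest center.
    return float(weights @ d2.min(axis=1))


def uniform_sample_summary(X, m, rng):
    """Hypothetical stand-in summary: m uniform samples, each weighted n / m.

    This is NOT the paper's coreset; it only illustrates the interface of a
    weighted subset whose cost is compared against the full data.
    """
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx], np.full(m, len(X) / m)


rng = np.random.default_rng(0)
# Two synthetic Gaussian blobs, 500 points each (illustrative data only).
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(500, 2))
               for c in ((0, 0), (3, 3))])

S, w = uniform_sample_summary(X, m=100, rng=rng)
centers = X[rng.choice(len(X), size=2, replace=False)]

full = kernel_kmeans_cost(X, np.ones(len(X)), centers, rbf_kernel)
small = kernel_kmeans_cost(S, w, centers, rbf_kernel)
rel_err = abs(small - full) / full
print(f"full cost = {full:.3f}, summary cost = {small:.3f}, "
      f"relative error = {rel_err:.3f}")
```

A coreset in the sense of the abstract would keep this relative error below \(\epsilon\) for every choice of \(k\) centers simultaneously, not just for the single center set tried here. Evaluating the cost on the summary touches only the summary's points, which is where the speedup described in the abstract would come from.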
About the journal:
Machine Learning serves as a global platform dedicated to computational approaches in learning. The journal reports substantial findings on diverse learning methods applied to various problems, offering support through empirical studies, theoretical analysis, or connections to psychological phenomena. It demonstrates the application of learning methods to solve significant problems and aims to enhance the conduct of machine learning research with a focus on verifiable and replicable evidence in published papers.