Marigold: Efficient k-means Clustering in High Dimensions

Proc. VLDB Endow. Pub Date : 2023-03-01 DOI:10.14778/3587136.3587147

Kasper Overgaard Mortensen, Fatemeh Zardbani, M. A. Haque, S. Agustsson, D. Mottin, Philip Hofmann, Panagiotis Karras


Citations: 2

Abstract

How can we efficiently and scalably cluster high-dimensional data? The k-means algorithm clusters data by iteratively reducing intra-cluster Euclidean distances until convergence. While it finds applications from recommendation engines to image segmentation, its application to high-dimensional data is hindered by the need to repeatedly compute Euclidean distances among points and centroids. In this paper, we propose Marigold (k-means for high-dimensional data), a scalable algorithm for k-means clustering in high dimensions. Marigold prunes distance calculations by means of (i) a tight distance-bounding scheme; (ii) a stepwise calculation over a multiresolution transform; and (iii) exploiting the triangle inequality. To our knowledge, such an arsenal of pruning techniques has not been hitherto applied to k-means. Our work is motivated by time-critical Angle-Resolved Photoemission Spectroscopy (ARPES) experiments, where it is vital to detect clusters among high-dimensional spectra in real time. In a thorough experimental study with real-world data sets we demonstrate that Marigold efficiently clusters high-dimensional data, achieving approximately one order of magnitude improvement over prior art.
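To illustrate the third pruning idea named in the abstract, the sketch below applies the classic triangle-inequality test to the k-means assignment step. It is not Marigold's implementation: the abstract does not specify its bounding scheme or multiresolution transform, so neither is shown here. The function names (`assign_with_pruning`, `assign_brute_force`) and the toy data are assumptions made for illustration only.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid, skipping centroids
    that the triangle inequality rules out.

    Returns (labels, n_skipped), where n_skipped counts the
    point-to-centroid distances that were never computed.
    (Hypothetical helper, not taken from the paper.)
    """
    k = len(centroids)
    # Centroid-to-centroid distances, computed once per assignment pass.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)] for i in range(k)]
    labels, n_skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centroids[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(p, c_best), then
            # d(p, c_j) >= d(c_best, c_j) - d(p, c_best) >= d(p, c_best),
            # so c_j cannot be strictly closer and is skipped.
            if cc[best][j] >= 2.0 * best_d:
                n_skipped += 1
                continue
            d = dist(p, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, n_skipped

def assign_brute_force(points, centroids):
    """Reference assignment that computes every distance."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]

# Toy data: two well-separated groups in three dimensions.
points = [[0.0, 0.1, 0.0], [0.2, 0.0, 0.1], [10.0, 10.1, 9.9], [9.8, 10.0, 10.2]]
centroids = [[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]]
labels, n_skipped = assign_with_pruning(points, centroids)
```

On this toy input the pruned pass gives the same labels as the brute-force pass while skipping the distance to the far centroid for the two points that already sit close to the first centroid. The saving per skipped distance grows with the dimensionality, since each distance costs time proportional to the number of dimensions.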

