An improved K-means algorithm based on persistent homology

IF 3.7 · Zone 3 (Computer Science) · JCR Q2, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science Pub Date : 2025-07-29 DOI:10.1016/j.jocs.2025.102680

NingNing Peng, Shanjunshu Gao, Xingzi Yin, Xueyan Zhan

Journal of Computational Science, Volume 91, Article 102680. https://www.sciencedirect.com/science/article/pii/S1877750325001577

Citations: 0

Abstract

The traditional K-means algorithm has several limitations, including sensitivity to the initial centers, unstable clustering results, convergence to locally optimal clusterings, and a large number of iterations. In this paper, we propose an improved clustering algorithm, PH-K-means, that uses persistent homology to identify k clusters in the data set. The algorithm calculates the lengths of the longest Betti numbers to obtain k Betti numbers, which represent the k clusters, respectively. The data are then grouped according to these k Betti numbers, and the mean of the data in each Betti number is used as the initial center of the corresponding cluster. The algorithm iterates until the difference in the sum of squared errors between two successive clusterings falls below a threshold. PH-K-means is tested on seven common data sets, and the results show that it achieves high accuracy and stable clustering results and requires fewer iterations than the traditional K-means, K-means++, UK-means, and K-means algorithms.
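The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the "Betti numbers" refer to the bars of the 0-dimensional persistence barcode of a Vietoris-Rips filtration, that k is chosen at the largest gap between consecutive bar lengths, and that stopping is based on the change in the sum of squared errors. All function names (`h0_merge_edges`, `ph_initial_centers`, `kmeans`) and the tolerance value are hypothetical.

```python
import math


class DisjointSet:
    """Minimal union-find used to track connected components."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def h0_merge_edges(points):
    """Kruskal's algorithm on the complete distance graph. Each accepted edge
    merges two components, so its weight is the death time of one finite
    0-dimensional bar (up to the filtration's scale convention)."""
    n = len(points)
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    dsu = DisjointSet(n)
    return [(w, i, j) for w, i, j in edges if dsu.union(i, j)]


def mean(vectors):
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def ph_initial_centers(points, k=None):
    """Initial centers from the k longest H0 bars (needs >= 3 points if k is None)."""
    n = len(points)
    merges = h0_merge_edges(points)  # ascending death times
    if k is None:
        # Assumption: pick k at the largest gap between consecutive bar lengths.
        deaths = [w for w, _, _ in merges]
        gaps = [b - a for a, b in zip(deaths, deaths[1:])]
        cut = max(range(len(gaps)), key=gaps.__getitem__)
        k = len(deaths) - cut  # long finite bars plus the one infinite bar
    dsu = DisjointSet(n)
    for _, i, j in merges[: n - k]:  # skip the k-1 latest merges
        dsu.union(i, j)
    groups = {}
    for idx, p in enumerate(points):
        groups.setdefault(dsu.find(idx), []).append(p)
    return [mean(g) for g in groups.values()]


def kmeans(points, centers, tol=1e-6, max_iter=100):
    """Lloyd iterations; stop when the SSE changes by less than tol."""
    centers = list(centers)
    prev_sse = math.inf
    for iteration in range(1, max_iter + 1):
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean(members)
        sse = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
        if abs(prev_sse - sse) < tol:
            break
        prev_sse = sse
    return labels, centers, iteration


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers0 = ph_initial_centers(points)
labels, centers, iterations = kmeans(points, centers0)
```

Because the initial centers are already the means of the components found by the barcode, the Lloyd loop on well-separated data has little left to do, which is consistent with the abstract's claim of fewer iterations. The all-pairs edge list is quadratic in the number of points, so this sketch is suitable only for small data sets.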




Source journal

Journal of Computational Science (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; COMPUTER SCIENCE, THEORY & METHODS)

CiteScore: 5.50

Self-citation rate: 3.00%

Articles published: 227

Review time: 41 days

About the journal: Computational Science is a rapidly growing multi- and interdisciplinary field that uses advanced computing and data analysis to understand and solve complex problems. It has reached a level of predictive capability that now firmly complements the traditional pillars of experimentation and theory. Recent advances in experimental techniques, such as detectors, on-line sensor networks, and high-resolution imaging, have opened up new windows into physical and biological processes at many levels of detail. The resulting data explosion allows for detailed data-driven modeling and simulation. This new discipline in science combines computational thinking, modern computational methods, devices, and collateral technologies to address problems far beyond the scope of traditional numerical methods. Computational science typically unifies three distinct elements:
• Modeling, Algorithms and Simulations (e.g., numerical and non-numerical, discrete and continuous);
• Software developed to solve science (e.g., biological, physical, and social), engineering, medicine, and humanities problems;
• Computer and information science that develops and optimizes the advanced system hardware, software, networking, and data management components (e.g., problem solving environments).