An improved K-means algorithm based on persistent homology
Authors: NingNing Peng, Shanjunshu Gao, Xingzi Yin, Xueyan Zhan
Journal of Computational Science, vol. 91, Article 102680. Published 2025-07-29. DOI: 10.1016/j.jocs.2025.102680
URL: https://www.sciencedirect.com/science/article/pii/S1877750325001577
The traditional K-means algorithm has several limitations: it is sensitive to the choice of initial centers, produces unstable clustering results, can converge to a locally optimal clustering, and often needs a large number of iterations. In this paper, we propose an improved clustering algorithm, PH-K-means, that uses persistent homology to identify k clusters in the data set. The algorithm calculates the length of the longest Betti number to obtain k Betti numbers, which represent the k clusters respectively. The data are then output as k Betti numbers, and the mean of the data in each Betti number is used as the initial center of the corresponding cluster. The algorithm iterates until the difference in the sum of squared errors between two consecutive clusterings is less than a threshold value. PH-K-means is tested on seven common data sets, and the results show that it achieves high accuracy and stable clustering results, and requires fewer iterations than traditional K-means, K-means++, and UK-means.
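The abstract does not say how the Betti numbers are computed or how k is read off from them, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes 0-dimensional persistent homology of a Vietoris-Rips filtration (where each finite bar ends when two connected components merge), and it assumes k is chosen at the largest gap between consecutive bar lengths. All function names (`h0_merges`, `ph_initial_centers`, `kmeans`) and the `tol` parameter are hypothetical.

```python
import math
from itertools import combinations


def _find(parent, i):
    # Union-find root lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def h0_merges(points):
    # Kruskal over all pairwise distances. Every accepted edge (d, i, j)
    # merges two components, i.e. ends one finite H0 bar at scale d.
    n = len(points)
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    parent = list(range(n))
    merges = []
    for d, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:
            parent[ri] = rj
            merges.append((d, i, j))
    return merges


def _mean(group):
    # Coordinate-wise average of a non-empty list of points.
    return tuple(sum(coord) / len(group) for coord in zip(*group))


def ph_initial_centers(points):
    # ASSUMPTION: k is read off the largest gap between consecutive bar
    # lengths; the abstract does not state the actual selection rule.
    # Needs at least 3 points so that at least one gap exists.
    merges = h0_merges(points)
    deaths = [d for d, _, _ in merges]
    gaps = [deaths[t + 1] - deaths[t] for t in range(len(deaths) - 1)]
    keep = max(range(len(gaps)), key=gaps.__getitem__) + 1
    parent = list(range(len(points)))
    for _, i, j in merges[:keep]:  # replay only the merges below the gap
        parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for idx, p in enumerate(points):
        groups.setdefault(_find(parent, idx), []).append(p)
    return [_mean(g) for g in groups.values()]


def kmeans(points, centers, tol=1e-6, max_iter=100):
    # Lloyd iterations; stop when the SSE changes by less than tol.
    prev_sse = math.inf
    for n_iter in range(1, max_iter + 1):
        labels = [min(range(len(centers)),
                      key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(_mean(members) if members else centers[c])
        centers = new_centers
        sse = sum(math.dist(p, centers[lab]) ** 2
                  for p, lab in zip(points, labels))
        if abs(prev_sse - sse) < tol:
            break
        prev_sse = sse
    return centers, labels, n_iter
```

Calling `kmeans(points, ph_initial_centers(points))` on a list of coordinate tuples runs the whole pipeline. Because the initial centers are a deterministic function of the data, repeated runs give the same result, which is the kind of stability the abstract attributes to PH-K-means. The all-pairs distance sort is quadratic in the number of points, so this sketch is only suitable for small data sets.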
About the journal:
Computational Science is a rapidly growing multi- and interdisciplinary field that uses advanced computing and data analysis to understand and solve complex problems. It has reached a level of predictive capability that now firmly complements the traditional pillars of experimentation and theory.
Recent advances in experimental techniques, such as detectors, on-line sensor networks, and high-resolution imaging, have opened up new windows into physical and biological processes at many levels of detail. The resulting data explosion allows for detailed data-driven modeling and simulation.
This new discipline in science combines computational thinking, modern computational methods, devices and collateral technologies to address problems far beyond the scope of traditional numerical methods.
Computational science typically unifies three distinct elements:
• Modeling, Algorithms and Simulations (e.g., numerical and non-numerical, discrete and continuous);
• Software developed to solve science (e.g., biological, physical, and social), engineering, medicine, and humanities problems;
• Computer and information science that develops and optimizes the advanced system hardware, software, networking, and data management components (e.g., problem solving environments).