Napsu Karmitsa, Ville-Pekka Eronen, Marko M. Mäkelä, Tapio Pahikkala, Antti Airola
Pattern Recognition, Volume 165, Article 111654. Published 2025-04-05. DOI: 10.1016/j.patcog.2025.111654. Impact factor: 7.5. JCR: Q1 (Computer Science, Artificial Intelligence). URL: https://www.sciencedirect.com/science/article/pii/S0031320325003140
Stochastic limited memory bundle algorithm for clustering in big data
Clustering is a crucial task in data mining and machine learning. In this paper, we propose an efficient algorithm, Big-Clust, for solving minimum sum-of-squares clustering problems in large and big datasets. We first develop a novel stochastic limited memory bundle algorithm (SLMBA) for large-scale nonsmooth finite-sum optimization problems and then formulate the clustering problem accordingly. The Big-Clust algorithm, a stochastic adaptation of the incremental clustering methodology, aims to find the global or a high-quality local solution for the clustering problem. It detects good starting points, i.e., initial cluster centers, for the SLMBA, which is applied as the underlying solver. We evaluate Big-Clust on several real-world datasets with numerous data points and features, comparing its performance with other clustering algorithms designed for large and big data. Numerical results demonstrate the efficiency of the proposed algorithm and the high quality of the solutions found, which are on par with those of the best existing methods.
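The abstract does not state the objective function, so the following is only a minimal sketch of the minimum sum-of-squares clustering objective as it is commonly written in nonsmooth finite-sum form: the average, over all data points, of the squared distance to the nearest cluster center. The inner minimum is what makes the function nonsmooth. The mini-batch estimate illustrates the general idea of evaluating a finite sum on a random subset. It is not SLMBA or Big-Clust, and all function names are hypothetical.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mssc_objective(centers: List[Point], data: List[Point]) -> float:
    """Average squared distance from each data point to its nearest center.

    Assumed standard formulation, not taken from the paper:
    f(x_1..x_k) = (1/m) * sum_i min_j ||x_j - a_i||^2
    """
    return sum(min(sq_dist(c, a) for c in centers) for a in data) / len(data)


def minibatch_objective(
    centers: List[Point], data: List[Point], batch_size: int, rng: random.Random
) -> float:
    """Estimate the objective on a random subset of the data (illustrative only)."""
    batch = rng.sample(data, batch_size)
    return mssc_objective(centers, batch)


# Tiny example: two well-separated pairs of points and one center per pair.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]

full_value = mssc_objective(centers, data)
batch_value = minibatch_objective(centers, data, 2, random.Random(0))
print(full_value, batch_value)
```

In this toy dataset every point lies at squared distance 1 from its nearest center, so the full objective and any mini-batch estimate coincide. On real data the mini-batch value is only a noisy estimate of the full sum.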
About the journal:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.