A Complete Linkage Algorithm for Clustering Dynamic Datasets
Payel Banerjee, Amlan Chakrabarti, Tapas Kumar Ballabh
Proceedings of the National Academy of Sciences, India Section A: Physical Sciences, 94(5), 471-486. Published 2024-09-25.
DOI: 10.1007/s40010-024-00894-8
https://link.springer.com/article/10.1007/s40010-024-00894-8
Abstract
In recent years, a central challenge for data scientists has been analyzing very large volumes of data that arrive at high speed. Such data are not only difficult to collect but also demand substantial time and memory to process. Clustering is a well-known way to address this problem: it helps shrink the database and also yields valuable insights from a completely unlabelled dataset. Complete Linkage Clustering is a well-known hierarchical clustering algorithm suited to generating small, highly cohesive clusters, but it suffers from a long convergence time. Traditional methods require the complete dataset in advance to make a clustering decision, which makes them unsuitable both for large datasets and for dynamic datasets to which new data points are added frequently, because every addition of data requires the entire dataset to be processed again. This paper presents a fast Complete Linkage Clustering algorithm that uses the triangle inequality to avoid many redundant distance calculations, making it faster and suitable for clustering both large and dynamic databases. Experiments were conducted on various real-world datasets, and the Adjusted Rand Index was used to compare the results with those of the original Complete Linkage algorithm. The experimental results confirm the effectiveness of the algorithm for both static and dynamic databases.
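The abstract does not spell out how the triangle inequality removes distance calculations, so the following is only an illustrative sketch and not the authors' algorithm. It assumes a simplified streaming setting in which a new point joins the first cluster whose complete-linkage distance to it (the maximum distance to any member) stays within a fixed threshold. The names `Cluster`, `fits`, `insert_point`, the `threshold` parameter, and the pivot-based caching are all hypothetical choices made for this example. Each cluster caches d(pivot, m) for its members m, so for a new point x the bounds |d(x, pivot) - d(pivot, m)| <= d(x, m) <= d(x, pivot) + d(pivot, m) can often accept or reject the cluster after a single distance computation.

```python
import math


class Cluster:
    """Hypothetical cluster that caches d(pivot, member) for every member."""

    def __init__(self, pivot):
        self.pivot = pivot
        self.members = [pivot]
        self.pivot_dists = [0.0]  # d(pivot, pivot) = 0

    def add(self, point, dist_to_pivot):
        self.members.append(point)
        self.pivot_dists.append(dist_to_pivot)


def fits(cluster, x, threshold, stats):
    """Return (ok, d(x, pivot)); ok means max_m d(x, m) <= threshold."""
    d_xp = math.dist(x, cluster.pivot)
    stats["distance_calls"] += 1

    # Upper bound: d(x, m) <= d(x, p) + d(p, m) for every member m.
    if d_xp + max(cluster.pivot_dists) <= threshold:
        return True, d_xp

    # Lower bound: d(x, m) >= |d(x, p) - d(p, m)| for every member m.
    if max(abs(d_xp - d_pm) for d_pm in cluster.pivot_dists) > threshold:
        return False, d_xp

    # Bounds are inconclusive: compute exact distances only where needed.
    for member, d_pm in zip(cluster.members, cluster.pivot_dists):
        if d_xp + d_pm <= threshold:
            continue  # this member is provably close enough
        stats["distance_calls"] += 1
        if math.dist(x, member) > threshold:
            return False, d_xp
    return True, d_xp


def insert_point(clusters, x, threshold, stats):
    """Add x to the first cluster it fits, or start a new cluster."""
    for cluster in clusters:
        ok, d_xp = fits(cluster, x, threshold, stats)
        if ok:
            cluster.add(x, d_xp)
            return cluster
    new_cluster = Cluster(x)
    clusters.append(new_cluster)
    return new_cluster


if __name__ == "__main__":
    points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (0.5, 0.5), (10, 11)]
    clusters, stats = [], {"distance_calls": 0}
    for p in points:
        insert_point(clusters, p, threshold=2.0, stats=stats)
    for c in clusters:
        print(c.members)
    print("distance computations:", stats["distance_calls"])
```

Because a point is accepted only when every member is provably or actually within the threshold, each cluster's diameter never exceeds the threshold, which is the complete-linkage criterion. New points are handled without reprocessing earlier ones, in the spirit of the dynamic setting described above, though a full hierarchical algorithm would also need to maintain merge order.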