A Complete Linkage Algorithm for Clustering Dynamic Datasets

IF 0.8 | Zone 4, Multidisciplinary Journals | Q3 MULTIDISCIPLINARY SCIENCES

Proceedings of the National Academy of Sciences, India Section A: Physical Sciences Pub Date : 2024-09-25 DOI:10.1007/s40010-024-00894-8

Payel Banerjee, Amlan Chakrabarti, Tapas Kumar Ballabh


Citations: 0

Abstract

In recent years, a major challenge for data scientists has been analyzing very large volumes of data arriving at high speed. Such data are not only difficult to collect but also demand considerable time and memory to process. Clustering is a well-known way to address this problem: it helps shrink the database and yields valuable insights from a completely unlabelled dataset. Complete Linkage Clustering is a well-known hierarchical clustering algorithm suited to generating small, highly cohesive clusters, but it suffers from long convergence time. Traditional methods require the complete dataset in advance to make a clustering decision, which makes them unsuitable for large datasets and for dynamic datasets where new data points are added frequently, because every addition of data forces the entire dataset to be processed again. This paper presents a fast Complete Linkage Clustering algorithm that uses the triangle inequality to avoid many redundant distance calculations, making the algorithm faster and suitable for clustering both large and dynamic databases. Experiments were conducted on various real-world datasets, and the Adjusted Rand Index was used to compare the results with those of the original Complete Linkage algorithm. The experimental results confirm the effectiveness of the algorithm for both static and dynamic databases.
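The abstract does not spell out how the triangle inequality is applied, so the following is only an illustrative sketch and not the authors' algorithm. It assumes a simple incremental setting in which a new point joins a cluster only if its distance to every member is at most a threshold `tau` (the complete-linkage criterion). Each cluster keeps a hypothetical pivot point and caches the distance from the pivot to every member; the bound d(x, p) <= d(x, pivot) + d(pivot, p) then lets many exact distance computations be skipped. The names `Cluster`, `fits`, `cluster_stream`, and `tau` are assumptions introduced for this example.

```python
import math


class Cluster:
    """A cluster with a fixed pivot and cached pivot-to-member distances."""

    def __init__(self, pivot):
        self.pivot = pivot
        self.members = [pivot]
        self.pivot_dists = [0.0]  # d(pivot, member), stored once per member

    def add(self, point, d_pivot):
        self.members.append(point)
        self.pivot_dists.append(d_pivot)


def fits(cluster, x, tau, stats):
    """Return (ok, d(x, pivot)); ok is True iff max over members of d(x, p) <= tau."""
    d_xr = math.dist(x, cluster.pivot)
    stats["dist_calls"] += 1
    if d_xr > tau:
        # The pivot is itself a member, so the complete-linkage distance exceeds tau.
        return False, d_xr
    for p, d_rp in zip(cluster.members, cluster.pivot_dists):
        if d_xr + d_rp <= tau:
            # Triangle inequality: d(x, p) <= d(x, pivot) + d(pivot, p) <= tau,
            # so the exact distance to p is not needed.
            continue
        stats["dist_calls"] += 1
        if math.dist(x, p) > tau:
            return False, d_xr
    return True, d_xr


def cluster_stream(points, tau):
    """Assign points one at a time; earlier points are never reprocessed."""
    clusters = []
    stats = {"dist_calls": 0}
    for x in points:
        for c in clusters:
            ok, d_xr = fits(c, x, tau, stats)
            if ok:
                c.add(x, d_xr)
                break
        else:
            # No existing cluster accepts x: start a new one with x as pivot.
            clusters.append(Cluster(x))
    return clusters, stats["dist_calls"]


points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (0.05, 0.05)]
clusters, n_calls = cluster_stream(points, tau=1.0)
print([len(c.members) for c in clusters], n_calls)
```

In this toy run every accept or reject decision is settled by the single pivot distance, so no member-by-member distances are computed. How much work is saved on real data depends on `tau`, the pivot choice, and the data distribution; the sketch also uses a first-fit assignment rule, which is a simplification rather than a full hierarchical merge procedure.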


Source journal