基于近似k近邻的快速密度峰聚类算法

IF 10.4 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering Pub Date : 2025-07-16 DOI:10.1109/TKDE.2025.3589794

Shifei Ding;Chao Li;Xiao Xu;Lili Guo;Ling Ding;Xindong Wu

{"title":"基于近似k近邻的快速密度峰聚类算法","authors":"Shifei Ding;Chao Li;Xiao Xu;Lili Guo;Ling Ding;Xindong Wu","doi":"10.1109/TKDE.2025.3589794","DOIUrl":null,"url":null,"abstract":"Density peaks clustering (DPC) is one of the density-based clustering algorithms and has been widely studied and applied in recent years because of its unique parameter, non-iteration and good robustness. However, it cannot effectively identify the cluster centers, and time and space complexities are too high. To this end, this paper proposes a fast density peaks clustering algorithm based on approximate <italic>k</i>-nearest neighbors (FDPAN). Firstly, it uses Balanced K-means based Hierarchical K-means (BKHK) method to partition the data and quickly find the approximate <italic>k</i>-nearest neighbors (AKNN), improving the algorithm’s efficiency on large-scale high-dimensional data. Meanwhile, three-way clustering is used to improve the neighbor search of the boundary points of the partition. Then, the local density and relative distance of DPC are recalculated by AKNN. Finally, according to the similar density chain, the connected high-density points are labeled while searching for the cluster center, and the remaining points are assigned to the clusters where their nearest higher-density points are located. Theoretical analysis and experiments on synthetic and real datasets show that FDPAN can obtain higher clustering results and shorten the operation time on large-scale high-dimensional data compared with DPC and its variants.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"5878-5889"},"PeriodicalIF":10.4000,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fast Density Peaks Clustering Algorithm Based on Approximate k-Nearest Neighbors\",\"authors\":\"Shifei Ding;Chao Li;Xiao Xu;Lili Guo;Ling Ding;Xindong Wu\",\"doi\":\"10.1109/TKDE.2025.3589794\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Density peaks clustering (DPC) is one of the density-based clustering algorithms and has been widely studied and applied in recent years because of its unique parameter, non-iteration and good robustness. However, it cannot effectively identify the cluster centers, and time and space complexities are too high. To this end, this paper proposes a fast density peaks clustering algorithm based on approximate <italic>k</i>-nearest neighbors (FDPAN). Firstly, it uses Balanced K-means based Hierarchical K-means (BKHK) method to partition the data and quickly find the approximate <italic>k</i>-nearest neighbors (AKNN), improving the algorithm’s efficiency on large-scale high-dimensional data. Meanwhile, three-way clustering is used to improve the neighbor search of the boundary points of the partition. Then, the local density and relative distance of DPC are recalculated by AKNN. Finally, according to the similar density chain, the connected high-density points are labeled while searching for the cluster center, and the remaining points are assigned to the clusters where their nearest higher-density points are located. Theoretical analysis and experiments on synthetic and real datasets show that FDPAN can obtain higher clustering results and shorten the operation time on large-scale high-dimensional data compared with DPC and its variants.\",\"PeriodicalId\":13496,\"journal\":{\"name\":\"IEEE Transactions on Knowledge and Data Engineering\",\"volume\":\"37 10\",\"pages\":\"5878-5889\"},\"PeriodicalIF\":10.4000,\"publicationDate\":\"2025-07-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Knowledge and Data Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11081885/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Knowledge and Data Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11081885/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

密度峰聚类（DPC）是一种基于密度的聚类算法，由于其参数独特、不迭代、鲁棒性好等优点，近年来得到了广泛的研究和应用。但是，它不能有效地识别集群中心，且时间和空间复杂度太高。为此，本文提出了一种基于近似k近邻（FDPAN）的快速密度峰聚类算法。首先，采用基于Balanced K-means的分层K-means （BKHK）方法对数据进行分割，快速找到近似k近邻（AKNN），提高了算法在大规模高维数据处理上的效率；同时，采用三向聚类方法改进了分区边界点的邻域搜索。然后，利用AKNN重新计算DPC的局部密度和相对距离。最后，根据相似密度链，在搜索聚类中心的同时对连通的高密度点进行标记，剩余的点被分配到离它们最近的高密度点所在的聚类中。在合成数据集和真实数据集上的理论分析和实验表明，与DPC及其变体相比，FDPAN在大规模高维数据上可以获得更高的聚类结果，并缩短运算时间。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Fast Density Peaks Clustering Algorithm Based on Approximate k-Nearest Neighbors

Density peaks clustering (DPC) is one of the density-based clustering algorithms and has been widely studied and applied in recent years because of its unique parameter, non-iteration and good robustness. However, it cannot effectively identify the cluster centers, and time and space complexities are too high. To this end, this paper proposes a fast density peaks clustering algorithm based on approximate k-nearest neighbors (FDPAN). Firstly, it uses Balanced K-means based Hierarchical K-means (BKHK) method to partition the data and quickly find the approximate k-nearest neighbors (AKNN), improving the algorithm’s efficiency on large-scale high-dimensional data. Meanwhile, three-way clustering is used to improve the neighbor search of the boundary points of the partition. Then, the local density and relative distance of DPC are recalculated by AKNN. Finally, according to the similar density chain, the connected high-density points are labeled while searching for the cluster center, and the remaining points are assigned to the clusters where their nearest higher-density points are located. Theoretical analysis and experiments on synthetic and real datasets show that FDPAN can obtain higher clustering results and shorten the operation time on large-scale high-dimensional data compared with DPC and its variants.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Knowledge and Data Engineering 工程技术-工程：电子与电气

CiteScore

11.70

自引率

3.40%

发文量

515

审稿时长

6 months

期刊介绍： The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools, and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.