Adaptive Neighbors Graph Learning for Large-Scale Data Clustering using Vector Quantization and Self-Regularization

IF 7.2 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Soft Computing Pub Date : 2024-09-19 DOI:10.1016/j.asoc.2024.112256


Citations: 0

Abstract

In traditional adaptive neighbors graph learning (ANGL)-based clustering, the time complexity is more than $O(n^{2})$, where $n$ is the number of data points, so the approach does not scale to large-scale data problems in real applications. In addition, ANGL adds a balance regularization term to its objective function to avoid the sparse over-fitting problem in the learned similarity graph matrix, yet this regularization may lead to many weak connections between data points in different clusters. To address these problems, we propose a new fast clustering method, namely, Adaptive Neighbors Graph Learning for Large-Scale Data Clustering using Vector Quantization and Self-Regularization (ANGL-LDC), which performs vector quantization (VQ) on the original data and feeds the resulting VQ data as the input to the $n \times n$ similarity graph matrix learning. Hence, the $n \times n$ similarity graph matrix learning problem is simplified to a weighted $m \times m$ ($m \ll n$) graph learning problem, where $m$ is the number of distinct points in the VQ data and the weight of each distinct point is the number of times it is duplicated in the VQ data. Consequently, the time complexity of ANGL-LDC is much lower than that of ANGL. At the same time, we propose a new ANGL objective function with a graph connection self-regularization mechanism: the ANGL-LDC objective function takes an infinite value if the value of a graph connection is equal to 1. Because the objective function is minimized, ANGL-LDC naturally avoids the sparse over-fitting problem. Experimental results on synthetic and real-world datasets demonstrate the scalability and effectiveness of ANGL-LDC.
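The reduction from n data points to m weighted distinct points can be illustrated with a minimal sketch. The abstract does not say which quantizer the authors use, so the uniform-grid rounding below is purely an assumption for illustration, and the names `quantize`, `pairwise_sq_dists`, and `step` are hypothetical. The sketch only shows how duplicate counts become weights and how the pairwise matrix shrinks from n × n to m × m; it does not implement the ANGL-LDC objective or its self-regularization.

```python
import numpy as np


def quantize(X, step):
    """Snap each row of X to a uniform grid of width `step`.

    Assumed quantizer for illustration only; not the paper's VQ scheme.
    Returns the distinct quantized points and how often each one occurs.
    """
    codes = np.round(X / step).astype(np.int64)
    distinct, counts = np.unique(codes, axis=0, return_counts=True)
    return distinct * step, counts


def pairwise_sq_dists(P):
    """Squared Euclidean distances between all rows of P (m x m)."""
    sq = np.sum(P * P, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (P @ P.T)
    return np.maximum(D, 0.0)  # clip tiny negatives from round-off


# Two synthetic Gaussian blobs, 10,000 points in total.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(5000, 2)),
    rng.normal(3.0, 0.3, size=(5000, 2)),
])

points, weights = quantize(X, step=0.25)
D = pairwise_sq_dists(points)

n, m = X.shape[0], points.shape[0]
print(f"n = {n}, m = {m}, distance matrix shape = {D.shape}")
```

Here `weights[i]` is the number of original points that collapsed onto `points[i]`, so the weights always sum to n. Any graph learning step that follows would operate on the m × m matrix together with these weights rather than on the full n × n matrix.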


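Source journal

Applied Soft Computing (Engineering & Technology, Computer Science: Interdisciplinary Applications)

CiteScore: 15.80

Self-citation rate: 6.90%

Articles published: 874

Review time: 10.9 months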

About the journal: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest quality research in application and convergence of the areas of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets and other similar techniques to address real-world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. Therefore, the website is continuously updated with new articles and the publication time is short.