Discriminative local affine-hull clustering for high-dimensional data

Impact factor: 4.5 | JCR Q2 | Computer Science, Theory & Methods

Array | Pub Date: 2025-07-23 | DOI: 10.1016/j.array.2025.100465

Yu-Feng Yu, Jiali Luo, Xuanyi Chen, Yingchao Cheng, Yulin He, Joshua Zhexue Huang

Array, Volume 27, Article 100465. https://www.sciencedirect.com/science/article/pii/S259000562500092X

Abstract

Clustering high-dimensional data presents a critical technical challenge due to the curse of dimensionality, feature redundancy, and sensitivity to noise—issues that significantly degrade clustering accuracy in applications such as gene expression analysis, image recognition, and anomaly detection. Existing solutions often rely on dimensionality reduction techniques that risk discarding discriminative features, or on deep learning methods that require large-scale training data and suffer from poor interpretability. To address these limitations, this study proposes a novel discriminative subspace clustering algorithm that avoids traditional dimensionality reduction and instead operates directly in the high-dimensional space. Our method partitions the sample space into multiple local affine hulls and introduces a discriminative geometric distance metric that accounts for both relevant and irrelevant subspaces. Specifically, the model measures the ratio between a query sample’s proximity to its class-specific affine hull and its distance from unrelated class subspaces. This dual-space modeling improves both intra-class compactness and inter-class separation. To ensure computational efficiency, we reformulate distance calculations as matrix multiplications and leverage SVD for subspace projection, enabling scalable performance across large datasets. Extensive experiments on seven benchmark datasets demonstrate that the proposed method consistently outperforms state-of-the-art clustering algorithms. It achieves up to 92.60% accuracy on MNIST and maintains high robustness on sparse and noisy data, validating its effectiveness for high-dimensional clustering tasks. This work contributes a geometrically interpretable and computationally efficient framework that closes a long-standing gap in unsupervised learning under high-dimensional constraints.
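
The geometric idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact form of the discriminative ratio, so the score used here (distance to one hull divided by the mean distance to the remaining hulls) is an assumption, as are the function names `affine_hull_basis`, `hull_distances`, and `discriminative_scores` and the `tol` and `eps` parameters. Each hull is modelled as the sample mean plus the span of the centered samples, with an orthonormal basis taken from the SVD, and all distances are computed with matrix products.

```python
import numpy as np


def affine_hull_basis(X, tol=1e-10):
    """Return (mu, U) describing the affine hull {mu + U @ c} of the rows of X.

    mu is the sample mean; U holds orthonormal directions (as columns)
    obtained from the SVD of the centered samples.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mu, full_matrices=False)
    # Keep only directions with a numerically non-zero singular value.
    rank = int(np.sum(s > tol * max(s.max(), 1.0)))
    return mu, vt[:rank].T


def hull_distances(Q, mu, U):
    """Euclidean distance from each row of Q to the affine hull (mu, U)."""
    D = np.asarray(Q, dtype=float) - mu
    # Remove the component lying inside the hull; the residual is the distance.
    residual = D - (D @ U) @ U.T
    return np.linalg.norm(residual, axis=1)


def discriminative_scores(Q, hulls, eps=1e-12):
    """Assumed ratio: distance to hull k over mean distance to all other hulls.

    Lower is better: close to hull k and far from the rest.
    """
    if len(hulls) < 2:
        raise ValueError("need at least two hulls to form a ratio")
    d = np.stack([hull_distances(Q, mu, U) for mu, U in hulls], axis=1)
    mean_other = (d.sum(axis=1, keepdims=True) - d) / (d.shape[1] - 1)
    return d / (mean_other + eps)


# Toy data (hypothetical): two parallel lines in 3-D, at y = 0 and y = 5.
hull_a = affine_hull_basis([[0, 0, 0], [1, 0, 0], [2, 0, 0]])
hull_b = affine_hull_basis([[0, 5, 0], [1, 5, 0], [3, 5, 0]])
queries = np.array([[10.0, 0.5, 0.0], [-4.0, 4.0, 0.0]])

scores = discriminative_scores(queries, [hull_a, hull_b])
labels = scores.argmin(axis=1)  # assign each query to the lowest-ratio hull
```

The first query lies far along the x-axis but only 0.5 away from the line at y = 0, so it is assigned to that hull even though it is far from every sample on it. That is the practical difference between a hull distance and a nearest-sample distance. How hull updates and assignments alternate in the full algorithm is not described in the abstract and is left out here.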
