Tuning-free sparse clustering via alternating hard-thresholding

Impact factor: 1.4 · Zone 3 (Mathematics) · JCR Q2 (Statistics & Probability)

Journal of Multivariate Analysis Pub Date : 2024-05-15 DOI:10.1016/j.jmva.2024.105330

Wei Dong, Chen Xu, Jinhan Xie, Niansheng Tang

Journal of Multivariate Analysis, Volume 203, Article 105330. Full text: https://www.sciencedirect.com/science/article/pii/S0047259X2400037X

Citations: 0

Abstract

Model-based clustering is a commonly used technique for partitioning heterogeneous data into homogeneous groups. When the analysis involves a large number of features, analysts face simultaneous challenges in model interpretability, clustering accuracy, and computational efficiency. Several Bayesian and penalization methods have been proposed to select important features for model-based clustering. However, the performance of those methods relies on careful algorithmic tuning, which can be time-consuming in high-dimensional cases. In this paper, we propose a new sparse clustering method based on alternating hard-thresholding. The new method is conceptually simple and tuning-free. Given a user-specified sparsity level, it efficiently detects a set of key features by eliminating a large number of features that are less useful for clustering. Based on the selected key features, one can readily obtain an effective clustering of the original high-dimensional data under a general sparse covariance structure. Under mild conditions, we show that the new method yields clusters whose misclassification rate is consistent with the optimal rate, as if the underlying true model were used. The promising performance of the new method is supported by both simulated and real data examples.
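The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea of alternating between a clustering step and a hard-thresholding step at a user-specified sparsity level `s`. It is not the authors' procedure. Everything specific here is an assumption made for illustration: the function names, the k-means-style assignment, the between/within variation score, and the farthest-first initialization. It also ignores the sparse covariance structure the paper allows for.

```python
import numpy as np


def hard_threshold_topk(scores, s):
    """Keep the indices of the s largest scores; all other features are dropped."""
    keep = np.argpartition(scores, -s)[-s:]
    return np.sort(keep)


def farthest_first_centers(X, k, rng):
    """Pick k rows of X as initial centers, each far from those already chosen."""
    centers = [X[rng.integers(X.shape[0])]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    return np.array(centers)


def assign(X, centers, active):
    """Label each row by its nearest center, using only the active features."""
    Xa, Ca = X[:, active], centers[:, active]
    d2 = ((Xa[:, None, :] - Ca[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def sparse_cluster_sketch(X, k, s, n_iter=50, seed=0):
    """Alternate between updating clusters and keeping the s best features."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)              # center so the grand mean is zero
    active = np.arange(X.shape[1])      # start from all features
    centers = farthest_first_centers(X, k, rng)
    labels = assign(X, centers, active)
    for _ in range(n_iter):
        # Step 1: update cluster means with the labels held fixed.
        for c in range(k):
            members = labels == c
            if members.any():           # an empty cluster keeps its old center
                centers[c] = X[members].mean(axis=0)
        # Step 2: score each feature by between/within variation (assumed
        # criterion), then hard-threshold to the s highest-scoring features.
        counts = np.bincount(labels, minlength=k)
        between = (counts[:, None] * centers ** 2).sum(axis=0)
        within = ((X - centers[labels]) ** 2).sum(axis=0)
        new_active = hard_threshold_topk(between / (within + 1e-12), s)
        # Step 3: reassign labels using only the surviving features.
        new_labels = assign(X, centers, new_active)
        if np.array_equal(new_active, active) and np.array_equal(new_labels, labels):
            break
        active, labels = new_active, new_labels
    return labels, active
```

Calling `sparse_cluster_sketch(X, k=2, s=3)` on an `n × p` array returns a label per row and the sorted indices of the retained features. The only input beyond the data and the number of clusters is the sparsity level `s`, which mirrors the "user-specified sparsity level" described above; there is no penalty parameter to tune.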


Source journal

Journal of Multivariate Analysis (Mathematics: Statistics & Probability)

CiteScore: 2.40

Self-citation rate: 25.00%

Articles published: 108

Review time: 74 days

About the journal: Founded in 1971, the Journal of Multivariate Analysis (JMVA) is the central venue for the publication of new, relevant methodology and particularly innovative applications pertaining to the analysis and interpretation of multidimensional data. The journal welcomes contributions to all aspects of multivariate data analysis and modeling, including cluster analysis, discriminant analysis, factor analysis, and multidimensional continuous or discrete distribution theory. Topics of current interest include, but are not limited to, inferential aspects of copula modeling, functional data analysis, graphical modeling, high-dimensional data analysis, image analysis, multivariate extreme-value theory, sparse modeling, and spatial statistics.