MODEL ASSISTED VARIABLE CLUSTERING: MINIMAX-OPTIMAL RECOVERY AND ALGORITHMS.

IF 3.2 · Region 1 (Mathematics) · Q1 Statistics & Probability

Annals of Statistics Pub Date : 2020-02-01 Epub Date: 2020-02-17 DOI:10.1214/18-aos1794

Florentina Bunea, Christophe Giraud, Xi Luo, Martin Royer, Nicolas Verzelen

Annals of Statistics, pp. 111-137. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9286061/pdf/nihms-1765231.pdf

Citations: 23

Abstract

The problem of variable clustering is that of estimating groups of similar components of a p-dimensional vector X = (X_1, …, X_p) from n independent copies of X. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population-level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
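The similarity notion in the abstract (two variables belong together when they have similar associations with all other variables) can be illustrated with a small population-level sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the covariance is written as Sigma = A C A^T + diag(gamma), which is one way to encode "noise-corrupted versions of the same latent factor", and the function names `block_covariance` and `cod` as well as the max-difference formula are hypothetical choices.

```python
import numpy as np

def block_covariance(labels, C, gamma):
    """Build Sigma = A C A^T + diag(gamma).

    labels : cluster index of each variable (length p)
    C      : K x K covariance of the latent factors (assumed form)
    gamma  : per-variable noise variances (length p)
    """
    labels = np.asarray(labels)
    K = C.shape[0]
    A = np.eye(K)[labels]  # p x K membership matrix with 0/1 entries
    return A @ C @ A.T + np.diag(gamma)

def cod(sigma, a, b):
    """Largest gap between the covariances of a and b with any third variable.

    Illustrative difference measure: max over c not in {a, b}
    of |sigma[a, c] - sigma[b, c]|.
    """
    mask = np.ones(sigma.shape[0], dtype=bool)
    mask[[a, b]] = False  # exclude a and b themselves
    return np.max(np.abs(sigma[a, mask] - sigma[b, mask]))

# Six variables in two clusters of three; noise levels differ within a cluster.
labels = [0, 0, 0, 1, 1, 1]
C = np.array([[1.0, 0.3],
              [0.3, 2.0]])
gamma = np.array([0.5, 1.0, 1.5, 0.5, 1.0, 1.5])

sigma = block_covariance(labels, C, gamma)
within = cod(sigma, 0, 1)   # variables 0 and 1 share a cluster
between = cod(sigma, 0, 3)  # variables 0 and 3 are in different clusters
print("within-cluster difference:", within)
print("between-cluster difference:", between)
```

Under these assumptions the within-cluster difference is zero at the population level even though the noise variances differ, because the measure ignores the diagonal entries of the two variables being compared. With sample covariances the within-cluster value would only be small, which is why the separation between clusters matters for exact recovery.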


Source journal

Annals of Statistics (Mathematics: Statistics & Probability)

CiteScore: 9.30

Self-citation rate: 8.90%

Articles published: 119

Review time: 6-12 weeks

Journal description: The Annals of Statistics aims to publish research papers of the highest quality reflecting the many facets of contemporary statistics. Primary emphasis is placed on importance and originality, not on formalism. The journal aims to cover all areas of statistics, especially mathematical statistics and applied & interdisciplinary statistics. Of course many of the best papers will touch on more than one of these general areas, because the discipline of statistics has deep roots in mathematics, and in substantive scientific fields.