MODEL ASSISTED VARIABLE CLUSTERING: MINIMAX-OPTIMAL RECOVERY AND ALGORITHMS.

IF 4.3 · Zone 3 (Materials Science) · Q1 · ENGINEERING, ELECTRICAL & ELECTRONIC
ACS Applied Electronic Materials Pub Date : 2020-02-01 Epub Date: 2020-02-17 DOI:10.1214/18-aos1794
Florentina Bunea, Christophe Giraud, Xi Luo, Martin Royer, Nicolas Verzelen
Citations: 23

Abstract

The problem of variable clustering is that of estimating groups of similar components of a p-dimensional vector X = (X_1, ..., X_p) from n independent copies of X. A large number of algorithms return data-dependent groups of variables, but the interpretation of those groups is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population-level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
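
The latent-factor example in the abstract can be illustrated with a small sketch. It assumes, purely for illustration, that each variable is X_a = Z_k(a) + E_a, where k(a) is the cluster of variable a and the noise terms E_a are independent of each other and of the latent factors Z; under that assumption Cov(X_a, X_b) equals Cov(Z_k(a), Z_k(b)) for a ≠ b. The function names, the cluster labels, and all numeric values are hypothetical, and the dissimilarity below is only one way to express "similar associations with all other variables"; it is not claimed to be the exact statistic used by COD or PECOK.

```python
from itertools import combinations

def block_covariance(labels, latent_cov, noise_var):
    """Population covariance of X_a = Z_{labels[a]} + E_a (assumed model).

    labels[a]   : cluster index of variable a
    latent_cov  : covariance matrix of the latent factors Z
    noise_var   : variance of the independent noise E_a, one value per variable
    """
    p = len(labels)
    sigma = [[latent_cov[labels[a]][labels[b]] for b in range(p)]
             for a in range(p)]
    for a in range(p):
        sigma[a][a] += noise_var[a]  # noise only affects the diagonal
    return sigma

def dissimilarity(sigma, a, b):
    """Largest gap between the covariances of a and b with any third variable."""
    return max(abs(sigma[a][c] - sigma[b][c])
               for c in range(len(sigma)) if c not in (a, b))

# Hypothetical example: 6 variables, 2 clusters, unequal noise levels.
labels = [0, 0, 0, 1, 1, 1]
latent_cov = [[1.0, 0.3], [0.3, 2.0]]
noise_var = [0.5, 1.0, 1.5, 0.5, 1.0, 1.5]
sigma = block_covariance(labels, latent_cov, noise_var)

pairs = list(combinations(range(len(labels)), 2))
within = [dissimilarity(sigma, a, b) for a, b in pairs if labels[a] == labels[b]]
between = [dissimilarity(sigma, a, b) for a, b in pairs if labels[a] != labels[b]]

print("max within-cluster dissimilarity:", max(within))
print("min between-cluster dissimilarity:", min(between))
```

In this toy setup, variables in the same cluster have identical covariances with every third variable even though their noise levels differ, so the within-cluster dissimilarity is zero while the between-cluster dissimilarity is bounded away from zero. With sample covariances in place of the population matrix, the gap between the two groups is what a separation threshold would have to dominate.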
