Florentina Bunea, Christophe Giraud, Xi Luo, Martin Royer, Nicolas Verzelen
{"title":"MODEL ASSISTED VARIABLE CLUSTERING: MINIMAX-OPTIMAL RECOVERY AND ALGORITHMS.","authors":"Florentina Bunea, Christophe Giraud, Xi Luo, Martin Royer, Nicolas Verzelen","doi":"10.1214/18-aos1794","DOIUrl":null,"url":null,"abstract":"<p><p>The problem of variable clustering is that of estimating groups of similar components of a <i>p</i>-dimensional vector <i>X</i> = (<i>X</i> <sub>1</sub>, … , <i>X</i> <sub><i>p</i></sub> ) from <i>n</i> independent copies of <i>X</i>. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of <i>G</i>-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations will all other variables. This can arise, for instance, when groups of variables are noise corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a <i>G</i>-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to <i>G</i>-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular <i>K</i>-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.</p>","PeriodicalId":3,"journal":{"name":"ACS Applied Electronic Materials","volume":" ","pages":"111-137"},"PeriodicalIF":4.3000,"publicationDate":"2020-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9286061/pdf/nihms-1765231.pdf","citationCount":"23","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Electronic Materials","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1214/18-aos1794","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2020/2/17 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 23
Abstract
The problem of variable clustering is that of estimating groups of similar components of a p-dimensional vector X = (X1, … , Xp ) from n independent copies of X. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations will all other variables. This can arise, for instance, when groups of variables are noise corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
期刊介绍:
ACS Applied Electronic Materials is an interdisciplinary journal publishing original research covering all aspects of electronic materials. The journal is devoted to reports of new and original experimental and theoretical research of an applied nature that integrate knowledge in the areas of materials science, engineering, optics, physics, and chemistry into important applications of electronic materials. Sample research topics that span the journal's scope are inorganic, organic, ionic and polymeric materials with properties that include conducting, semiconducting, superconducting, insulating, dielectric, magnetic, optoelectronic, piezoelectric, ferroelectric and thermoelectric.
Indexed/Abstracted:
Web of Science SCIE
Scopus
CAS
INSPEC
Portico