{"title":"An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture-based clustering","authors":"Christian Hennig, Pietro Coretto","doi":"10.1111/anzs.12338","DOIUrl":null,"url":null,"abstract":"<p>We introduce a new approach to deciding the number of clusters. The approach is applied to Optimally Tuned Robust Improper Maximum Likelihood Estimation (OTRIMLE; Coretto & Hennig, <i>Journal of the American Statistical Association</i> <b>111</b>, 1648–1659) of a Gaussian mixture model allowing for observations to be classified as ‘noise’, but it can be applied to other clustering methods as well. The quality of a clustering is assessed by a statistic <i>Q</i> that measures how close the within-cluster distributions are to elliptical unimodal distributions that have the only mode in the mean. This non-parametric measure allows for non-Gaussian clusters as long as they have a good quality according to <i>Q</i>. The simplicity of a model is assessed by a measure <i>S</i> that prefers a smaller number of clusters unless additional clusters can reduce the estimated noise proportion substantially. The simplest model is then chosen that is adequate for the data in the sense that its observed value of <i>Q</i> is not significantly larger than what is expected for data truly generated from the fitted model, as can be assessed by parametric bootstrap. The approach is compared with model-based clustering using the Bayesian information criterion (BIC) and the integrated complete likelihood (ICL) in a simulation study and on two real data sets.</p>","PeriodicalId":55428,"journal":{"name":"Australian & New Zealand Journal of Statistics","volume":"64 2","pages":"230-254"},"PeriodicalIF":0.8000,"publicationDate":"2021-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1111/anzs.12338","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Australian & New Zealand Journal of Statistics","FirstCategoryId":"100","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/anzs.12338","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}
引用次数: 3
Abstract
We introduce a new approach to deciding the number of clusters. The approach is applied to Optimally Tuned Robust Improper Maximum Likelihood Estimation (OTRIMLE; Coretto & Hennig, Journal of the American Statistical Association111, 1648–1659) of a Gaussian mixture model allowing for observations to be classified as ‘noise’, but it can be applied to other clustering methods as well. The quality of a clustering is assessed by a statistic Q that measures how close the within-cluster distributions are to elliptical unimodal distributions that have the only mode in the mean. This non-parametric measure allows for non-Gaussian clusters as long as they have a good quality according to Q. The simplicity of a model is assessed by a measure S that prefers a smaller number of clusters unless additional clusters can reduce the estimated noise proportion substantially. The simplest model is then chosen that is adequate for the data in the sense that its observed value of Q is not significantly larger than what is expected for data truly generated from the fitted model, as can be assessed by parametric bootstrap. The approach is compared with model-based clustering using the Bayesian information criterion (BIC) and the integrated complete likelihood (ICL) in a simulation study and on two real data sets.
我们介绍了一种确定集群数量的新方法。该方法应用于最优调谐鲁棒不当极大似然估计(OTRIMLE;Coretto,Hennig,《美国统计协会杂志》(Journal of American Statistical Association),第111期,1648-1659期),他提出了一种高斯混合模型,该模型允许将观测结果归类为“噪声”,但它也可以应用于其他聚类方法。聚类的质量是通过统计量Q来评估的,该统计量Q测量聚类内分布与椭圆单峰分布的接近程度,椭圆单峰分布的唯一模式是在平均值中。这种非参数度量允许非高斯聚类,只要它们根据q具有良好的质量。模型的简单性由度量S评估,该度量S倾向于较少数量的聚类,除非额外的聚类可以大幅降低估计的噪声比例。然后选择最简单的模型,该模型适合于数据,因为其观察到的Q值不会显著大于从拟合模型真正生成的数据的预期值,可以通过参数自举来评估。在仿真研究和两个真实数据集上,将该方法与基于贝叶斯信息准则(BIC)和集成完全似然(ICL)的模型聚类方法进行了比较。
期刊介绍:
The Australian & New Zealand Journal of Statistics is an international journal managed jointly by the Statistical Society of Australia and the New Zealand Statistical Association. Its purpose is to report significant and novel contributions in statistics, ranging across articles on statistical theory, methodology, applications and computing. The journal has a particular focus on statistical techniques that can be readily applied to real-world problems, and on application papers with an Australasian emphasis. Outstanding articles submitted to the journal may be selected as Discussion Papers, to be read at a meeting of either the Statistical Society of Australia or the New Zealand Statistical Association.
The main body of the journal is divided into three sections.
The Theory and Methods Section publishes papers containing original contributions to the theory and methodology of statistics, econometrics and probability, and seeks papers motivated by a real problem and which demonstrate the proposed theory or methodology in that situation. There is a strong preference for papers motivated by, and illustrated with, real data.
The Applications Section publishes papers demonstrating applications of statistical techniques to problems faced by users of statistics in the sciences, government and industry. A particular focus is the application of newly developed statistical methodology to real data and the demonstration of better use of established statistical methodology in an area of application. It seeks to aid teachers of statistics by placing statistical methods in context.
The Statistical Computing Section publishes papers containing new algorithms, code snippets, or software descriptions (for open source software only) which enhance the field through the application of computing. Preference is given to papers featuring publically available code and/or data, and to those motivated by statistical methods for practical problems.