An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture-based clustering

Pub Date: 2021-09-03 DOI: 10.1111/anzs.12338
Christian Hennig, Pietro Coretto
Citations: 3

Abstract

We introduce a new approach to deciding the number of clusters. The approach is applied to Optimally Tuned Robust Improper Maximum Likelihood Estimation (OTRIMLE; Coretto & Hennig, Journal of the American Statistical Association 111, 1648–1659) of a Gaussian mixture model allowing for observations to be classified as ‘noise’, but it can be applied to other clustering methods as well. The quality of a clustering is assessed by a statistic Q that measures how close the within-cluster distributions are to elliptical unimodal distributions that have the only mode in the mean. This non-parametric measure allows for non-Gaussian clusters as long as they have a good quality according to Q. The simplicity of a model is assessed by a measure S that prefers a smaller number of clusters unless additional clusters can reduce the estimated noise proportion substantially. The simplest model is then chosen that is adequate for the data in the sense that its observed value of Q is not significantly larger than what is expected for data truly generated from the fitted model, as can be assessed by parametric bootstrap. The approach is compared with model-based clustering using the Bayesian information criterion (BIC) and the integrated complete likelihood (ICL) in a simulation study and on two real data sets.
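The selection rule described in the abstract (take the simplest model whose observed quality statistic is not significantly larger than its parametric-bootstrap distribution under the fitted model) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the data are one-dimensional, the fit is a crude k-means followed by per-cluster Gaussians rather than OTRIMLE, there is no noise component, the function `quality` is a stand-in for Q (a size-weighted Kolmogorov distance to the fitted Gaussian, not the paper's statistic), and "simplest" just means the smallest number of clusters rather than the measure S. All function names are hypothetical.

```python
import numpy as np
from math import erf, sqrt

# Standard normal CDF, vectorised over NumPy arrays.
_norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))


def fit_mixture(x, k, n_iter=50):
    """Stand-in fit (NOT OTRIMLE): 1-D k-means, then per-cluster mean and sd."""
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    weights = np.bincount(labels, minlength=k) / x.size
    means = np.array([x[labels == j].mean() if weights[j] > 0 else centers[j]
                      for j in range(k)])
    sds = np.array([max(x[labels == j].std(), 1e-6) if weights[j] > 0 else 1.0
                    for j in range(k)])
    return labels, weights, means, sds


def quality(x, labels, means, sds):
    """Stand-in for Q: size-weighted Kolmogorov distance between each
    cluster's empirical CDF and its fitted Gaussian CDF (smaller is better)."""
    q = 0.0
    for j in range(len(means)):
        xj = np.sort(x[labels == j])
        m = xj.size
        if m == 0:
            continue
        cdf = _norm_cdf((xj - means[j]) / sds[j])
        upper = np.arange(1, m + 1) / m - cdf
        lower = cdf - np.arange(0, m) / m
        q += (m / x.size) * max(upper.max(), lower.max())
    return q


def simulate(n, weights, means, sds, rng):
    """Draw n points from the fitted Gaussian mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comp], sds[comp])


def choose_k(x, max_k=4, n_boot=50, alpha=0.05, seed=0):
    """Return the smallest k whose observed quality is not significantly
    worse than its parametric-bootstrap distribution, plus all p-values."""
    rng = np.random.default_rng(seed)
    p_values = {}
    for k in range(1, max_k + 1):
        labels, w, mu, sd = fit_mixture(x, k)
        q_obs = quality(x, labels, mu, sd)
        q_boot = []
        for _ in range(n_boot):
            xb = simulate(x.size, w, mu, sd, rng)
            lb, _wb, mb, sb = fit_mixture(xb, k)  # refit on each bootstrap sample
            q_boot.append(quality(xb, lb, mb, sb))
        # One-sided bootstrap p-value: large observed Q means poor adequacy.
        p_values[k] = (1 + sum(qb >= q_obs for qb in q_boot)) / (n_boot + 1)
        if p_values[k] > alpha:
            return k, p_values
    return None, p_values  # no candidate was adequate


# Usage on toy data: two well-separated Gaussian clusters.
demo_rng = np.random.default_rng(1)
x_demo = np.concatenate([demo_rng.normal(-5, 1, 100), demo_rng.normal(5, 1, 100)])
k_demo, p_demo = choose_k(x_demo)
print(k_demo, p_demo)
```

Refitting the model on every bootstrap sample matters here: the reference distribution then reflects the same estimation step that produced the observed statistic, which is what makes the comparison a fair adequacy check.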
