加法重叠分割聚类中确定最优重叠簇数的模型选择策略

IF 1.9 4区计算机科学 Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Journal of Classification Pub Date : 2022-01-17 DOI:10.1007/s00357-021-09409-1

Julian Rossbroich, Jeffrey Durieux, Tom F. Wilderjans

{"title":"加法重叠分割聚类中确定最优重叠簇数的模型选择策略","authors":"Julian Rossbroich, Jeffrey Durieux, Tom F. Wilderjans","doi":"10.1007/s00357-021-09409-1","DOIUrl":null,"url":null,"abstract":"In various scientific fields, researchers make use of partitioning methods (e.g., K-means) to disclose the structural mechanisms underlying object by variable data. In some instances, however, a grouping of objects into clusters that are allowed to overlap (i.e., assigning objects to multiple clusters) might lead to a better representation of the underlying clustering structure. To obtain an overlapping object clustering from object by variable data, Mirkin’s ADditive PROfile CLUStering (ADPROCLUS) model may be used. A major challenge when performing ADPROCLUS is to determine the optimal number of overlapping clusters underlying the data, which pertains to a model selection problem. Up to now, however, this problem has not been systematically investigated and almost no guidelines can be found in the literature regarding appropriate model selection strategies for ADPROCLUS. Therefore, in this paper, several existing model selection strategies for K-means (a.o., CHull, the Caliński-Harabasz, Krzanowski-Lai, Average Silhouette Width and Dunn Index and information-theoretic measures like AIC and BIC) and two cross-validation based strategies are tailored towards an ADPROCLUS context and are compared to each other in an extensive simulation study. The results demonstrate that CHull outperforms all other model selection strategies and this especially when the negative log-likelihood, which is associated with a minimal stochastic extension of ADPROCLUS, is used as (mis)fit measure. The analysis of a post hoc AIC-based model selection strategy revealed that better performance may be obtained when a different—more appropriate—definition of model complexity for ADPROCLUS is used.","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"28 1","pages":""},"PeriodicalIF":1.9000,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Model Selection Strategies for Determining the Optimal Number of Overlapping Clusters in Additive Overlapping Partitional Clustering\",\"authors\":\"Julian Rossbroich, Jeffrey Durieux, Tom F. Wilderjans\",\"doi\":\"10.1007/s00357-021-09409-1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In various scientific fields, researchers make use of partitioning methods (e.g., K-means) to disclose the structural mechanisms underlying object by variable data. In some instances, however, a grouping of objects into clusters that are allowed to overlap (i.e., assigning objects to multiple clusters) might lead to a better representation of the underlying clustering structure. To obtain an overlapping object clustering from object by variable data, Mirkin’s ADditive PROfile CLUStering (ADPROCLUS) model may be used. A major challenge when performing ADPROCLUS is to determine the optimal number of overlapping clusters underlying the data, which pertains to a model selection problem. Up to now, however, this problem has not been systematically investigated and almost no guidelines can be found in the literature regarding appropriate model selection strategies for ADPROCLUS. Therefore, in this paper, several existing model selection strategies for K-means (a.o., CHull, the Caliński-Harabasz, Krzanowski-Lai, Average Silhouette Width and Dunn Index and information-theoretic measures like AIC and BIC) and two cross-validation based strategies are tailored towards an ADPROCLUS context and are compared to each other in an extensive simulation study. The results demonstrate that CHull outperforms all other model selection strategies and this especially when the negative log-likelihood, which is associated with a minimal stochastic extension of ADPROCLUS, is used as (mis)fit measure. The analysis of a post hoc AIC-based model selection strategy revealed that better performance may be obtained when a different—more appropriate—definition of model complexity for ADPROCLUS is used.\",\"PeriodicalId\":50241,\"journal\":{\"name\":\"Journal of Classification\",\"volume\":\"28 1\",\"pages\":\"\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2022-01-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Classification\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s00357-021-09409-1\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Classification","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00357-021-09409-1","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 2

摘要

在各个科学领域，研究者利用划分方法(如K-means)通过变量数据揭示对象背后的结构机制。然而，在某些情况下，将对象分组到允许重叠的集群中(即，将对象分配给多个集群)可能会更好地表示底层集群结构。为了通过变量数据从目标中获得重叠目标聚类，可以使用Mirkin的ADditive PROfile clustering (ADPROCLUS)模型。执行ADPROCLUS时的一个主要挑战是确定数据基础上重叠簇的最佳数量，这涉及到一个模型选择问题。然而，到目前为止，这个问题还没有系统的研究，几乎没有在文献中找到关于ADPROCLUS合适的模型选择策略的指导方针。因此，在本文中，现有的几种K-means模型选择策略(a.o、CHull、Caliński-Harabasz、Krzanowski-Lai、平均轮廓宽度和Dunn指数以及信息论措施如AIC和BIC)和两种基于交叉验证的策略针对ADPROCLUS环境进行了定制，并在广泛的模拟研究中相互比较。结果表明，CHull优于所有其他模型选择策略，特别是当使用与ADPROCLUS的最小随机扩展相关的负对数似然作为(误)拟合度量时。对一种基于事后aic的模型选择策略的分析表明，当使用不同的(更合适的)ADPROCLUS模型复杂度定义时，可以获得更好的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Model Selection Strategies for Determining the Optimal Number of Overlapping Clusters in Additive Overlapping Partitional Clustering

In various scientific fields, researchers make use of partitioning methods (e.g., K-means) to disclose the structural mechanisms underlying object by variable data. In some instances, however, a grouping of objects into clusters that are allowed to overlap (i.e., assigning objects to multiple clusters) might lead to a better representation of the underlying clustering structure. To obtain an overlapping object clustering from object by variable data, Mirkin’s ADditive PROfile CLUStering (ADPROCLUS) model may be used. A major challenge when performing ADPROCLUS is to determine the optimal number of overlapping clusters underlying the data, which pertains to a model selection problem. Up to now, however, this problem has not been systematically investigated and almost no guidelines can be found in the literature regarding appropriate model selection strategies for ADPROCLUS. Therefore, in this paper, several existing model selection strategies for K-means (a.o., CHull, the Caliński-Harabasz, Krzanowski-Lai, Average Silhouette Width and Dunn Index and information-theoretic measures like AIC and BIC) and two cross-validation based strategies are tailored towards an ADPROCLUS context and are compared to each other in an extensive simulation study. The results demonstrate that CHull outperforms all other model selection strategies and this especially when the negative log-likelihood, which is associated with a minimal stochastic extension of ADPROCLUS, is used as (mis)fit measure. The analysis of a post hoc AIC-based model selection strategy revealed that better performance may be obtained when a different—more appropriate—definition of model complexity for ADPROCLUS is used.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Classification 数学-数学跨学科应用

CiteScore

3.60

自引率

5.00%

发文量

审稿时长

>12 weeks

期刊介绍： To publish original and valuable papers in the field of classification, numerical taxonomy, multidimensional scaling and other ordination techniques, clustering, tree structures and other network models (with somewhat less emphasis on principal components analysis, factor analysis, and discriminant analysis), as well as associated models and algorithms for fitting them. Articles will support advances in methodology while demonstrating compelling substantive applications. Comprehensive review articles are also acceptable. Contributions will represent disciplines such as statistics, psychology, biology, information retrieval, anthropology, archeology, astronomy, business, chemistry, computer science, economics, engineering, geography, geology, linguistics, marketing, mathematics, medicine, political science, psychiatry, sociology, and soil science.