How many clusters exist? Answer via maximum clustering similarity implemented in R

Q3 Medicine

Biostatistics and Epidemiology Pub Date : 2019-01-01 DOI:10.1080/24709360.2019.1615770

A. Albatineh, M. Wilcox, B. Zogheib, M. Niewiadomska-Bugaj

Biostatistics and Epidemiology, vol. 3, no. 1, pp. 62-79 (2019). https://doi.org/10.1080/24709360.2019.1615770

Citations: 0

Abstract


Finding the number of clusters in a data set is considered one of the fundamental problems in cluster analysis. This paper integrates maximum clustering similarity (MCS), a method for finding the optimal number of clusters, into the R statistical software through the package MCSim. The similarity between two clustering methods is calculated at the same number of clusters, using the Rand [Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66:846–850.] and Jaccard [The distribution of the flora of the alpine zone. New Phytologist. 1912;11:37–50.] indices, corrected for chance agreement. The number of clusters at which the index most frequently attains its maximum is a candidate for the optimal number of clusters. Unlike other criteria, MCS can be used with circular data. Seven clustering algorithms available in R are implemented in MCSim. The package produces a graph of the number of clusters versus clustering similarity based on the corrected similarity indices, together with the values of the similarity indices and a clustering tree (dendrogram). Several examples, including simulated, real, and circular data sets, are presented to show how MCSim works in practice.
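The idea in the abstract can be illustrated with a small sketch. The code below is not the MCSim package and does not reproduce its interface; it is a simplified Python illustration that assumes only one pair of clustering methods (naive single-linkage and complete-linkage agglomerative clustering) and only one index (the chance-corrected Rand index in the Hubert-Arabie form). The function names `adjusted_rand`, `agglomerate`, and `mcs`, and the toy data, are hypothetical. For each candidate number of clusters k, both methods partition the data, the corrected index between the two partitions is computed, and the k with the highest similarity is reported as the candidate.

```python
from itertools import combinations
from math import comb, dist


def adjusted_rand(a, b):
    """Chance-corrected Rand index (Hubert-Arabie form) between two labelings."""
    n = len(a)
    cells, rows, cols = {}, {}, {}
    for x, y in zip(a, b):
        cells[(x, y)] = cells.get((x, y), 0) + 1
        rows[x] = rows.get(x, 0) + 1
        cols[y] = cols.get(y, 0) + 1
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # both partitions trivial and identical
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def agglomerate(points, k, linkage):
    """Naive agglomerative clustering into k clusters.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Cubic-time toy implementation, meant for small illustrative data only.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(
                dist(points[p], points[q])
                for p in clusters[ij[0]]
                for q in clusters[ij[1]]
            ),
        )
        clusters[i] += clusters.pop(j)  # i < j, so index i stays valid
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for p in members:
            labels[p] = label
    return labels


def mcs(points, k_values):
    """Return (best k, {k: similarity}) for one pair of methods and one index."""
    scores = {
        k: adjusted_rand(agglomerate(points, k, min), agglomerate(points, k, max))
        for k in k_values
    }
    return max(scores, key=scores.get), scores


# Hypothetical one-dimensional toy data with three visible groups.
xs = [0, 1, 2, 3, 4, 5, 6, 13, 15, 23, 24]
points = [(float(x),) for x in xs]

best_k, scores = mcs(points, range(2, 6))
for k, s in scores.items():
    print(f"k={k}: corrected Rand = {s:.3f}")
print("candidate number of clusters:", best_k)
```

On this toy data the two methods agree perfectly only at k = 3 and disagree at k = 2, 4, and 5, so k = 3 is the candidate. As described in the abstract, the full method compares clustering methods using both the corrected Rand and corrected Jaccard indices and takes the k at which the maximum is attained most frequently; this sketch covers only a single pair and a single index.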


Source journal

Biostatistics and Epidemiology Medicine-Health Informatics

CiteScore

1.80

Self-citation rate

0.00%

Number of publications