一种通用迭代聚类算法。

IF 2.1 4区数学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Statistical Analysis and Data Mining Pub Date : 2022-08-01 Epub Date: 2022-01-31 DOI:10.1002/sam.11573

Ziqiang Lin, Eugene Laska, Carole Siegel

{"title":"一种通用迭代聚类算法。","authors":"Ziqiang Lin, Eugene Laska, Carole Siegel","doi":"10.1002/sam.11573","DOIUrl":null,"url":null,"abstract":"The quality of a cluster analysis of unlabeled units depends on the quality of the between units dissimilarity measures. Data dependent dissimilarity is more objective than data independent geometric measures such as Euclidean distance. As suggested by Breiman, many data driven approaches are based on decision tree ensembles, such as a random forest (RF), that produce a proximity matrix that can easily be transformed into a dissimilarity matrix. A RF can be obtained using labels that distinguish units with real data from units with synthetic data. The resulting dissimilarity matrix is input to a clustering program and units are assigned labels corresponding to cluster membership. We introduce a General Iterative Cluster (GIC) algorithm that improves the proximity matrix and clusters of the base RF. The cluster labels are used to grow a new RF yielding an updated proximity matrix which is entered into the clustering program. The process is repeated until convergence. The same procedure can be used with many base procedures such as the Extremely Randomized Tree ensemble. We evaluate the performance of the GIC algorithm using benchmark and simulated data sets. The properties measured by the Silhouette Score are substantially superior to the base clustering algorithm. The GIC package has been released in R: https://cran.r-project.org/web/packages/GIC/index.html.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"15 4","pages":"433-446"},"PeriodicalIF":2.1000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/c8/08/SAM-15-433.PMC9438941.pdf","citationCount":"1","resultStr":"{\"title\":\"A General Iterative Clustering Algorithm.\",\"authors\":\"Ziqiang Lin, Eugene Laska, Carole Siegel\",\"doi\":\"10.1002/sam.11573\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The quality of a cluster analysis of unlabeled units depends on the quality of the between units dissimilarity measures. Data dependent dissimilarity is more objective than data independent geometric measures such as Euclidean distance. As suggested by Breiman, many data driven approaches are based on decision tree ensembles, such as a random forest (RF), that produce a proximity matrix that can easily be transformed into a dissimilarity matrix. A RF can be obtained using labels that distinguish units with real data from units with synthetic data. The resulting dissimilarity matrix is input to a clustering program and units are assigned labels corresponding to cluster membership. We introduce a General Iterative Cluster (GIC) algorithm that improves the proximity matrix and clusters of the base RF. The cluster labels are used to grow a new RF yielding an updated proximity matrix which is entered into the clustering program. The process is repeated until convergence. The same procedure can be used with many base procedures such as the Extremely Randomized Tree ensemble. We evaluate the performance of the GIC algorithm using benchmark and simulated data sets. The properties measured by the Silhouette Score are substantially superior to the base clustering algorithm. The GIC package has been released in R: https://cran.r-project.org/web/packages/GIC/index.html.\",\"PeriodicalId\":48684,\"journal\":{\"name\":\"Statistical Analysis and Data Mining\",\"volume\":\"15 4\",\"pages\":\"433-446\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2022-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/c8/08/SAM-15-433.PMC9438941.pdf\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Statistical Analysis and Data Mining\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1002/sam.11573\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2022/1/31 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistical Analysis and Data Mining","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1002/sam.11573","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2022/1/31 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 1

摘要

未标记单元的聚类分析的质量取决于单元之间不相似性度量的质量。与欧几里得距离等与数据无关的几何度量相比，数据相关的不相似度更为客观。正如Breiman所建议的那样，许多数据驱动的方法都是基于决策树集成的，比如随机森林(RF)，它产生一个接近矩阵，可以很容易地转化为不相似矩阵。射频可以用标签来区分真实数据和合成数据。将得到的不相似矩阵输入到聚类程序中，并为单元分配与聚类隶属度相对应的标签。提出了一种通用迭代聚类(GIC)算法，改进了基射频的接近矩阵和聚类。聚类标签用于生成一个新的RF，产生一个更新的接近矩阵，该矩阵被输入到聚类程序中。这个过程不断重复，直到收敛。相同的过程可以用于许多基本过程，例如极端随机树集成。我们使用基准和模拟数据集来评估GIC算法的性能。由Silhouette Score测量的属性实质上优于基本聚类算法。GIC软件包已在R: https://cran.r-project.org/web/packages/GIC/index.html发布。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

A General Iterative Clustering Algorithm.

查看原文本刊更多论文

A General Iterative Clustering Algorithm.

The quality of a cluster analysis of unlabeled units depends on the quality of the between units dissimilarity measures. Data dependent dissimilarity is more objective than data independent geometric measures such as Euclidean distance. As suggested by Breiman, many data driven approaches are based on decision tree ensembles, such as a random forest (RF), that produce a proximity matrix that can easily be transformed into a dissimilarity matrix. A RF can be obtained using labels that distinguish units with real data from units with synthetic data. The resulting dissimilarity matrix is input to a clustering program and units are assigned labels corresponding to cluster membership. We introduce a General Iterative Cluster (GIC) algorithm that improves the proximity matrix and clusters of the base RF. The cluster labels are used to grow a new RF yielding an updated proximity matrix which is entered into the clustering program. The process is repeated until convergence. The same procedure can be used with many base procedures such as the Extremely Randomized Tree ensemble. We evaluate the performance of the GIC algorithm using benchmark and simulated data sets. The properties measured by the Silhouette Score are substantially superior to the base clustering algorithm. The GIC package has been released in R: https://cran.r-project.org/web/packages/GIC/index.html.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Statistical Analysis and Data Mining COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCEC-COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

CiteScore

3.20

自引率

7.70%

发文量

期刊介绍： Statistical Analysis and Data Mining addresses the broad area of data analysis, including statistical approaches, machine learning, data mining, and applications. Topics include statistical and computational approaches for analyzing massive and complex datasets, novel statistical and/or machine learning methods and theory, and state-of-the-art applications with high impact. Of special interest are articles that describe innovative analytical techniques, and discuss their application to real problems, in such a way that they are accessible and beneficial to domain experts across science, engineering, and commerce. The focus of the journal is on papers which satisfy one or more of the following criteria: Solve data analysis problems associated with massive, complex datasets Develop innovative statistical approaches, machine learning algorithms, or methods integrating ideas across disciplines, e.g., statistics, computer science, electrical engineering, operation research. Formulate and solve high-impact real-world problems which challenge existing paradigms via new statistical and/or computational models Provide survey to prominent research topics.