Quantitative evaluation of internal cluster validation indices using binary data sets

IF 2.2 · Zone 3, Environmental Science and Ecology · Q2 Ecology
Naghmeh Pakgohar, Attila Lengyel, Zoltán Botta-Dukát
Journal of Vegetation Science, vol. 35, no. 5. Journal article, published 2024-10-12.
DOI: 10.1111/jvs.13310
https://onlinelibrary.wiley.com/doi/10.1111/jvs.13310

Abstract

Aims

Different clustering methods often classify the same data set differently. Cluster validation indices (CVIs) make it possible to select the “best” clustering solution from the alternatives. Because so many CVIs exist, choosing the index best suited to a given data set and clustering algorithm is challenging. We aim to assess different internal clustering validation indices.

Methods

Artificial binary data sets with equal- and unequal-sized, well-separated a priori clusters were simulated, and three levels of noise were then added. Twenty replications of each of the six types of data sets (two group sizes × three levels of noise) were created and analyzed by three clustering algorithms with Jaccard dissimilarity. Twenty-seven clustering validation indices were evaluated, including both geometric and non-geometric indices.
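
The simulation step can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not say how the clusters were generated or how noise was added, so the block structure (one private set of variables per cluster), the cell-flipping noise model, and the function names `simulate_binary_data`, `jaccard_dissimilarity`, and `dissimilarity_matrix` are all assumptions made for illustration. Only the Jaccard dissimilarity itself, 1 minus shared presences over total presences, is standard.

```python
import random


def simulate_binary_data(cluster_sizes, vars_per_cluster, noise, seed=0):
    """Return (rows, labels) for a block-structured binary data set.

    Assumption: each cluster owns a private block of variables set to 1,
    and noise flips each cell independently with probability `noise`.
    """
    rng = random.Random(seed)
    n_vars = vars_per_cluster * len(cluster_sizes)
    rows, labels = [], []
    for c, size in enumerate(cluster_sizes):
        lo, hi = c * vars_per_cluster, (c + 1) * vars_per_cluster
        for _ in range(size):
            row = [1 if lo <= j < hi else 0 for j in range(n_vars)]
            # Hypothetical noise model: independent cell flips.
            row = [1 - v if rng.random() < noise else v for v in row]
            rows.append(row)
            labels.append(c)
    return rows, labels


def jaccard_dissimilarity(a, b):
    """1 - |a AND b| / |a OR b|; set to 0 when both rows are empty (assumed convention)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def dissimilarity_matrix(rows):
    """Full symmetric matrix of pairwise Jaccard dissimilarities."""
    n = len(rows)
    return [[jaccard_dissimilarity(rows[i], rows[j]) for j in range(n)]
            for i in range(n)]


# Equal-sized and unequal-sized designs at one (arbitrary) noise level.
equal_rows, equal_labels = simulate_binary_data([10, 10, 10], 8, noise=0.1)
unequal_rows, unequal_labels = simulate_binary_data([20, 7, 3], 8, noise=0.1)
D = dissimilarity_matrix(equal_rows)
```

The resulting matrix `D` is the kind of input that a dissimilarity-based clustering algorithm and a geometric CVI would both consume.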

Results

Although, in theory, all CVIs could differentiate between good and poor classifications, only a few performed as expected with noisy data. Tau and silhouette widths proved to be the best geometric CVIs for both equal and unequal cluster sizes. Among non-geometric indices, crispness and OptimClass performed best.

Conclusion

We recommend using these best-performing CVIs. We suggest plotting the CVI value against the number of clusters because the lack of a sharp peak means that the position of the maximum is uncertain.
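
The recommendation to inspect the CVI value across the number of clusters can be sketched as follows, using the mean silhouette width as the example index. This is an illustrative sketch, not the authors' code: the helper names `mean_silhouette_width` and `cvi_profile`, the toy dissimilarity matrix, and the candidate partitions are hypothetical, and singletons are given a silhouette of 0 by the usual convention.

```python
def mean_silhouette_width(D, labels):
    """Mean silhouette width from a dissimilarity matrix D and cluster labels."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean dissimilarity to the other members of the own cluster
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to any other cluster
        b = min(sum(D[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


def cvi_profile(D, partitions):
    """Return sorted (k, CVI value) pairs; `partitions` maps k to a label list."""
    return sorted((k, mean_silhouette_width(D, labs))
                  for k, labs in partitions.items())


# Toy example: two tight pairs of objects, far from each other.
D = [[0.0, 0.1, 0.9, 0.9],
     [0.1, 0.0, 0.9, 0.9],
     [0.9, 0.9, 0.0, 0.1],
     [0.9, 0.9, 0.1, 0.0]]
partitions = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
profile = cvi_profile(D, partitions)
for k, value in profile:
    print(k, round(value, 3))
```

Plotting the `(k, value)` pairs with any plotting library shows whether the maximum is a sharp peak or sits on a plateau; in the latter case the estimated number of clusters should be treated as uncertain.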

Source journal: Journal of Vegetation Science