Naghmeh Pakgohar, Attila Lengyel, Zoltán Botta-Dukát
Journal of Vegetation Science, 35(5). Published 2024-10-12. DOI: 10.1111/jvs.13310. https://onlinelibrary.wiley.com/doi/10.1111/jvs.13310
Quantitative evaluation of internal cluster validation indices using binary data sets
Aims
Different clustering methods often classify the same data set differently. Cluster validation indices (CVIs) make it possible to select the “best” clustering solution from among alternatives. Because so many CVIs exist, choosing the index best suited to a given data set and clustering algorithm is challenging. We aim to assess different internal cluster validation indices.
Methods
Artificial binary data sets with equal- and unequal-sized, well-separated a priori clusters were simulated, and three levels of noise were then added. Twenty replications of each of the six types of data sets (two group-size settings × three levels of noise) were created and analyzed by three clustering algorithms using Jaccard dissimilarity. Twenty-seven CVIs, both geometric and non-geometric, were evaluated.
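The simulation step described above can be illustrated with a minimal sketch. The abstract does not give the generator, the noise levels, or the cluster sizes, so everything here is an assumption made for illustration: the function names (`simulate_binary`, `jaccard_dissimilarity`, `dissimilarity_matrix`), the block-structured cluster design, and the cell-flipping noise model are hypothetical and are not the authors' procedure. Only the Jaccard dissimilarity itself, one minus the ratio of shared presences to total presences, follows the standard definition.

```python
import random


def simulate_binary(cluster_sizes, n_vars_per_cluster, noise, seed=0):
    """Hypothetical generator: each cluster 'owns' one block of variables
    set to 1; every cell is then flipped with probability `noise`."""
    rng = random.Random(seed)
    n_vars = n_vars_per_cluster * len(cluster_sizes)
    rows, labels = [], []
    for c, size in enumerate(cluster_sizes):
        for _ in range(size):
            # Noise-free pattern: ones inside the cluster's own block only.
            row = [1 if j // n_vars_per_cluster == c else 0 for j in range(n_vars)]
            # Flip each cell independently with probability `noise`.
            row = [1 - v if rng.random() < noise else v for v in row]
            rows.append(row)
            labels.append(c)
    return rows, labels


def jaccard_dissimilarity(a, b):
    """1 - |a AND b| / |a OR b| for two binary vectors.
    Two all-zero vectors are treated as identical (an assumed convention)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def dissimilarity_matrix(rows):
    """Full symmetric matrix of pairwise Jaccard dissimilarities."""
    n = len(rows)
    return [[jaccard_dissimilarity(rows[i], rows[j]) for j in range(n)]
            for i in range(n)]


# Example: three unequal-sized clusters with an assumed 10% noise level.
rows, labels = simulate_binary([20, 10, 5], n_vars_per_cluster=8, noise=0.1)
d = dissimilarity_matrix(rows)
print(len(d), "objects,", len(rows[0]), "binary variables")
```

With `noise=0.0` the clusters are perfectly separated: every within-cluster dissimilarity is 0 and every between-cluster dissimilarity is 1. Raising `noise` blurs that contrast, which is the condition under which the indices are compared.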
Results
Although, in theory, all CVIs can differentiate between good and poor classifications, only a few performed as expected with noisy data. Tau and silhouette widths proved to be the best geometric CVIs for both equal and unequal cluster sizes. Among non-geometric indices, crispness and OptimClass performed best.
Conclusion
We recommend using these best-performing CVIs. We suggest plotting the CVI value against the number of clusters because the lack of a sharp peak means that the position of the maximum is uncertain.
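The recommendation to inspect the CVI value across cluster numbers can be sketched with the mean silhouette width, one of the indices named in the Results. The code below is an illustration under stated assumptions, not the authors' workflow: the toy dissimilarity matrix, the candidate partitions, and the helper names (`mean_silhouette`, `text_profile`) are all hypothetical. Singleton clusters receive a silhouette of 0, which is the usual convention.

```python
def mean_silhouette(d, labels):
    """Mean silhouette width from a dissimilarity matrix `d` and cluster labels."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean dissimilarity to the other members of the object's own cluster
        a = sum(d[i][j] for j in own) / len(own)
        # b: smallest mean dissimilarity to any other cluster
        b = min(sum(d[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


def text_profile(scores, width=40):
    """Plain-text stand-in for a plot of CVI value against number of clusters."""
    return [f"k={k}: {'#' * round(max(v, 0.0) * width)} {v:.3f}"
            for k, v in sorted(scores.items())]


# Toy data (assumed): two tight groups of three objects each.
d = [[0.0 if i == j else (0.2 if (i < 3) == (j < 3) else 0.9)
      for j in range(6)] for i in range(6)]

# Hypothetical candidate partitions, e.g. from cutting a dendrogram at k = 2, 3, 4.
partitions = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 2, 1, 1, 1],
    4: [0, 0, 2, 1, 1, 3],
}

scores = {k: mean_silhouette(d, labels) for k, labels in partitions.items()}
for line in text_profile(scores):
    print(line)
```

In this toy case the profile drops steeply after k = 2, so the maximum is unambiguous. A flat profile, where several values of k score almost equally, would indicate that the position of the maximum is uncertain.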
About the journal:
The Journal of Vegetation Science publishes papers on all aspects of plant community ecology, with particular emphasis on papers that develop new concepts or methods, test theory, identify general patterns, or are otherwise likely to interest a broad international readership. Papers may focus on any aspect of vegetation science, e.g. community structure (including community assembly and plant functional types), biodiversity (including species richness and composition), spatial patterns (including plant geography and landscape ecology), temporal changes (including demography, community dynamics and palaeoecology) and processes (including ecophysiology), provided the focus is on increasing our understanding of plant communities. The Journal publishes papers on the ecology of a single species only if that species plays a key role in structuring plant communities. Papers that apply ecological concepts, theories and methods to vegetation management, conservation and restoration, and papers on vegetation survey, should be directed to our associate journal, Applied Vegetation Science.