The DBCV index is more informative than DCSI, CDbw, and VIASCKDE indices for unsupervised clustering internal assessment of concave-shaped and density-based clusters.

IF 2.5 · JCR Q2 (Computer Science, Artificial Intelligence) · CAS Quartile 4 (Computer Science)
PeerJ Computer Science · Publication date: 2025-08-29 · eCollection date: 2025-01-01 · DOI: 10.7717/peerj-cs.3095
Davide Chicco, Giuseppe Sabino, Luca Oneto, Giuseppe Jurman
Volume 11, article e3095 · Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453699/pdf/
Citations: 0

Abstract

Clustering methods are unsupervised machine learning techniques that aggregate data points into specific groups, called clusters, according to criteria defined by the clustering algorithm employed. Since clustering methods are unsupervised, no ground truth or gold standard information is available to assess their results, making it challenging to know whether the obtained results are good or not. In this context, several internal clustering validation indices are available, such as the Silhouette coefficient, the Calinski-Harabasz index, the Davies-Bouldin index, the Dunn index, the Gap statistic, and the Shannon entropy, to mention a few. Although popular, these internal clustering scores work well only when used to assess convex-shaped and well-separated clusters, and they fail when utilized to evaluate concave-shaped and nested clusters. In these concave-shaped and density-based cases, other coefficients can be informative: the Density-Based Clustering Validation (DBCV) index, the Composed Density between and within clusters (CDbw) index, the Density Cluster Separability Index (DCSI), and the Validity Index for Arbitrary-Shaped Clusters based on Kernel Density Estimation (VIASCKDE). In this study, we describe the DBCV index precisely and compare its outcomes with those obtained by CDbw, DCSI, and VIASCKDE on several artificial datasets and on real-world medical datasets derived from electronic health records, with clusterings produced by density-based clustering methods such as density-based spatial clustering of applications with noise (DBSCAN). To do so, we propose an innovative approach based on worsening or improving a clustering result, rather than on searching for the "right" number of clusters as many studies do. Moreover, we also recommend open software packages in R and Python that implement the DBCV index. Our results demonstrate the higher reliability of the DBCV index over CDbw, DCSI, and VIASCKDE when assessing concave-shaped, nested clustering results.
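
To make the comparison concrete, below is a minimal, illustrative sketch (not the authors' code) of how the DBCV index can be computed in Python on a concave-shaped toy dataset clustered with DBSCAN, alongside the convexity-oriented Silhouette coefficient. It assumes scikit-learn and the hdbscan package are installed; hdbscan's validity_index() is one openly available implementation of DBCV, and the dataset, eps, and min_samples values are arbitrary choices for illustration only.

    # Minimal sketch, assuming scikit-learn and hdbscan are available:
    # score a DBSCAN clustering of a concave-shaped dataset with the DBCV index
    # (via hdbscan's validity_index) and with the Silhouette coefficient.
    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import silhouette_score
    from hdbscan.validity import validity_index  # one available DBCV implementation

    # Two interleaving half-moons: concave-shaped, density-based clusters.
    X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
    X = X.astype(np.float64)  # validity_index expects double precision

    labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # -1 marks noise points
    mask = labels != -1

    # DBCV lies in [-1, 1]; higher is better. It rewards dense, well-separated
    # clusters of arbitrary shape, whereas the Silhouette coefficient implicitly
    # assumes convex, well-separated clusters.
    dbcv = validity_index(X, labels)
    sil = silhouette_score(X[mask], labels[mask])
    print(f"DBCV index:       {dbcv:.3f}")
    print(f"Silhouette score: {sil:.3f}")

    # Toy illustration of the "worsening" idea mentioned in the abstract (not the
    # authors' protocol): corrupt part of the clustering by randomly reassigning
    # 50 points and observe that the DBCV score typically drops.
    rng = np.random.default_rng(0)
    worse = labels.copy()
    flip = rng.choice(np.where(mask)[0], size=50, replace=False)
    worse[flip] = rng.integers(0, labels.max() + 1, size=flip.size)
    print(f"DBCV after corrupting 50 labels: {validity_index(X, worse):.3f}")

On this kind of two-moons data, DBSCAN recovers the two concave clusters, and a density-aware score such as DBCV can rate the result highly even where a convexity-based score is less favorable; degrading the labels gives a simple sanity check that the index reacts in the expected direction.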

Source journal: PeerJ Computer Science (General Computer Science)
CiteScore: 6.10 · Self-citation rate: 5.30% · Articles published: 332 · Review time: 10 weeks
About the journal: PeerJ Computer Science is an open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.