The DBCV index is more informative than DCSI, CDbw, and VIASCKDE indices for unsupervised clustering internal assessment of concave-shaped and density-based clusters.
IF 2.5 4区 计算机科学Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Davide Chicco, Giuseppe Sabino, Luca Oneto, Giuseppe Jurman
{"title":"The DBCV index is more informative than DCSI, CDbw, and VIASCKDE indices for unsupervised clustering internal assessment of concave-shaped and density-based clusters.","authors":"Davide Chicco, Giuseppe Sabino, Luca Oneto, Giuseppe Jurman","doi":"10.7717/peerj-cs.3095","DOIUrl":null,"url":null,"abstract":"<p><p>Clustering methods are unsupervised machine learning techniques that aggregate data points into specific groups, called <i>clusters</i>, according to specific criteria defined by the clustering algorithm employed. Since clustering methods are unsupervised, no ground truth or gold standard information is available to assess its results, making it challenging to know the results obtained are good or not. In this context, several clustering internal rates are available, like Silhouette coefficient, Calinski-Harabasz index, Davies-Bouldin, Dunn index, Gap statistic, and Shannon entropy, just to mention a few. Even if popular, these clustering internal scores work well only when used to assess convex-shaped and well-separated clusters, but they fail when utilized to evaluate concave-shaped and nested clusters. In these concave-shaped and density-based cases, other coefficients can be informative: Density-Based Clustering Validation Index (DBCVI), Compose Density between and within clusters Index (CDbw), Density Cluster Separability Index (DCSI), Validity Index for Arbitrary-Shaped Clusters based on the kernel density estimation (VIASCKDE). In this study, we describe the DBCV index precisely, and compare its outcomes with the outcomes obtained by CDbw, DCSI, and VIASCKDE on several artificial datasets and on real-world medical datasets derived from electronic health records, produced by density-based clustering methods such as density-based spatial clustering of applications with noise (DBSCAN). To do so, we propose an innovative approach based on clustering result worsening or improving, rather than focusing on searching the \"right\" number of clusters like many studies do. Moreover, we also recommend open software packages in R and Python for its usage. Our results demonstrate the higher reliability of the DBCV index over CDbw, DCSI, and VIASCKDE when assessing concave-shaped, nested, clustering results.</p>","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"11 ","pages":"e3095"},"PeriodicalIF":2.5000,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453699/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PeerJ Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.7717/peerj-cs.3095","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Clustering methods are unsupervised machine learning techniques that aggregate data points into specific groups, called clusters, according to specific criteria defined by the clustering algorithm employed. Since clustering methods are unsupervised, no ground truth or gold standard information is available to assess its results, making it challenging to know the results obtained are good or not. In this context, several clustering internal rates are available, like Silhouette coefficient, Calinski-Harabasz index, Davies-Bouldin, Dunn index, Gap statistic, and Shannon entropy, just to mention a few. Even if popular, these clustering internal scores work well only when used to assess convex-shaped and well-separated clusters, but they fail when utilized to evaluate concave-shaped and nested clusters. In these concave-shaped and density-based cases, other coefficients can be informative: Density-Based Clustering Validation Index (DBCVI), Compose Density between and within clusters Index (CDbw), Density Cluster Separability Index (DCSI), Validity Index for Arbitrary-Shaped Clusters based on the kernel density estimation (VIASCKDE). In this study, we describe the DBCV index precisely, and compare its outcomes with the outcomes obtained by CDbw, DCSI, and VIASCKDE on several artificial datasets and on real-world medical datasets derived from electronic health records, produced by density-based clustering methods such as density-based spatial clustering of applications with noise (DBSCAN). To do so, we propose an innovative approach based on clustering result worsening or improving, rather than focusing on searching the "right" number of clusters like many studies do. Moreover, we also recommend open software packages in R and Python for its usage. Our results demonstrate the higher reliability of the DBCV index over CDbw, DCSI, and VIASCKDE when assessing concave-shaped, nested, clustering results.
期刊介绍:
PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.