The DBCV index is more informative than DCSI, CDbw, and VIASCKDE indices for unsupervised clustering internal assessment of concave-shaped and density-based clusters.

IF 2.5 · JCR Q2 (Computer Science, Artificial Intelligence) · CAS Quartile 4 (Computer Science)
PeerJ Computer Science · Publication date: 2025-08-29 · eCollection date: 2025-01-01 · DOI: 10.7717/peerj-cs.3095
Davide Chicco, Giuseppe Sabino, Luca Oneto, Giuseppe Jurman
Volume 11, article e3095 · Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453699/pdf/
Citations: 0

Abstract

Clustering methods are unsupervised machine learning techniques that aggregate data points into specific groups, called clusters, according to criteria defined by the clustering algorithm employed. Since clustering methods are unsupervised, no ground truth or gold standard information is available to assess their results, making it challenging to know whether the obtained results are good or not. In this context, several internal clustering validation indices are available, such as the Silhouette coefficient, the Calinski-Harabasz index, the Davies-Bouldin index, the Dunn index, the Gap statistic, and the Shannon entropy, to mention a few. Although popular, these internal clustering scores work well only when used to assess convex-shaped and well-separated clusters, and they fail when utilized to evaluate concave-shaped and nested clusters. In these concave-shaped and density-based cases, other coefficients can be informative: the Density-Based Clustering Validation (DBCV) index, the Composed Density between and within clusters (CDbw) index, the Density Cluster Separability Index (DCSI), and the Validity Index for Arbitrary-Shaped Clusters based on Kernel Density Estimation (VIASCKDE). In this study, we describe the DBCV index precisely and compare its outcomes with those obtained by CDbw, DCSI, and VIASCKDE on several artificial datasets and on real-world medical datasets derived from electronic health records, with clusterings produced by density-based clustering methods such as density-based spatial clustering of applications with noise (DBSCAN). To do so, we propose an innovative approach based on worsening or improving a clustering result, rather than on searching for the "right" number of clusters as many studies do. Moreover, we also recommend open software packages in R and Python that implement the DBCV index. Our results demonstrate the higher reliability of the DBCV index over CDbw, DCSI, and VIASCKDE when assessing concave-shaped, nested clustering results.
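
To make the comparison concrete, below is a minimal, illustrative sketch (not the authors' code) of how the DBCV index can be computed in Python on a concave-shaped toy dataset clustered with DBSCAN, alongside the convexity-oriented Silhouette coefficient. It assumes scikit-learn and the hdbscan package are installed; hdbscan's validity_index() is one openly available implementation of DBCV, and the dataset, eps, and min_samples values are arbitrary choices for illustration only.

    # Minimal sketch, assuming scikit-learn and hdbscan are available:
    # score a DBSCAN clustering of a concave-shaped dataset with the DBCV index
    # (via hdbscan's validity_index) and with the Silhouette coefficient.
    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import silhouette_score
    from hdbscan.validity import validity_index  # one available DBCV implementation

    # Two interleaving half-moons: concave-shaped, density-based clusters.
    X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
    X = X.astype(np.float64)  # validity_index expects double precision

    labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # -1 marks noise points
    mask = labels != -1

    # DBCV lies in [-1, 1]; higher is better. It rewards dense, well-separated
    # clusters of arbitrary shape, whereas the Silhouette coefficient implicitly
    # assumes convex, well-separated clusters.
    dbcv = validity_index(X, labels)
    sil = silhouette_score(X[mask], labels[mask])
    print(f"DBCV index:       {dbcv:.3f}")
    print(f"Silhouette score: {sil:.3f}")

    # Toy illustration of the "worsening" idea mentioned in the abstract (not the
    # authors' protocol): corrupt part of the clustering by randomly reassigning
    # 50 points and observe that the DBCV score typically drops.
    rng = np.random.default_rng(0)
    worse = labels.copy()
    flip = rng.choice(np.where(mask)[0], size=50, replace=False)
    worse[flip] = rng.integers(0, labels.max() + 1, size=flip.size)
    print(f"DBCV after corrupting 50 labels: {validity_index(X, worse):.3f}")

On this kind of two-moons data, DBSCAN recovers the two concave clusters, and a density-aware score such as DBCV can rate the result highly even where a convexity-based score is less favorable; degrading the labels gives a simple sanity check that the index reacts in the expected direction.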

Source journal: PeerJ Computer Science (General Computer Science)
CiteScore: 6.10 · Self-citation rate: 5.30% · Articles published: 332 · Review time: 10 weeks
About the journal: PeerJ Computer Science is an open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.