DBSCAN and DBCV application to open medical records heterogeneous data for identifying clinically significant clusters of patients with neuroblastoma.

IF 6.1 · CAS Zone 3 (Biology) · JCR Q1, Mathematical & Computational Biology

BioData Mining · Pub Date: 2025-06-12 · DOI: 10.1186/s13040-025-00455-8

Davide Chicco, Luca Oneto, Davide Cangelosi

BioData Mining, volume 18, issue 1, article 40. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12164137/pdf/

Citations: 0

Abstract

Neuroblastoma is a common pediatric cancer that affects thousands of children worldwide, especially those under five years of age. Although recovery is possible in 80% of neuroblastoma cases, only 40% of patients with high-risk stage four neuroblastoma survive. Electronic health records of patients with this disease contain valuable data that biomedical informatics researchers can analyze using computational intelligence and statistical software. Unsupervised machine learning methods, in particular, can identify clinically significant subgroups of patients, which can lead to new therapies or medical treatments for future patients belonging to the same subgroups. However, access to these datasets is often restricted, making it difficult to obtain them for independent research projects. In this study, we retrieved three open datasets containing data from patients diagnosed with neuroblastoma: the Genoa dataset and the Shanghai dataset from the Neuroblastoma Electronic Health Records Open Data Repository, and a dataset from the renowned TARGET-NBL program. We analyzed these datasets using several clustering techniques and assessed the results with the DBCV (Density-Based Clustering Validation) index. Among these algorithms, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was the only one that produced meaningful results. We scrutinized the two clusters of patient profiles that DBSCAN identified in the three datasets and found several relevant clinical variables that clearly partitioned the patients into two clusters with clinical meaning in the neuroblastoma literature. Our results can have a significant impact on health informatics, because any computational analyst wishing to cluster small datasets of patients with a rare disease can choose DBSCAN and DBCV rather than more common methods such as k-means and the Silhouette coefficient.
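To make the density-based idea behind DBSCAN concrete, the following is a minimal pure-Python sketch of the textbook algorithm. It is not the authors' code and does not use their datasets: the toy points, the `eps` radius, and the `min_pts` threshold are hypothetical values chosen only for illustration, and DBCV scoring is not shown.

```python
import math

NOISE = -1  # label for points that belong to no dense region


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels


# Hypothetical toy data: two tight groups plus one isolated point.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
    (9.0, 0.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Unlike k-means, this procedure does not need the number of clusters in advance and can leave isolated points unassigned, which is one reason a density-based validation index such as DBCV is a natural companion for it.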




Source journal

BioData Mining (Mathematical & Computational Biology)

CiteScore: 7.90
Self-citation rate: 0.00%
Articles published: (value missing in source)
Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.