Towards precision oncology: a multi-level cancer classification system integrating liquid biopsy and machine learning.

IF 6.1 | Zone 3, Biology | Q1, Mathematical & Computational Biology

Biodata Mining, Vol. 18, No. 1, p. 29 | Pub Date: 2025-04-11 | DOI: 10.1186/s13040-025-00439-8

Amr Eledkawy, Taher Hamza, Sara El-Metwally


Citations: 0

Abstract

Background: Millions of people die from cancer every year. Early cancer detection is crucial for ensuring higher survival rates, as it provides an opportunity for timely medical interventions. This paper proposes a multi-level cancer classification system that uses plasma cfDNA/ctDNA mutations and protein biomarkers to identify seven distinct cancer types: colorectal, breast, upper gastrointestinal, lung, pancreas, ovarian, and liver.

Results: The proposed system employs a multi-stage binary classification framework in which each stage is customized for a specific cancer type. Features are selected by majority vote across six feature selectors: Information Value, Chi-Square, Random Forest Feature Importance, Extra Tree Feature Importance, Recursive Feature Elimination, and L1 Regularization. After feature selection, classifiers (eXtreme Gradient Boosting, Random Forest, Extra Tree, and Quadratic Discriminant Analysis) are customized for each cancer type, either individually or in an ensemble soft voting setup, to optimize predictive accuracy. The proposed system outperformed previously published results, achieving an AUC of 98.2% and an accuracy of 96.21%. To ensure reproducibility, the trained models and the dataset used in this study are publicly available in the GitHub repository (https://github.com/SaraEl-Metwally/Towards-Precision-Oncology).
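The majority-vote feature selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the vote threshold, so the sketch assumes a feature is kept when more than half of the selectors pick it. The function name `majority_vote_features`, the selector keys, and the feature names are hypothetical placeholders.

```python
from collections import Counter


def majority_vote_features(selections, min_votes=None):
    """Keep features chosen by at least `min_votes` selectors.

    `selections` maps a selector name to the set of features it picked.
    Assumption: by default a feature needs votes from more than half of
    the selectors (the abstract does not give the actual threshold).
    """
    if min_votes is None:
        min_votes = len(selections) // 2 + 1
    # Count one vote per selector per feature.
    votes = Counter(f for chosen in selections.values() for f in set(chosen))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical selector outputs; feature names are placeholders,
# not biomarkers reported in the paper.
selections = {
    "information_value": {"protein_A", "protein_B", "mutation_X"},
    "chi_square": {"protein_A", "mutation_X"},
    "rf_importance": {"protein_A", "protein_B", "protein_C"},
    "extra_tree_importance": {"protein_A", "protein_B"},
    "rfe": {"protein_A", "mutation_X", "protein_C"},
    "l1": {"protein_A", "protein_B", "mutation_X"},
}

selected = majority_vote_features(selections)
print(selected)  # ['mutation_X', 'protein_A', 'protein_B']
```

With six selectors the default threshold is four votes, so `protein_C` (two votes) is dropped. Passing `min_votes` explicitly makes the selection stricter or looser.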

Conclusion: The identified biomarkers enhance the interpretability of the diagnosis, facilitating more informed decision-making. The system's performance underscores its effectiveness in tissue localization, contributing to improved patient outcomes through timely medical interventions.
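The tissue localization mentioned here rests on the multi-stage binary framework with soft voting from the Results. The sketch below shows one way such a framework could be wired together. It assumes, beyond what the abstract states, that stages run in a fixed order, that the first stage whose averaged probability reaches a threshold decides the label, and that each model is a callable returning a positive-class probability. All names, markers, and numbers are hypothetical.

```python
def soft_vote(probabilities):
    """Average the positive-class probabilities from several classifiers."""
    return sum(probabilities) / len(probabilities)


def classify_cascade(sample, stages, threshold=0.5):
    """Run the binary stages in order.

    `stages` is a list of (cancer_type, models) pairs, where each model is
    a callable returning the probability that `sample` has that cancer type.
    Returns (cancer_type, score) for the first stage that fires, or
    (None, None) if no stage does. The ordering and stopping rule are
    assumptions made for illustration.
    """
    for cancer_type, models in stages:
        score = soft_vote([model(sample) for model in models])
        if score >= threshold:
            return cancer_type, score
    return None, None


# Toy stand-ins for trained classifiers (hypothetical markers).
stages = [
    ("colorectal", [lambda s: s["marker_1"], lambda s: 0.5 * s["marker_1"]]),
    ("breast", [lambda s: s["marker_2"]]),
]

sample = {"marker_1": 0.4, "marker_2": 0.9}
label, score = classify_cascade(sample, stages)
print(label, score)  # breast 0.9
```

In this toy run the colorectal stage averages to about 0.3 and does not fire, so the sample falls through to the breast stage. A stage with a single model reduces to that model's own probability, which matches the "individually or in an ensemble" wording of the abstract.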


Source journal

Biodata Mining (Mathematical & Computational Biology)

CiteScore: 7.90

Self-citation rate: 0.00%

Articles published: (not listed)

Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.