Towards precision oncology: a multi-level cancer classification system integrating liquid biopsy and machine learning.
Authors: Amr Eledkawy, Taher Hamza, Sara El-Metwally
Journal: BioData Mining 18(1):29, published 2025-04-11 (impact factor 4.0; JCR Q1, Mathematical & Computational Biology)
DOI: https://doi.org/10.1186/s13040-025-00439-8
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11987386/pdf/
Background: Millions of people die from cancer every year. Early cancer detection is crucial for ensuring higher survival rates, as it provides an opportunity for timely medical interventions. This paper proposes a multi-level cancer classification system that uses plasma cfDNA/ctDNA mutations and protein biomarkers to identify seven distinct cancer types: colorectal, breast, upper gastrointestinal, lung, pancreas, ovarian, and liver.
Results: The proposed system employs a multi-stage binary classification framework in which each stage is customized for a specific cancer type. Features are selected by majority vote across six feature selectors: Information Value, Chi-Square, Random Forest Feature Importance, Extra Tree Feature Importance, Recursive Feature Elimination, and L1 Regularization. After feature selection, classifiers (eXtreme Gradient Boosting, Random Forest, Extra Tree, and Quadratic Discriminant Analysis) are customized for each cancer type, either individually or in a soft-voting ensemble, to optimize predictive accuracy. The proposed system outperformed previously published results, achieving an AUC of 98.2% and an accuracy of 96.21%. To ensure reproducibility, the trained models and the dataset used in this study are publicly available in the GitHub repository (https://github.com/SaraEl-Metwally/Towards-Precision-Oncology).
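As a rough illustration of majority-vote feature selection, the sketch below counts how many selectors picked each feature and keeps those chosen by a strict majority. The abstract does not state the vote threshold, so the default of "more than half of the selectors" is an assumption. The function name, selector keys, and feature names are hypothetical and are not taken from the authors' repository.

```python
from collections import Counter


def majority_vote_features(selections, min_votes=None):
    """Keep features chosen by at least `min_votes` selectors.

    selections: dict mapping a selector name to the features it selected.
    min_votes:  assumed default is a strict majority of the selectors.
    """
    if min_votes is None:
        min_votes = len(selections) // 2 + 1
    votes = Counter()
    for chosen in selections.values():
        votes.update(set(chosen))  # each selector votes once per feature
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical selector outputs; the feature names are placeholders.
selections = {
    "information_value": ["protein_a", "protein_b", "mutation_score"],
    "chi_square": ["protein_a", "protein_c"],
    "rf_importance": ["protein_a", "protein_b", "mutation_score"],
    "extra_tree_importance": ["protein_a", "mutation_score", "protein_d"],
    "rfe": ["protein_a", "protein_b", "mutation_score"],
    "l1": ["protein_b", "mutation_score"],
}

selected = majority_vote_features(selections)
print(selected)
```

With six selectors the default threshold is four votes, so a feature picked by only one or two selectors is dropped.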
Conclusion: The identified biomarkers enhance the interpretability of the diagnosis, facilitating more informed decision-making. The system's performance underscores its effectiveness in tissue localization, contributing to improved patient outcomes through timely medical interventions.
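The multi-stage binary framework with soft voting mentioned in the Results can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the stages run in a fixed order and that the first stage whose averaged probability reaches a threshold assigns the cancer type. The stage order, the 0.5 threshold, and the stand-in models (`make_model`) are all hypothetical.

```python
def soft_vote(probabilities):
    """Soft voting: average the positive-class probabilities."""
    return sum(probabilities) / len(probabilities)


def classify_cascade(sample, stages, threshold=0.5):
    """Run one binary stage per cancer type, in order.

    Returns (cancer_type, score) for the first stage whose soft-voted
    probability reaches `threshold`, or (None, 0.0) if no stage fires.
    The first-positive-wins rule is an assumption of this sketch.
    """
    for cancer_type, models in stages:
        score = soft_vote([model(sample) for model in models])
        if score >= threshold:
            return cancer_type, score
    return None, 0.0


def make_model(feature, weight):
    """Hypothetical stand-in for a trained classifier's probability output."""
    return lambda sample: min(1.0, weight * sample[feature])


# Placeholder stages and sample; real stages would wrap trained classifiers.
stages = [
    ("colorectal", [make_model("protein_b", 1.0), make_model("protein_b", 0.5)]),
    ("liver", [make_model("protein_a", 1.0), make_model("protein_a", 0.5)]),
]
sample = {"protein_a": 0.9, "protein_b": 0.2}

label, score = classify_cascade(sample, stages)
print(label)
```

Averaging probabilities rather than counting hard votes lets a confident classifier outweigh an uncertain one within a stage.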
About the journal:
BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data.
Topical areas include, but are not limited to:
-Development, evaluation, and application of novel data mining and machine learning algorithms.
-Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
-Open-source software for the application of data mining and machine learning algorithms.
-Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies.
-Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.