Ensemble machine learning and tree-structured Parzen estimator to predict early-stage pancreatic cancer
Kah Keng Wong
Biomedical Signal Processing and Control, volume 108, Article 107867. Published 2025-04-16.
DOI: 10.1016/j.bspc.2025.107867
https://www.sciencedirect.com/science/article/pii/S1746809425003787
Abstract
Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest malignancies because it is difficult to diagnose at an early stage. In this study, RNA-sequencing data of tumor-educated platelets (GSE183635) were analyzed using machine learning (ML) algorithms to predict early-stage PDAC. Differentially expressed genes specific to early-stage PDAC were selected as features to build the ML models. The ML algorithms used were linear (logistic regression) and non-linear [support vector machine (SVM), random forest (RF), XGBoost (XGB), and LightGBM (GBM)]. Given the limitations of existing non-probabilistic algorithms for optimizing early-stage PDAC detection, the tree-structured Parzen estimator (TPE) algorithm was used for hyperparameter optimization through probabilistic modeling. TPE identified the optimal model for each ML algorithm and was particularly effective for the non-linear algorithms (SVM, RF, XGB, and GBM). To leverage the strengths of the individual algorithms, ensemble models combining up to three of them were evaluated, and a weighted ensemble integrating SVM, RF, and GBM (i.e., the SVM:RF:GBM ensemble model) outperformed the individual models. The SVM:RF:GBM ensemble model showed the best performance metrics in the calibrated test set (ROC AUC: 0.905; sensitivity: 0.857; specificity: 0.850). In both the calibrated training and test sets, the ensemble model performed consistently across 13 different performance metrics; this consistency was not observed in the individual models. In conclusion, the SVM:RF:GBM ensemble model optimized by TPE represents a novel predictive model for early-stage PDAC, and this study proposes a framework for predictive model construction in cancer diagnosis.
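The weighted-ensemble idea and the three headline metrics can be illustrated with a minimal sketch. It assumes each model outputs a positive-class probability per sample and that the ensemble score is a weighted average of those probabilities. The abstract does not state the actual weights, calibration method, or decision threshold, so the weights, the 0.5 threshold, and the toy probabilities below are hypothetical and are not the study's data.

```python
from typing import Dict, List, Sequence, Tuple


def weighted_ensemble(probs: Dict[str, Sequence[float]],
                      weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-model positive-class probabilities."""
    total = sum(weights[name] for name in probs)
    n_samples = len(next(iter(probs.values())))
    return [
        sum(weights[name] * probs[name][i] for name in probs) / total
        for i in range(n_samples)
    ]


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(y_true: Sequence[int],
                            scores: Sequence[float],
                            threshold: float = 0.5) -> Tuple[float, float]:
    """Sensitivity (true-positive rate) and specificity (true-negative rate)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 1 = early-stage PDAC, 0 = control.
y_true = [1, 1, 1, 0, 0, 0]
probs = {
    "SVM": [0.9, 0.4, 0.7, 0.3, 0.6, 0.2],
    "RF":  [0.8, 0.6, 0.3, 0.2, 0.4, 0.1],
    "GBM": [0.7, 0.7, 0.6, 0.4, 0.3, 0.3],
}
# Hypothetical weights; in practice these would be tuned on validation data.
weights = {"SVM": 0.5, "RF": 0.3, "GBM": 0.2}

ensemble_scores = weighted_ensemble(probs, weights)

for name, scores in {**probs, "SVM:RF:GBM": ensemble_scores}.items():
    sens, spec = sensitivity_specificity(y_true, scores)
    print(f"{name}: AUC={roc_auc(y_true, scores):.3f} "
          f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

In this toy example each individual model misranks or misclassifies at least one sample, while the weighted average does not. This is only meant to show how an ensemble can outperform its members; it says nothing about the reported results.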
About the journal:
Biomedical Signal Processing and Control aims to provide a cross-disciplinary international forum for the interchange of information on research in the measurement and analysis of signals and images in clinical medicine and the biological sciences. Emphasis is placed on contributions dealing with practical, applications-led research on the use of methods and devices in clinical diagnosis, patient monitoring and management.
Biomedical Signal Processing and Control reflects the main areas in which these methods are being used and developed at the interface of both engineering and clinical science. The scope of the journal is defined to include relevant review papers, technical notes, short communications and letters. Tutorial papers and special issues will also be published.