Ensemble machine learning and tree-structured Parzen estimator to predict early-stage pancreatic cancer

IF 4.9 | Zone 2, Medicine | Q1 ENGINEERING, BIOMEDICAL

Biomedical Signal Processing and Control | Pub Date: 2025-04-16 | DOI: 10.1016/j.bspc.2025.107867

Kah Keng Wong

Volume 108, Article 107867 | https://www.sciencedirect.com/science/article/pii/S1746809425003787

Citations: 0

Abstract

Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest malignancies, largely because the disease is difficult to diagnose at an early stage. In this study, RNA-sequencing data of tumor-educated platelets (GSE183635) were analyzed using machine learning (ML) algorithms to predict early-stage PDAC. Differentially expressed genes specific to early-stage PDAC were selected as features to build the ML models. The ML algorithms comprised a linear algorithm (logistic regression) and non-linear algorithms [support vector machine (SVM), random forest (RF), XGBoost (XGB), and LightGBM (GBM)]. Given the limitations of existing non-probabilistic algorithms in optimizing early-stage PDAC detection, the tree-structured Parzen estimator (TPE) algorithm was used for hyperparameter optimization through probabilistic modeling. TPE identified the optimal model for each ML algorithm and was particularly effective for the non-linear algorithms (SVM, RF, XGB, and GBM). To leverage the strengths of the individual ML algorithms, ensemble modeling that combined up to three individual algorithms was performed, and a weighted ensemble integrating SVM, RF, and GBM (i.e., the SVM:RF:GBM ensemble model) outperformed the individual models. The SVM:RF:GBM ensemble model showed optimal performance metrics in the calibrated test set (ROC AUC: 0.905; sensitivity: 0.857; specificity: 0.850). In both the calibrated training and test sets, the ensemble model performed consistently across 13 different performance metrics, a consistency that was not observed in the individual models. In conclusion, the SVM:RF:GBM ensemble model optimized by TPE represents a novel predictive model for early-stage PDAC, and this study proposes a framework for predictive model construction in cancer diagnosis.
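The weighted ensemble described in the abstract can be illustrated with a minimal sketch. It assumes the ensemble is a weighted average of each model's predicted probability of early-stage PDAC (soft voting), which the abstract does not spell out. The function names, the weights, the 0.5 threshold, and the toy probabilities are all hypothetical and chosen only for illustration; they are not the study's values or results.

```python
from typing import Dict, List, Sequence, Tuple


def weighted_ensemble(probs: Dict[str, Sequence[float]],
                      weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-model positive-class probabilities."""
    names = list(probs)
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(probs[names[0]])
    return [
        sum(weights[name] * probs[name][i] for name in names) / total
        for i in range(n_samples)
    ]


def sensitivity_specificity(y_true: Sequence[int],
                            y_prob: Sequence[float],
                            threshold: float = 0.5) -> Tuple[float, float]:
    """Sensitivity (true-positive rate) and specificity (true-negative rate)."""
    tp = fn = tn = fp = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1:
            tp += predicted
            fn += 1 - predicted
        else:
            fp += predicted
            tn += 1 - predicted
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("y_true must contain both classes")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (1 = early-stage PDAC, 0 = control); made-up data, not from GSE183635.
y_true = [1, 1, 1, 0, 0, 0]

# Hypothetical calibrated probabilities from three already-tuned models.
model_probs = {
    "SVM": [0.9, 0.3, 0.8, 0.2, 0.1, 0.6],
    "RF":  [0.6, 0.8, 0.4, 0.3, 0.2, 0.3],
    "GBM": [0.7, 0.9, 0.7, 0.6, 0.3, 0.2],
}

# Hypothetical SVM:RF:GBM weights; the abstract does not report the actual ones.
weights = {"SVM": 2.0, "RF": 1.0, "GBM": 1.0}

ensemble_prob = weighted_ensemble(model_probs, weights)
for name, prob in list(model_probs.items()) + [("SVM:RF:GBM", ensemble_prob)]:
    sens, spec = sensitivity_specificity(y_true, prob)
    print(f"{name}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

In this toy setup each model misclassifies a different sample, so averaging the probabilities can correct errors that any single model makes. In practice the weights would themselves be tuned, for example as part of the hyperparameter search, and evaluated on held-out data.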

Source journal: Biomedical Signal Processing and Control (Engineering & Technology; Engineering: Biomedical)

CiteScore: 9.80

Self-citation rate: 13.70%

Articles published: 822

Review time: 4 months

About the journal: Biomedical Signal Processing and Control aims to provide a cross-disciplinary international forum for the interchange of information on research in the measurement and analysis of signals and images in clinical medicine and the biological sciences. Emphasis is placed on contributions dealing with practical, applications-led research on the use of methods and devices in clinical diagnosis, patient monitoring and management. Biomedical Signal Processing and Control reflects the main areas in which these methods are being used and developed at the interface of both engineering and clinical science. The scope of the journal is defined to include relevant review papers, technical notes, short communications and letters. Tutorial papers and special issues will also be published.