Multi-model machine learning framework for lung cancer risk prediction: A comparative analysis of nine classifiers with hybrid and ensemble approaches using behavioral and hematological parameters

IF 2.5 · Zone 4 (Medicine) · JCR Q3 · Biochemical Research Methods
Vinod Kumar, Chander Prabha, Deepali Gupta, Sapna Juneja, Swati Kumari, Ali Nauman
SLAS Technology, vol. 33, Article 100314. Published 2025-06-23. DOI: 10.1016/j.slast.2025.100314
https://www.sciencedirect.com/science/article/pii/S247263032500072X
Citation count: 0

Abstract

Lung cancer (LC) remains the leading cause of cancer deaths worldwide, which calls for more sophisticated detection strategies. This study investigates 34 demographic, behavioral, and hematological risk factors in a sample of 2,000 patient records. A multi-model machine learning approach compares nine algorithms: k-nearest neighbors (KNN), AdaBoost (AB), logistic regression (LR), random forest (RF), support vector machine (SVM), naive Bayes (NB), decision tree (DT), gradient boosting (GB), and stochastic gradient descent (SGD). Five performance measures (accuracy, sensitivity, specificity, F1-score, AUC) reveal quantitative differences: GB achieved the best F1-score (0.953) and the best sensitivity (99.1%), while NB achieved the second-best F1-score (0.945). The KNN-AB hybrid model reported the highest accuracy (99.5%), and RF reported the highest AUC (0.92). The ensemble approaches (RF, GB) showed robust predictive performance across measures by integrating the complementary strengths of their base models. Lasso and ridge regression reduced overfitting and made the resulting models easier to interpret. Clinical applications include integration into electronic health records (EHRs) for computerized risk stratification, earlier LC screening, and public health interventions for high-risk subjects (smokers with abnormal hematologic markers). The research highlights the value of hybrid ML models that integrate behavioral and biological data to predict LC effectively. Future work can expand predictive capability by incorporating imaging and genomics data, and can continue to advance early identification and patient-specific therapy options. The work sits at the intersection of computational advances and clinical translation, providing scalable solutions for global LC diagnosis.
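The abstract ranks the classifiers by accuracy, sensitivity, specificity, and F1-score. As a minimal sketch of how those four measures follow from a binary confusion matrix, the Python below computes them from true and predicted labels. The names `ConfusionCounts`, `confusion_counts`, and `binary_metrics` are hypothetical helpers, and the toy labels are illustrative only and are not the study's data. AUC is omitted because it needs predicted scores rather than hard labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a binary confusion matrix (1 = high risk, 0 = low risk)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells from paired 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp=tp, fp=fp, tn=tn, fn=fn)


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(c):
    """Return accuracy, sensitivity, specificity, and F1 for the counts in c."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)   # recall on the positive class
    specificity = _ratio(c.tn, c.tn + c.fp)   # recall on the negative class
    precision = _ratio(c.tp, c.tp + c.fp)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; not drawn from the paper.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    for name, value in binary_metrics(confusion_counts(y_true, y_pred)).items():
        print(f"{name}: {value:.3f}")
```

Because sensitivity and specificity are computed on different classes, a model can lead on one measure and trail on another, which is consistent with the abstract reporting different best performers per measure.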
Source journal: SLAS Technology (Computer Science: Computer Science Applications)
CiteScore: 6.30 · Self-citation rate: 7.40% · Articles published: 47 · Review time: 106 days
About the journal: SLAS Technology emphasizes scientific and technical advances that enable and improve life sciences research and development; drug delivery; diagnostics; biomedical and molecular imaging; and personalized and precision medicine. This includes high-throughput and other laboratory automation technologies; micro/nanotechnologies; analytical, separation, and quantitative techniques; synthetic chemistry and biology; informatics (data analysis, statistics, bioinformatics, genomics, and chemoinformatics); and more.