An online explainable ensemble machine learning model for predicting epidermal growth factor receptor mutation status in lung adenocarcinoma.

IF 3.5 | Zone 2 (Medicine) | JCR Q2 Oncology

Translational Lung Cancer Research | Pub Date: 2025-07-31 | Epub Date: 2025-07-28 | DOI: 10.21037/tlcr-2025-237

Qilong Song, Xiaohu Li, Biao Song, Tingting Zhang, Xiankuo Hu, Ao Li, Dongchun Ma, Xuhong Min, Yongqiang Yu

Volume 14, issue 7, pages 2670-2687. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12337052/pdf/

Citations: 0

Abstract

Background: Non-invasive determination of epidermal growth factor receptor (EGFR) mutation status is essential for selecting lung adenocarcinoma patients suitable for EGFR-tyrosine kinase inhibitors (EGFR-TKIs). This study aimed to develop and validate an online ensemble machine learning (EML) model that combines multiple machine learning (ML) models to predict the EGFR mutation status in lung adenocarcinoma.

Methods: A total of 823 lung adenocarcinoma patients with known EGFR mutation status from three medical centers were divided into a training cohort (n=556) and a validation cohort (n=267). The study is registered as ChiCTR2400083082 in the WHO International Clinical Trials Registry. Five ML models incorporating clinical and radiological characteristics, namely random forest (RF), logistic regression (LR), support vector machine (SVM), light gradient boosting machine (LightGBM), and extreme gradient boosting (XGBoost), were constructed alongside a CT-based deep learning (DL) model to predict EGFR mutation status. Subsequently, an EML model was created by combining these models. Model performance was assessed using the area under the receiver operating characteristic curve (AUC), and the SHapley Additive exPlanation (SHAP) method was used to explain the EML model.
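The abstract does not state how the individual models were combined into the EML model. As a minimal sketch only, the block below assumes unweighted soft voting (averaging each model's predicted probability) and computes AUC with the rank-based definition. The function names `auc` and `soft_vote`, the model names, and all numbers are hypothetical illustrations on synthetic toy data, not the authors' implementation or data.

```python
from statistics import mean


def auc(labels, scores):
    """Rank-based AUC: chance that a random positive case scores above
    a random negative case, with ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def soft_vote(prob_lists):
    """Unweighted soft voting: average per-patient probabilities across models.
    Assumption: the study may have used a different combination rule."""
    return [mean(column) for column in zip(*prob_lists)]


# Synthetic toy data (NOT from the study): 1 = EGFR-mutant, 0 = wild-type.
y_true = [1, 1, 1, 0, 0, 0]
model_probs = {
    "clinical_model": [0.9, 0.4, 0.6, 0.5, 0.2, 0.3],  # hypothetical
    "dl_model":       [0.6, 0.8, 0.3, 0.2, 0.5, 0.1],  # hypothetical
}

ensemble = soft_vote(list(model_probs.values()))

for name, probs in model_probs.items():
    print(name, round(auc(y_true, probs), 3))
print("ensemble", round(auc(y_true, ensemble), 3))
```

On this toy data the two single models make different mistakes, so the averaged score separates the classes better than either alone. That outcome comes from how the example was constructed and says nothing about the study's actual results.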

Results: In the training cohort, the AUCs for the RF, LR, SVM, LightGBM, XGBoost, DL, and EML models were 0.851, 0.790, 0.810, 0.835, 0.853, 0.884, and 0.928, respectively. In the validation cohort, the corresponding AUCs were 0.753, 0.744, 0.732, 0.749, 0.751, 0.754, and 0.813. The DeLong test indicated that the EML model achieved a higher AUC than each of the single models in both the training and validation cohorts. Decision curve analysis indicated that the EML model provided a clinically useful net benefit, and calibration curves showed good agreement between predicted and observed probabilities. SHAP analysis identified the predictive characteristics, ranked by their contribution to the EML model: DL score, long-axis diameter, smoking history, pleural retraction, texture, vascular convergence, sex, air bronchogram, and bubble-like lucency. These characteristics were further used to develop an online web tool.
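The reported AUCs can be compared directly with a little arithmetic. The sketch below uses only the values given in the abstract and prints, for each model, the drop from training to validation, along with the EML model's margin over the best single model in validation. These are descriptive differences, not a substitute for the DeLong test used in the study; the variable names are illustrative.

```python
# AUCs as reported in the abstract: model -> (training, validation).
reported_auc = {
    "RF":       (0.851, 0.753),
    "LR":       (0.790, 0.744),
    "SVM":      (0.810, 0.732),
    "LightGBM": (0.835, 0.749),
    "XGBoost":  (0.853, 0.751),
    "DL":       (0.884, 0.754),
    "EML":      (0.928, 0.813),
}

# Best single (non-ensemble) model by validation AUC.
singles = {m: v for m, v in reported_auc.items() if m != "EML"}
best_single, (_, best_val) = max(singles.items(), key=lambda kv: kv[1][1])

# Margin of the ensemble over that model in the validation cohort.
gain = reported_auc["EML"][1] - best_val

for model, (train, val) in reported_auc.items():
    print(f"{model:9s} train={train:.3f} val={val:.3f} drop={train - val:.3f}")
print(f"EML gain over best single model ({best_single}) in validation: {gain:.3f}")
```

Every model loses some AUC between training and validation, which is expected when a model is evaluated on data it was not fitted to. The ensemble's validation AUC of 0.813 is 0.059 above the best single model (DL, 0.754).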

Conclusions: The EML model could serve as a non-invasive and accurate method for predicting EGFR mutation status in lung adenocarcinoma.

Source journal: Translational lung cancer research (Medicine-Oncology)

CiteScore: 7.20

Self-citation rate: 2.50%

Publication volume: 137

About the journal: Translational Lung Cancer Research (TLCR, Transl Lung Cancer Res, Print ISSN 2218-6751; Online ISSN 2226-4477) is an international, peer-reviewed, open-access journal founded in March 2012. TLCR is indexed by PubMed/PubMed Central and the Chemical Abstracts Service (CAS) Databases. It was published quarterly in its first year and has been published bimonthly since February 2013. It provides practical, up-to-date information on the prevention, early detection, diagnosis, and treatment of lung cancer. Specific areas of interest include, but are not limited to, multimodality therapy, markers, imaging, tumor biology, pathology, chemoprevention, and technical advances related to lung cancer.