Explainable machine learning model for predicting internal mammary node metastasis in breast cancer: Multi-method development and cross-cohort validation

IF 7.9, Zone 2 (Medicine), JCR Q1 (Obstetrics & Gynecology)

Breast. Pub Date: 2025-06-09. DOI: 10.1016/j.breast.2025.104517

Yirong Xiang, Jian Tie, Siyuan Zhang, Chen Shi, Changkuo Guo, Yushuo Peng, Zhaoqing Fan, Weihu Wang

Breast, Volume 82, Article 104517. Journal Article. https://www.sciencedirect.com/science/article/pii/S096097762500534X

Citations: 0

Abstract

Background

This study developed an explainable machine learning model for predicting baseline internal mammary lymph node metastasis (IMNM) in breast cancer patients.

Materials and methods

This study included three cohorts: a derivation cohort (n = 1997) from Peking University Cancer Hospital, a temporal testing cohort (n = 633) from the same center, and a SEER cohort (n = 51,420). Multiple machine learning strategies were applied. Feature selection used the Least Absolute Shrinkage and Selection Operator (LASSO), Boruta, backward stepwise regression, and best subset selection. Model construction used logistic regression (LR), support vector machines (SVM), k-nearest neighbors (KNN), and extreme gradient boosting (XGBoost). The best-performing model was validated in the internal and temporal testing cohorts. Shapley Additive Explanations (SHAP) analysis was conducted to improve interpretability.
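To illustrate the idea behind the SHAP analysis mentioned above, the following minimal sketch computes exact Shapley values for a toy risk score by enumerating feature coalitions. It is not the study's model or code: the function `toy_risk`, its weights, the feature names (`clinical_n_stage`, `grade`, `medial_location`) and the reference patient are all hypothetical, chosen only to show how a prediction is split into per-feature contributions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Features outside a coalition are held at their baseline value.
    Cost grows as 2^n, so this is only practical for a handful of features.
    """
    names = list(x)
    n = len(names)
    phi = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude `name`
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without = {f: (x[f] if f in subset else baseline[f]) for f in names}
                with_f = dict(without)
                with_f[name] = x[name]
                total += weight * (predict(with_f) - predict(without))
        phi[name] = total
    return phi


def toy_risk(f):
    # Hypothetical score with one interaction term; NOT the published model.
    return (0.05
            + 0.10 * f["clinical_n_stage"]
            + 0.04 * f["grade"]
            + 0.03 * f["clinical_n_stage"] * f["medial_location"])


patient = {"clinical_n_stage": 2, "grade": 3, "medial_location": 1}
reference = {"clinical_n_stage": 0, "grade": 1, "medial_location": 0}

phi = shapley_values(toy_risk, patient, reference)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.3f}")
```

The contributions sum to the difference between the patient's score and the reference score, which is the property that lets a tool report a predicted probability alongside a per-feature breakdown. Practical SHAP implementations approximate these values rather than enumerating every coalition.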

Results

Six clinical features (clinical N stage, size, stage, classification, grade and location) were used to construct the final predictive model with SVM. The model achieved robust performance, with AUCs of 0.811 (0.790–0.843), 0.806 (0.760–0.857) and 0.864 (0.830–0.926) in the training, internal testing and temporal testing cohorts, respectively. High-risk patients had significantly worse disease-free survival (DFS; HR 2.776, 95% CI: 1.897–4.064, p < 0.001) and overall survival (OS; HR 1.962, 95% CI: 1.853–2.077, p < 0.001). An online prediction tool was established that allows users to input key clinical variables and obtain model-predicted probabilities along with SHAP-based explanations.
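As a rough illustration of how an AUC and an accompanying interval like those reported above can be computed, the sketch below uses the rank (Mann-Whitney) definition of AUC and a percentile bootstrap. The abstract does not state how its intervals were obtained, so the bootstrap is an assumption, and the labels and scores are synthetic rather than study data.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (resampling patients)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Synthetic cohort: about 20% positives, whose scores are shifted upward.
rng = random.Random(42)
labels = [1 if rng.random() < 0.2 else 0 for _ in range(200)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

point = auc(labels, scores)
low, high = bootstrap_ci(labels, scores)
print(f"AUC {point:.3f} ({low:.3f}-{high:.3f})")
```

The pairwise AUC is quadratic in cohort size, which is fine for a sketch; a rank-based formulation would be preferable for a cohort as large as the SEER set.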

Conclusion

This validated and explainable machine learning model offers a practical tool for early risk stratification, helping clinicians select appropriate baseline imaging and plan adjuvant treatment.


Source journal

Breast (Medicine: Obstetrics & Gynecology)

CiteScore: 8.70

Self-citation rate: 2.60%

Articles published: 165

Review time: 59 days

About the journal: The Breast is an international, multidisciplinary journal for researchers and clinicians, which focuses on translational and clinical research for the advancement of breast cancer prevention, diagnosis and treatment of all stages.