Explainable machine learning model for predicting internal mammary node metastasis in breast cancer: Multi-method development and cross-cohort validation
Yirong Xiang, Jian Tie, Siyuan Zhang, Chen Shi, Changkuo Guo, Yushuo Peng, Zhaoqing Fan, Weihu Wang
Breast, Volume 82, Article 104517. Published 2025-06-09. DOI: 10.1016/j.breast.2025.104517
https://www.sciencedirect.com/science/article/pii/S096097762500534X
Abstract
Background
This study developed an explainable machine learning model to predict baseline internal mammary lymph node metastasis (IMNM) in patients with breast cancer.
Materials and methods
This study included three cohorts: a derivation cohort (n = 1997) from Peking University Cancer Hospital, a temporal testing cohort (n = 633) from the same center, and a SEER cohort (n = 51,420). Multiple machine learning strategies were applied: Least Absolute Shrinkage and Selection Operator (LASSO), Boruta, backward stepwise regression, and best subset selection for feature selection; and logistic regression (LR), support vector machines (SVM), k-nearest neighbors (KNN), and extreme gradient boosting (XGBoost) for model construction. The best-performing model was validated in the internal and temporal testing cohorts. Shapley Additive Explanations (SHAP) analysis was conducted to improve interpretability.
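As an illustration of the LASSO idea named above, the following minimal sketch fits an L1-penalized logistic regression by proximal gradient descent and reports which features keep a nonzero coefficient. It is not the authors' pipeline: the synthetic data, the function names (`fit_l1_logistic`, `make_synthetic`), the penalty value and the step size are all assumptions chosen for demonstration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Shrink a value toward zero; this is the proximal step for an L1 penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, penalty, step=0.5, n_iter=500):
    """L1-penalized logistic regression via proximal gradient descent.

    The intercept is left unpenalized. Returns (weights, intercept).
    """
    n, p = len(X), len(X[0])
    weights = [0.0] * p
    intercept = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * p
        grad_b = 0.0
        for row, label in zip(X, y):
            linear = intercept + sum(w * x for w, x in zip(weights, row))
            err = sigmoid(linear) - label
            grad_b += err
            for j in range(p):
                grad_w[j] += err * row[j]
        intercept -= step * grad_b / n
        # Gradient step on the log-loss, then shrinkage from the L1 penalty.
        weights = [
            soft_threshold(w - step * g / n, step * penalty)
            for w, g in zip(weights, grad_w)
        ]
    return weights, intercept


def make_synthetic(n=400, seed=0):
    """Hypothetical data: 5 standardized features, only the first two informative."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(5)]
        logit = 2.0 * row[0] - 1.5 * row[1]
        y.append(1 if rng.random() < sigmoid(logit) else 0)
        X.append(row)
    return X, y


X, y = make_synthetic()
weights, intercept = fit_l1_logistic(X, y, penalty=0.1)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("selected feature indices:", selected)
```

With a sufficiently large penalty, uninformative features are typically shrunk to exactly zero, which is what makes the L1 penalty usable as a feature selector. In practice the penalty would be tuned, for example by cross-validation.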
Results
Six clinical features (clinical N stage, size, stage, classification, grade and location) were used to construct the final predictive model with SVM. The model achieved robust performance, with areas under the curve (AUCs) of 0·811 (0·790–0·843), 0·806 (0·760–0·857) and 0·864 (0·830–0·926) in the training, internal testing and temporal testing cohorts, respectively. High-risk patients had significantly worse disease-free survival (DFS; HR 2·776, 95% CI: 1·897–4·064, p < 0·001) and overall survival (OS; HR 1·962, 95% CI: 1·853–2·077, p < 0·001). An online prediction tool was established that allows users to input key clinical variables and obtain model-predicted probabilities along with SHAP-based explanations.
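The AUCs above are reported with intervals, but the abstract does not say how those intervals were computed. As one common approach, the sketch below computes an AUC from its rank interpretation and a percentile bootstrap confidence interval. The data are synthetic and the function names (`auc`, `bootstrap_auc_ci`) are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive case outscores a
    random negative case, with ties counted as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample cases with replacement,
    recompute the AUC each time, and take the empirical quantiles."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # resample is missing one class; draw again
        stats.append(auc(sample_labels, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic example: positives score higher on average than negatives.
rng = random.Random(42)
labels = [1 if i % 2 == 0 else 0 for i in range(100)]
scores = [label + rng.gauss(0.0, 1.0) for label in labels]

point = auc(labels, scores)
lower, upper = bootstrap_auc_ci(labels, scores)
print(f"AUC {point:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

The pairwise AUC is quadratic in the number of cases, which is acceptable for a sketch. A rank-based formula would be preferable for cohorts of the size described here.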
Conclusion
This validated and explainable machine learning model offers a practical tool for early risk stratification, aiding clinicians in appropriate baseline imaging selection and adjuvant treatment planning.
About the journal:
The Breast is an international, multidisciplinary journal for researchers and clinicians, which focuses on translational and clinical research for the advancement of breast cancer prevention, diagnosis and treatment of all stages.