Predicting Lymph Node Metastasis in Rectal Cancer: Development and Validation of a Machine Learning Model Using Clinical Data
Authors: Wei Hou, Chuangwei Li, Zhen Wang, Wanqin Wang, Shouhong Wan, Bingbing Zou
Journal: JMIR Medical Informatics, volume 13, e73765 (published 2025-09-23)
DOI: 10.2196/73765 (https://doi.org/10.2196/73765)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12456929/pdf/
Impact factor: 3.8; JCR: Q2 (Medical Informatics)
Citations: 0
Abstract
Background: Rectal cancer (RC) is a common malignant tumor, with lymph node metastasis (LNM) being a critical determinant of patient prognosis. Traditional diagnostic methods have limitations, necessitating the development of predictive models using clinical data.
Objective: This study aimed to construct and validate machine learning (ML) models to predict LNM risk in patients with RC based on clinical data.
Methods: Retrospective data from 2454 patients with RC (SEER [Surveillance, Epidemiology, and End Results] database) were split into training (n=1954) and internal validation (n=500) sets. An external cohort (n=500) was obtained from the First Affiliated Hospital of Anhui Medical University. Lymph node features identified via computed tomographic scans were integrated with clinicopathological data. Variables were selected using LASSO (Least Absolute Shrinkage and Selection Operator), followed by univariate and multivariate logistic regression. Eleven ML models (Logistic Regression, K-Nearest Neighbors, Extremely Randomized Trees, Naive Bayes, XGBoost [XGB], Light Gradient Boosting Machine, Multilayer Perceptron, Gradient Boosting, Support Vector Machine, Random Forest, and AdaBoost) were evaluated via area under the receiver operating characteristic curve (AUC), calibration curves, and decision curve analysis.
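As an illustration of the AUC metric named above, the following is a minimal Python sketch, not the authors' code. It assumes binary labels (1 = LNM present, 0 = absent) and one predicted risk per patient; the function name `auc` and the toy data are hypothetical. It uses the rank interpretation of AUC: the probability that a randomly chosen LNM-positive patient receives a higher predicted risk than a randomly chosen LNM-negative patient, with ties counted as one half.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive case ranked above negative case
            elif p == q:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 2 LNM-negative and 2 LNM-positive patients.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))  # 3 of 4 positive-negative pairs are ordered correctly -> 0.75
```

The pairwise form is quadratic in cohort size, which is acceptable for cohorts of a few thousand patients; a sort-based rank computation gives the same value more efficiently.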
Results: LNM prevalence was 26.9% (training), 27% (internal validation), and 81% (external validation). Independent LNM predictors included tumor grade, clinical T stage, N stage, tumor length, neural invasion, and total lymph nodes. Internal validation AUC ranged from 0.859 to 0.964; external validation AUC ranged from 0.735 to 0.838. In the internal validation set, Random Forest and Extremely Randomized Trees achieved the highest AUC (0.964, 95% CI 0.950-0.978), while XGB demonstrated superior cross-cohort stability (AUC 0.942, 95% CI 0.925-0.959). For external validation, Gradient Boosting had the highest AUC (0.838, 95% CI 0.801-0.875), followed by XGB (0.832, 95% CI 0.794-0.869). XGB showed minimal calibration error, with curves closest to the ideal diagonal, and yielded the highest net benefit in decision curve analysis across critical thresholds.
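The net benefit reported in decision curve analysis can be illustrated with a short Python sketch. This follows the standard definition of net benefit and is not code from the study; the function names and toy data are hypothetical. At a threshold probability p_t, patients with predicted risk at or above p_t are treated as positive, and net benefit = TP/n - (FP/n) × p_t/(1 - p_t), where n is the cohort size.

```python
def net_benefit(labels, scores, threshold):
    """Net benefit of acting on patients whose predicted risk is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient as LNM-positive."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy cohort: 2 LNM-positive and 3 LNM-negative patients.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.7, 0.2, 0.1]
for pt in (0.2, 0.5):
    print(pt, net_benefit(toy_labels, toy_scores, pt), net_benefit_treat_all(toy_labels, pt))
```

A decision curve plots these values over a range of thresholds. Because the treat-all reference depends on prevalence, it differs markedly between cohorts with 27% and 81% LNM prevalence, so curves are best compared within each cohort.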
Conclusions: This study successfully developed and validated 11 ML models to predict LNM risk in RC. The XGB model was optimal overall; 10 of the 11 models achieved an AUC >0.9 in internal validation, and 7 achieved an AUC >0.8 in external validation. The identified predictors of LNM can facilitate early diagnosis and personalized treatment, highlighting the potential of integrating computed tomographic scan data with clinicopathological findings to build effective predictive models.
About the journal:
JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry, and health informatics professionals.
Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope: it places more emphasis on applications for clinicians and health professionals than on consumers and citizens (the focus of JMIR), publishes even faster, and also accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.