Predicting Lymph Node Metastasis in Rectal Cancer: Development and Validation of a Machine Learning Model Using Clinical Data.

IF 3.8 | Zone 3 (Medicine) | Q2 Medical Informatics

JMIR Medical Informatics | Pub Date: 2025-09-23 | DOI: 10.2196/73765

Wei Hou, Chuangwei Li, Zhen Wang, Wanqin Wang, Shouhong Wan, Bingbing Zou

JMIR Medical Informatics, volume 13, e73765. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12456929/pdf/

Citations: 0

Abstract

Background: Rectal cancer (RC) is a common malignant tumor, with lymph node metastasis (LNM) being a critical determinant of patient prognosis. Traditional diagnostic methods have limitations, necessitating the development of predictive models using clinical data.

Objective: This study aimed to construct and validate machine learning (ML) models to predict LNM risk in patients with RC based on clinical data.

Methods: Retrospective data from 2454 patients with RC (SEER [Surveillance, Epidemiology, and End Results] database) were split into training (n=1954) and internal validation (n=500) sets. An external cohort (n=500) was obtained from the First Affiliated Hospital of Anhui Medical University. Lymph node features identified via computed tomographic scans were integrated with clinicopathological data. Variables were selected using LASSO (Least Absolute Shrinkage and Selection Operator), followed by univariate and multivariate logistic regression. Eleven ML models (Logistic Regression, K-Nearest Neighbors, Extremely Randomized Trees, Naive Bayes, XGBoost [XGB], Light Gradient Boosting Machine, Multilayer Perceptron, Gradient Boosting, Support Vector Machine, Random Forest, and AdaBoost) were evaluated via area under the receiver operating characteristic curve (AUC), calibration curves, and decision curve analysis.
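As an illustration of the AUC metric named in the Methods, the following is a minimal sketch, not the authors' code. It computes AUC as the probability that a randomly chosen LNM-positive case receives a higher predicted score than a randomly chosen LNM-negative case (the Mann-Whitney interpretation). The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Return the AUC via pairwise comparison of positives and negatives.

    A positive/negative pair counts 1 if the positive scores higher,
    0.5 if the scores tie, and 0 otherwise.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = LNM present, 0 = LNM absent.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.3f}")  # 14 of 15 pairs are ordered correctly
```

The pairwise loop is quadratic in cohort size, which is acceptable for a few hundred cases; a rank-based implementation would be preferable for larger cohorts.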

Results: LNM prevalence was 26.9% (training), 27% (internal validation), and 81% (external validation). Independent LNM predictors included tumor grade, clinical T stage, N stage, tumor length, neural invasion, and total lymph nodes. Internal validation AUC ranged from 0.859 to 0.964; external validation AUC ranged from 0.735 to 0.838. In the internal validation set, Random Forest and Extremely Randomized Trees achieved the highest AUC (0.964, 95% CI 0.950-0.978), while XGB demonstrated superior cross-cohort stability (AUC 0.942, 95% CI 0.925-0.959). For external validation, Gradient Boosting had the highest AUC (0.838, 95% CI 0.801-0.875), followed by XGB (0.832, 95% CI 0.794-0.869). XGB showed minimal calibration error with curves closest to the ideal diagonal and yielded the highest net benefit in decision curve analysis across critical thresholds.
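For readers unfamiliar with decision curve analysis, the sketch below shows the standard net benefit calculation at a threshold probability p_t: NB = TP/n - (FP/n) × p_t/(1 - p_t). It is an illustration under stated assumptions, not the authors' implementation. The function names, the toy data, and the threshold of 0.2 are hypothetical; only the prevalences (27% internal, 81% external) come from the Results. The "treat all" baseline depends only on prevalence, which is one reason net benefit curves can look very different across cohorts with such different LNM rates.

```python
from typing import Sequence


def net_benefit(labels: Sequence[int], probs: Sequence[float], threshold: float) -> float:
    """Net benefit of treating every case whose predicted risk is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def treat_all_net_benefit(prevalence: float, threshold: float) -> float:
    """Net benefit of the 'treat everyone' strategy at a given prevalence."""
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy predictions: 1 = LNM present, 0 = LNM absent.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_probs = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
toy_nb = net_benefit(toy_labels, toy_probs, threshold=0.5)

# 'Treat all' baselines at an assumed threshold of 0.2,
# using the prevalences reported in the Results.
internal_baseline = treat_all_net_benefit(0.27, 0.2)
external_baseline = treat_all_net_benefit(0.81, 0.2)

print(f"toy model net benefit at 0.5: {toy_nb:.4f}")
print(f"treat-all, internal cohort:   {internal_baseline:.4f}")
print(f"treat-all, external cohort:   {external_baseline:.4f}")
```

A model is useful at a given threshold only if its net benefit exceeds both the treat-all baseline and zero (the treat-none strategy).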

Conclusions: This study developed and validated 11 ML models to predict LNM risk in RC. The XGB model was optimal overall; 10 of the 11 models achieved an AUC >0.9 in internal validation, and 7 achieved an AUC >0.8 in external validation. The identified predictors of LNM can facilitate early diagnosis and personalized treatment, highlighting the potential of integrating computed tomographic scan data with clinicopathological findings to build effective predictive models.


Source journal

JMIR Medical Informatics (Medicine-Health Informatics)

CiteScore: 7.90

Self-citation rate: 3.10%

Articles published: 173

Review time: 12 weeks

About the journal: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It emphasizes applied, translational research and has a broad readership that includes clinicians, CIOs, engineers, and industry and health informatics professionals. It is published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175). JMIR Med Inform has a slightly different scope, placing more emphasis on applications for clinicians and health professionals than on consumers and citizens (the focus of JMIR). It publishes even faster and also accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.