Predicting Lymph Node Metastasis in Rectal Cancer: Development and Validation of a Machine Learning Model Using Clinical Data.

IF 3.8 · Zone 3 (Medicine) · JCR Q2 (Medical Informatics)
Wei Hou, Chuangwei Li, Zhen Wang, Wanqin Wang, Shouhong Wan, Bingbing Zou
JMIR Medical Informatics, volume 13, e73765. Published 2025-09-23. Journal Article. DOI: 10.2196/73765 (https://doi.org/10.2196/73765). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12456929/pdf/

Abstract

Background: Rectal cancer (RC) is a common malignant tumor, with lymph node metastasis (LNM) being a critical determinant of patient prognosis. Traditional diagnostic methods have limitations, necessitating the development of predictive models using clinical data.

Objective: This study aimed to construct and validate machine learning (ML) models to predict LNM risk in patients with RC based on clinical data.

Methods: Retrospective data from 2454 patients with RC (SEER [Surveillance, Epidemiology, and End Results] database) were split into training (n=1954) and internal validation (n=500) sets. An external cohort (n=500) was obtained from the First Affiliated Hospital of Anhui Medical University. Lymph node features identified via computed tomographic scans were integrated with clinicopathological data. Variables were selected using LASSO (Least Absolute Shrinkage and Selection Operator), followed by univariate and multivariate logistic regression. Eleven ML models (Logistic Regression, K-Nearest Neighbors, Extremely Randomized Trees, Naive Bayes, XGBoost [XGB], Light Gradient Boosting Machine, Multilayer Perceptron, Gradient Boosting, Support Vector Machine, Random Forest, and AdaBoost) were evaluated via area under the receiver operating characteristic curve (AUC), calibration curves, and decision curve analysis.
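As an illustration of the AUC metric named in the Methods, the following is a minimal sketch, not the authors' code. It computes AUC as the probability that a randomly chosen patient with LNM receives a higher predicted risk than a randomly chosen patient without LNM. The function name `roc_auc` and the toy labels and scores are hypothetical and do not come from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data (not from the study): 1 = LNM present, 0 = LNM absent.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.3f}")
```

The pairwise loop is quadratic in cohort size, which is acceptable for a few hundred patients; a rank-based formulation would be the usual choice for larger data.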

Results: LNM prevalence was 26.9% (training), 27% (internal validation), and 81% (external validation). Independent LNM predictors included tumor grade, clinical T stage, N stage, tumor length, neural invasion, and total lymph nodes. Internal validation AUC ranged from 0.859 to 0.964; external validation AUC ranged from 0.735 to 0.838. In the internal validation set, Random Forest and Extremely Randomized Trees achieved the highest AUC (0.964, 95% CI 0.950-0.978), while XGB demonstrated superior cross-cohort stability (AUC 0.942, 95% CI 0.925-0.959). For external validation, Gradient Boosting had the highest AUC (0.838, 95% CI 0.801-0.875), followed by XGB (0.832, 95% CI 0.794-0.869). XGB showed minimal calibration error with curves closest to the ideal diagonal and yielded the highest net benefit in decision curve analysis across critical thresholds.
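The net benefit reported from decision curve analysis can be made concrete with a short sketch. It uses the standard definition, net benefit = TP/n - (FP/n) x pt/(1 - pt), where pt is the risk threshold at which a patient would be treated. This is an illustrative assumption about how such a curve is computed, not the authors' implementation; the function names and toy data are hypothetical. Only the prevalence figures (27% internal, 81% external) are taken from the abstract.

```python
from typing import Sequence


def net_benefit(labels: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """Net benefit of treating every patient whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)  # weight of a false positive
    return tp / n - (fp / n) * odds


def treat_all_net_benefit(prevalence: float, threshold: float) -> float:
    """Reference strategy: treat everyone, regardless of predicted risk."""
    odds = threshold / (1.0 - threshold)
    return prevalence - (1.0 - prevalence) * odds


# Hypothetical toy cohort (not study data): 1 = LNM present, 0 = LNM absent.
toy_labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
toy_scores = [0.90, 0.70, 0.60, 0.20, 0.40, 0.10, 0.30, 0.05, 0.15, 0.25]
toy_nb = net_benefit(toy_labels, toy_scores, threshold=0.5)

# Treat-all baselines at a 20% threshold, using the prevalences in the abstract.
internal_baseline = treat_all_net_benefit(0.27, 0.20)
external_baseline = treat_all_net_benefit(0.81, 0.20)
print(toy_nb, internal_baseline, external_baseline)
```

A model is useful at a given threshold only if its net benefit exceeds both the treat-all baseline and zero (treat none). Because the external cohort has a much higher prevalence, its treat-all baseline is far higher than the internal one, so decision curves from the two cohorts are not directly comparable.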

Conclusions: This study successfully developed and validated 11 ML models to predict LNM risk in RC. Ten of the 11 models achieved an AUC >0.9 in internal validation and 7 achieved an AUC >0.8 in external validation; the XGB model was optimal overall. The identified predictors of LNM can facilitate early diagnosis and personalized treatment, highlighting the potential of integrating computed tomographic scan data with clinicopathological findings to build effective predictive models.

Source journal: JMIR Medical Informatics (Medicine, Health Informatics)
CiteScore: 7.90 · Self-citation rate: 3.10% · Articles published: 173 · Review time: 12 weeks
About the journal: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It focuses on applied, translational research and has a broad readership including clinicians, CIOs, engineers, industry, and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (placing more emphasis on applications for clinicians and health professionals than on consumers and citizens, who are the focus of JMIR), publishes even faster, and also accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.