Machine learning-based prediction of gastroesophageal junction cancer using electronic medical records.

IF 3.5 · Zone 4 (Medicine) · JCR Q2, MEDICINE, RESEARCH & EXPERIMENTAL

Clinical and Experimental Medicine, Pub Date: 2025-08-18, DOI: 10.1007/s10238-025-01835-4

Meng Qian, Ying Chen, Xiaofen Wu, Zhenxiang Wang, Ye Chen, Yan Zhang, Bo Li, Huihui Sun, Shuchang Xu

Clinical and Experimental Medicine, Volume 25(1), 295. Journal Article. Published 2025-08-18. DOI: https://doi.org/10.1007/s10238-025-01835-4. PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12358332/pdf/

Citations: 0

Abstract

Determining whether esophageal-related symptoms are caused by gastroesophageal junction cancer (GEJC) is challenging in clinical practice. This study aimed to develop and validate a tool that predicts the likelihood of GEJC in patients with esophageal-related symptoms. The electronic medical record system was searched to identify patients diagnosed with GEJC or gastroesophageal reflux disease (GERD) at our hospital between 2009 and 2023. Predictive variables included demographic characteristics, symptoms, and laboratory results. After propensity score matching, significant features of GEJC were screened using the least absolute shrinkage and selection operator (LASSO), Boruta, and logistic regression analysis. Patients were randomly divided into training and test cohorts in a 2:1 ratio. Four machine learning models were trained and validated to predict GEJC. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), residual analysis, calibration curves, and the Brier score. Additionally, SHapley Additive exPlanations (SHAP) analysis was used to explain the importance of individual features. After matching, 401 GEJC patients were enrolled and compared with 401 GERD controls. Using the variables identified by LASSO, Boruta, and logistic regression analysis, we constructed four machine learning models: random forest, generalized linear model, extreme gradient boosting (XGBoost), and support vector machine. XGBoost outperformed the other models, with an AUC of 0.907 in the test cohort. The calibration curve of the XGBoost model also showed good agreement between predicted and observed risk, with a Brier score of 0.088. Body mass index, hemoglobin, age, reflux, and dysphagia were the features with a significant influence on the model output. We developed a well-performing model for predicting GEJC from electronic medical records. Implementing this prediction tool in clinical practice may help guide diagnostic strategies and support appropriate interventions.
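The 2:1 random split and the two headline metrics in the abstract, AUC and Brier score, can be illustrated with a small Python sketch. This is not the authors' code: the function names, the toy labels and probabilities, the random seed, and the floor rounding used for the split are all assumptions made purely for illustration.

```python
import random


def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and outcome (0 is perfect)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC as the chance that a random case is scored above a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def split_2_to_1(n, seed=0):
    """Shuffle indices 0..n-1 and split them 2:1 into training and test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = (2 * n) // 3  # floor rounding is an assumption, not taken from the paper
    return idx[:cut], idx[cut:]


# Hypothetical toy data: 1 = GEJC, 0 = GERD; probabilities are made up.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]

auc = roc_auc(y_true, y_prob)
brier = brier_score(y_true, y_prob)

# 401 matched cases + 401 matched controls = 802 patients to split.
train_idx, test_idx = split_2_to_1(401 + 401)

print(f"AUC = {auc:.3f}, Brier = {brier:.3f}")
print(f"train = {len(train_idx)}, test = {len(test_idx)}")
```

AUC measures only how well the model ranks cases above controls, while the Brier score also penalizes probabilities that are poorly calibrated. That is why the abstract reports both alongside the calibration curve.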




Source journal

Clinical and Experimental Medicine (Medicine: Research & Experimental)

CiteScore: 4.80

Self-citation rate: 2.20%

Articles published: 159

Review time: 2.5 months

About the journal: Clinical and Experimental Medicine (CEM) is a multidisciplinary journal that aims to be a forum of scientific excellence and information exchange in relation to the basic and clinical features of the following fields: hematology, onco-hematology, oncology, virology, immunology, and rheumatology. The journal publishes reviews and editorials, experimental and preclinical studies, translational research, prospectively designed clinical trials, and epidemiological studies. Papers containing new clinical or experimental data that are likely to contribute to changes in clinical practice or the way in which a disease is thought about will be given priority due to their immediate importance. Case reports will be accepted on an exceptional basis only, and their submission is discouraged. The major criteria for publication are clarity, scientific soundness, and advances in knowledge. In compliance with the overwhelmingly prevailing request by the international scientific community, and with respect for eco-compatibility issues, CEM is now published exclusively online.