Data-driven prediction of prolonged air leak after video-assisted thoracoscopic surgery for lung cancer: Development and validation of machine-learning-based models using real-world data through the ePath system

Impact factor: 2.6, JCR Q2, Health Policy & Services

Learning Health Systems. Pub Date: 2024-10-11. DOI: 10.1002/lrh2.10469

Saori Tou, Koutarou Matsumoto, Asato Hashinokuchi, Fumihiko Kinoshita, Hideki Nakaguma, Yukio Kozuma, Rui Sugeta, Yasunobu Nohara, Takanori Yamashita, Yoshifumi Wakata, Tomoyoshi Takenaka, Kazunori Iwatani, Hidehisa Soejima, Tomoharu Yoshizumi, Naoki Nakashima, Masahiro Kamouchi

Volume 9, Issue 2. Full text: https://onlinelibrary.wiley.com/doi/10.1002/lrh2.10469


Abstract

Introduction

The reliability of data-driven predictions in real-world scenarios remains uncertain. This study aimed to develop and validate a machine-learning-based model for predicting clinical outcomes using real-world data from an electronic clinical pathway (ePath) system.

Methods

All available data were collected from patients with lung cancer who underwent video-assisted thoracoscopic surgery at two independent hospitals utilizing the ePath system. The primary clinical outcome of interest was prolonged air leak (PAL), defined as drainage removal more than 2 days post-surgery. Data-driven prediction models were developed in a cohort of 314 patients from a university hospital applying sparse linear regression models (least absolute shrinkage and selection operator, ridge, and elastic net) and decision tree ensemble models (random forest and extreme gradient boosting). Model performance was then validated in a cohort of 154 patients from a tertiary hospital using the area under the receiver operating characteristic curve (AUROC) and calibration plots.
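The two validation measures named above, AUROC and calibration, can be illustrated with a minimal standard-library sketch. This is not the authors' code: the labels and predicted risks are made-up values, and the function names `auroc` and `calibration_table` are hypothetical. AUROC is computed here as the fraction of positive/negative pairs in which the positive case receives the higher score, and the calibration table lists the numbers that a calibration plot would draw.

```python
from typing import List, Sequence, Tuple


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUROC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def calibration_table(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 2
) -> List[Tuple[float, float, int]]:
    """Group predictions into equal-width probability bins and return
    (mean predicted risk, observed event rate, count) per non-empty bin."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            obs_rate = sum(y for y, _ in members) / len(members)
            rows.append((mean_pred, obs_rate, len(members)))
    return rows


# Made-up labels (1 = PAL) and predicted risks, for illustration only.
y = [0, 0, 1, 0, 1, 1, 0, 1]
risk = [0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.30, 0.90]

print("AUROC:", auroc(y, risk))
for mean_pred, obs_rate, count in calibration_table(y, risk, n_bins=2):
    print(f"mean predicted={mean_pred:.2f} observed={obs_rate:.2f} n={count}")
```

A well-calibrated model shows mean predicted risk close to the observed event rate in every bin. In practice a library routine would normally replace the quadratic pairwise loop.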

Results

To mitigate bias, variables whose missingness was related to PAL and variables with high rates of missing data were excluded from the dataset. Fivefold cross-validation indicated improved AUROCs when key variables were used, even after imputation of missing data. Dichotomizing continuous variables enhanced performance, particularly when fewer variables were employed in the decision tree ensemble models. Consequently, regression models incorporating seven key variables in complete case analysis demonstrated superior discriminatory ability for both internal (AUROCs: 0.77–0.84) and external cohorts (AUROCs: 0.75–0.84). These models exhibited satisfactory calibration in both cohorts.
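The preprocessing steps mentioned above (complete case analysis, dichotomizing a continuous variable, and fivefold cross-validation) can be sketched as follows. This is only an illustration under stated assumptions. The variable names `age` and `fev1_pct`, the cutoff of 65, and the records themselves are hypothetical and are not taken from the paper, whose seven key variables are not listed in the abstract.

```python
import random
from typing import Any, Dict, List, Sequence


def complete_cases(rows: List[Dict[str, Any]], keys: Sequence[str]) -> List[Dict[str, Any]]:
    """Keep only records with no missing value (None) in the chosen variables."""
    return [r for r in rows if all(r.get(k) is not None for k in keys)]


def dichotomize(value: float, cutoff: float) -> int:
    """Turn a continuous value into a 0/1 indicator at the given cutoff."""
    return 1 if value >= cutoff else 0


def k_fold_indices(n: int, k: int = 5, seed: int = 0) -> List[List[int]]:
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Hypothetical records: names, values, and the cutoff are assumptions.
records = [{"age": 60 + i, "fev1_pct": 60.0 + 3 * i, "pal": i % 2} for i in range(10)]
records.append({"age": 68, "fev1_pct": None, "pal": 1})  # missing predictor

# Complete case analysis: drop the record with a missing predictor.
complete = complete_cases(records, ["age", "fev1_pct"])

# Dichotomize a continuous variable at an assumed cutoff.
for r in complete:
    r["age_ge_65"] = dichotomize(r["age"], 65)

# Fivefold cross-validation: each fold serves once as the held-out set.
folds = k_fold_indices(len(complete), k=5, seed=0)
for fold_no, test_idx in enumerate(folds, start=1):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(complete)) if i not in held_out]
    print(f"fold {fold_no}: train={len(train_idx)} test={len(test_idx)}")
```

A model would be fitted on each training split and scored on the matching held-out fold. Averaging the five held-out AUROCs gives the kind of internal estimate reported above.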

Conclusions

The data-driven prediction model implemented with the ePath system exhibited adequate performance in predicting PAL after video-assisted thoracoscopic surgery once variables were optimized and population characteristics were considered in a real-world setting.


