Machine learning computational model to predict lung cancer using electronic medical records

IF 2.4, Zone 3 (Medicine), JCR Q3 (Oncology)
Matanel Levi, Teddy Lazebnik, Shiri Kushnir, Noga Yosef, Dekel Shlomi
Cancer Epidemiology, volume 92, Article 102631. Published 2024-07-24. Journal Article.
DOI: 10.1016/j.canep.2024.102631
https://www.sciencedirect.com/science/article/pii/S1877782124001103
Citations: 0

Abstract

Background

Lung cancer (LC) screening using low-dose computed tomography (CT) is recommended according to standard risk criteria or personalized risk calculators. Machine learning (ML) models that predict disease risk are an emerging method in medicine for identifying hidden, patient-specific associations.

Materials and methods

Using the tree-based pipeline optimization tool (TPOT), we developed an ML-based model, an ensemble of Random Forest and XGBoost models, based on known risk factors for LC, as part of a larger trial of ML prediction using electronic medical records and chest CT. We used data from patients aged ≥ 35 years, comparing patients with LC with controls at a 1:2 ratio. We developed models for the entire population and separately for patients with and without a smoking background. We included age, gender, body mass index (BMI), smoking history, socioeconomic status (SES), history of chronic obstructive pulmonary disease (COPD)/emphysema/chronic bronchitis (CB), interstitial lung disease (ILD)/pulmonary fibrosis (PF), and family history of LC.
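The abstract does not say how the Random Forest and XGBoost outputs are combined. One common option is soft voting, in which each model's predicted probability is averaged per patient and then thresholded. The sketch below illustrates that idea only; the function name `soft_vote`, the 0.5 threshold, and the probability values are assumptions made for illustration, not details from the study.

```python
from statistics import fmean


def soft_vote(prob_sets, threshold=0.5):
    """Average per-patient probabilities across models, then threshold.

    prob_sets: list of lists, one list of predicted probabilities per model,
    all scored on the same patients in the same order.
    Returns (averaged probabilities, predicted labels as 0/1).
    """
    if not prob_sets:
        raise ValueError("need at least one model's probabilities")
    n_patients = len(prob_sets[0])
    if any(len(probs) != n_patients for probs in prob_sets):
        raise ValueError("all models must score the same patients")
    # zip(*prob_sets) groups the models' probabilities patient by patient.
    averaged = [fmean(patient_probs) for patient_probs in zip(*prob_sets)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical predicted LC probabilities for four patients (not study data).
rf_probs = [0.80, 0.30, 0.55, 0.10]   # stand-in for Random Forest output
xgb_probs = [0.70, 0.40, 0.35, 0.20]  # stand-in for XGBoost output
avg_probs, predicted = soft_vote([rf_probs, xgb_probs])
```

Averaging probabilities rather than hard labels lets a confident model outweigh an uncertain one, which is one reason soft voting is often preferred for two-model ensembles.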

Results

Of the 4076 patients, 1428 (35 %) were in the LC group and 2648 (65 %) were in the control group. For the entire study population, our model achieved an accuracy of 71.2 %, with a sensitivity of 69 % and a positive predictive value (PPV) of 74 %. Accuracy was higher in both subgroups: 74.8 % (sensitivity 72 %, PPV 76 %) in the smoking cohort and 73.0 % (sensitivity 76 %, PPV 72 %) in the never-smoking cohort. For the entire population and the smoking cohort, COPD/emphysema/CB was the most important contributor, followed by BMI and age, while in the never-smoking cohort, BMI, age and SES were the most important contributors.
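For readers less familiar with the three reported metrics, the following minimal sketch shows their standard definitions in terms of confusion-matrix counts. The counts used here are hypothetical round numbers chosen for illustration; they are not the study's confusion matrix, which the abstract does not report.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, and PPV from confusion-matrix counts.

    tp: LC patients predicted as LC       fn: LC patients predicted as control
    fp: controls predicted as LC          tn: controls predicted as control
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),   # share of true LC cases that are detected
        "ppv": tp / (tp + fp),           # share of LC predictions that are true LC
    }


# Hypothetical counts for illustration only (not taken from the paper).
metrics = classification_metrics(tp=70, fp=30, tn=70, fn=30)
```

Note that PPV, unlike sensitivity, depends on the proportion of cases in the evaluated sample, so a PPV measured in a case-control sample does not carry over directly to a population with a different LC prevalence.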

Conclusion

ML models built on known risk factors for LC predicted LC with modest accuracy. Further studies are needed to confirm these results in new patients and to improve model performance.

Source journal
Cancer Epidemiology (Medicine: Oncology)
CiteScore: 4.50
Self-citation rate: 3.80%
Articles published: 200
Review time: 39 days
Journal description: Cancer Epidemiology is dedicated to increasing understanding about cancer causes, prevention and control. The scope of the journal embraces all aspects of cancer epidemiology, including descriptive epidemiology; studies of risk factors for disease initiation, development and prognosis; screening and early detection; prevention and control; and methodological issues. The journal publishes original research articles (full length and short reports), systematic reviews and meta-analyses, editorials, commentaries and letters to the editor commenting on previously published research.