Machine learning in epidemiology: An introduction, comparison with traditional methods, and a case study of predicting extreme longevity.

Impact factor: 3 | Zone 3, Medicine | JCR Q1: Public, Environmental & Occupational Health
Dor Atias, Saar Ashri, Uri Goldbourt, Yael Benyamini, Ran Gilad-Bachrach, Tal Hasin, Yariv Gerber, Uri Obolski
Annals of Epidemiology, pp. 23-33. Journal Article. Published 2025-07-21. DOI: 10.1016/j.annepidem.2025.07.024 (https://doi.org/10.1016/j.annepidem.2025.07.024)

Abstract

Background: Healthcare data volume continues to expand, presenting both challenges and opportunities. Traditional statistical methods applied in epidemiology, such as logistic regression (LR), are widely used but have limited ability to handle the complexity and high dimensionality of modern datasets. In contrast, machine learning (ML) methods can model complex, non-linear relationships and are less constrained by parametric assumptions, which makes them well suited to uncovering hidden patterns.

Methods: We introduce ML applications for epidemiologic research and explore three predictive models: LR as a traditional modeling approach, and least absolute shrinkage and selection operator (LASSO) regression and eXtreme Gradient Boosting (XGBoost) as ML approaches. We demonstrate how ML approaches, particularly XGBoost, can benefit epidemiologic research through a real-world case study. We present the common steps: data preprocessing, model creation, and model evaluation. Additionally, we address the "black box" nature of ML models and present post hoc explanation tools to enhance interpretability.
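
To make the contrast between plain LR and LASSO concrete, the following is a minimal sketch, not the authors' implementation. It fits a logistic model by proximal gradient descent with and without an L1 penalty. Everything in it is an assumption made for illustration: the data are synthetic (five standardized predictors, of which only the first two affect the outcome), the function names `simulate` and `fit_logistic` are hypothetical, and the penalty strength 0.1 is arbitrary. XGBoost is not covered here.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate(n, rng):
    """Synthetic data (an assumption): 5 predictors, only the first two matter."""
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(5)]
        p = sigmoid(1.5 * row[0] - 1.5 * row[1])
        X.append(row)
        y.append(1 if rng.random() < p else 0)
    return X, y


def fit_logistic(X, y, l1=0.0, step=0.5, iters=300):
    """Minimize mean log-loss + l1 * sum(|w|) by proximal gradient descent.

    With l1=0.0 this is ordinary (unpenalized) logistic regression.
    The intercept b is never penalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - label
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * row[j] / n
        b -= step * grad_b
        for j in range(d):
            v = w[j] - step * grad_w[j]
            # Soft-thresholding: the proximal step for the L1 penalty.
            w[j] = math.copysign(max(abs(v) - step * l1, 0.0), v)
    return w, b


rng = random.Random(0)
X, y = simulate(500, rng)
w_plain, _ = fit_logistic(X, y, l1=0.0)
w_lasso, _ = fit_logistic(X, y, l1=0.1)
print("LR coefficients:   ", [round(w, 3) for w in w_plain])
print("LASSO coefficients:", [round(w, 3) for w in w_lasso])
```

In this sketch the L1 penalty shrinks all coefficients and can set those of uninformative predictors exactly to zero, which is the "selection" part of LASSO. In practice the penalty strength would be tuned, for example by cross-validation on the training set.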

Results: We examined the prediction of near-centenarianism (reaching an age of 95 years or older) using midlife predictors (i.e., demographic, clinical, lifestyle, occupational, and dietary variables) in a cohort of approximately 10,000 middle-aged working men recruited in 1963 and followed until death or until 2019. Models were fitted and calibrated on a training set and showed good predictive performance on a separate test set. XGBoost, LASSO regression, and LR achieved ROC-AUC values of 0.72 (95% CI: 0.66-0.75), 0.71 (95% CI: 0.67-0.74), and 0.69 (95% CI: 0.66-0.73), respectively. Explainability analysis identified key predictors of longevity, including systolic blood pressure, smoking status, and a history of myocardial infarction, consistent with prior studies.
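
For readers unfamiliar with how a figure such as "ROC-AUC 0.72 (95% CI: 0.66-0.75)" can be produced, here is a minimal sketch. The abstract does not state how the intervals were computed; the percentile bootstrap used below is an assumption, one common choice among several. The function names `roc_auc` and `bootstrap_ci` are hypothetical, and the scores are synthetic rather than taken from the study.

```python
import random


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count 1/2)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, scores, n_boot=300, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ROC-AUC (an assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic test-set scores: positives tend to score higher than negatives.
rng = random.Random(1)
y_true = [1] * 60 + [0] * 140
scores = [rng.gauss(1.0 if t else 0.0, 1.0) for t in y_true]

auc = roc_auc(y_true, scores)
lo, hi = bootstrap_ci(y_true, scores)
print(f"ROC-AUC {auc:.2f} (95% CI: {lo:.2f}-{hi:.2f})")
```

The pairwise definition is quadratic in sample size, which is acceptable for a sketch; a rank-based formula gives the same value more efficiently on large test sets. Overlapping intervals, as reported for the three models above, mean that the differences between models should be read with caution.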

Conclusions: Our findings highlight the potential of ML to enhance epidemiological studies by handling complex interactions and high-dimensional data, suggesting a complementary approach to traditional methods.

Source journal
Annals of Epidemiology (Medicine: Public, Environmental & Occupational Health)
CiteScore: 7.40
Self-citation rate: 1.80%
Articles published: 207
Review time: 59 days
Journal description: The journal emphasizes the application of epidemiologic methods to issues that affect the distribution and determinants of human illness in diverse contexts. Its primary focus is on chronic and acute conditions of diverse etiologies and of major importance to clinical medicine, public health, and health care delivery.