Dor Atias, Saar Ashri, Uri Goldbourt, Yael Benyamini, Ran Gilad-Bachrach, Tal Hasin, Yariv Gerber, Uri Obolski
Title: Machine learning in epidemiology: An introduction, comparison with traditional methods, and a case study of predicting extreme longevity
Journal: Annals of Epidemiology, pp. 23-33 (Journal Article; impact factor 3.0; JCR Q1, Public, Environmental & Occupational Health)
Published: 2025-07-21
DOI: https://doi.org/10.1016/j.annepidem.2025.07.024
Citations: 0
Abstract
Background: The volume of healthcare data is steadily expanding, presenting both challenges and opportunities. Traditional statistical methods applied in epidemiology, such as logistic regression (LR), are widely used but have limited ability to handle the complexity and high dimensionality of modern datasets. In contrast, machine learning (ML) methods can model complex, non-linear relationships and are less constrained by parametric assumptions, which makes them well suited to uncovering hidden patterns.
Methods: In this study, we introduce ML applications for epidemiologic research and explore three predictive models: LR as a traditional modeling approach, and least absolute shrinkage and selection operator (LASSO) regression and eXtreme Gradient Boosting (XGBoost) as ML approaches. We demonstrate how ML approaches, particularly XGBoost, can benefit epidemiologic research through a real-world case study. We present the common steps of an ML workflow: data preprocessing, model creation, and model evaluation. Additionally, we address the "black box" nature of ML models and present post hoc explanation tools to enhance interpretability.
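To make the contrast between plain LR and LASSO regression concrete, the following is a minimal sketch, not the authors' implementation. It fits a logistic model by proximal gradient descent, where setting the hypothetical parameter `l1` to zero gives unpenalized LR and a positive `l1` gives a LASSO-style fit that can shrink coefficients exactly to zero. The data generator `make_data`, the step size, and the penalty strength are all illustrative assumptions; XGBoost is omitted because it requires a third-party library.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_logistic(X, y, l1=0.0, lr=0.5, n_iter=500):
    """Logistic regression by proximal gradient descent.

    l1 == 0.0 gives plain LR; l1 > 0 gives a LASSO-style fit.
    The intercept b is not penalized.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        # Gradient step on the log-loss, then soft-thresholding for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
    return w, b


def make_data(n=200, seed=0):
    # Hypothetical synthetic data: 2 informative features, 3 pure-noise features.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(5)]
        prob = sigmoid(-1.0 + 1.5 * x[0] - 1.0 * x[1])
        X.append(x)
        y.append(1 if rng.random() < prob else 0)
    return X, y


X, y = make_data()
w_lr, b_lr = fit_logistic(X, y, l1=0.0)
w_lasso, b_lasso = fit_logistic(X, y, l1=0.1)
print("LR weights:   ", [round(v, 3) for v in w_lr])
print("LASSO weights:", [round(v, 3) for v in w_lasso])
```

In practice the penalty strength would be tuned, for example by cross-validation on the training set, rather than fixed as it is here.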
Results: We examined the prediction of near-centenarianism (reaching an age of 95 years or older) using midlife predictors (i.e., demographic, clinical, lifestyle, occupational, and dietary variables) in a cohort of approximately 10,000 middle-aged working men recruited in 1963 and followed until death or until 2019. Models were fitted and calibrated on a training set and showed good predictive performance on a separate test set. XGBoost, LASSO regression, and LR achieved ROC-AUC values of 0.72 (95% CI: 0.66-0.75), 0.71 (95% CI: 0.67-0.74), and 0.69 (95% CI: 0.66-0.73), respectively. Explainability analysis identified key predictors of longevity, including systolic blood pressure, smoking status, and a history of myocardial infarction, consistent with prior studies.
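The ROC-AUC values and confidence intervals reported above can be illustrated with a short sketch. The abstract does not state how the intervals were computed, so the percentile bootstrap used here is an assumption, as are the function names and the synthetic demo scores. ROC-AUC itself is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
import random


def roc_auc(y_true, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for yi, s in zip(y_true, scores) if yi == 1]
    neg = [s for yi, s in zip(y_true, scores) if yi == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ROC-AUC (an assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lo = aucs[int((alpha / 2) * n_boot)]
    hi = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical test-set labels and model scores, for illustration only.
rng = random.Random(42)
y_demo = [1 if rng.random() < 0.3 else 0 for _ in range(100)]
s_demo = [rng.gauss(0.5 * yi, 1.0) for yi in y_demo]
print("AUC:", round(roc_auc(y_demo, s_demo), 3))
print("95% CI:", bootstrap_auc_ci(y_demo, s_demo))
```

The overlapping intervals reported for the three models are the reason to report a CI alongside each point estimate: a difference of 0.01 to 0.03 in ROC-AUC may not be distinguishable from sampling variability.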
Conclusions: Our findings highlight the potential of ML to enhance epidemiological studies by handling complex interactions and high-dimensional data, suggesting that it can complement traditional methods.
About the journal:
The journal emphasizes the application of epidemiologic methods to issues that affect the distribution and determinants of human illness in diverse contexts. Its primary focus is on chronic and acute conditions of diverse etiologies and of major importance to clinical medicine, public health, and health care delivery.