Application of machine learning algorithms in an epidemiologic study of mortality
Authors: George O. Agogo, Henry Mwambi
Journal: Annals of Epidemiology, volume 102, pages 36-47 (published 2025-02-01)
DOI: 10.1016/j.annepidem.2024.12.015
URL: https://www.sciencedirect.com/science/article/pii/S1047279724002874
Publication type: Journal Article
Impact factor: 3.3; JCR: Q1 (Public, Environmental & Occupational Health)
Citation count: 0
Abstract
Purpose
Epidemiologic studies are important for assessing risk factors for mortality. Machine learning (ML) can efficiently analyze multidimensional data to uncover dependencies between risk factors and health outcomes.
Methods
Using a representative sample from National Health and Nutrition Examination Survey data collected from 2009 to 2016 and linked to the National Death Index public-use mortality data through December 31, 2019, we applied logistic regression, random forests, k-nearest neighbors, multivariate adaptive regression splines, support vector machines, extreme gradient boosting, and super learner ML algorithms to study risk factors for all-cause mortality. We evaluated the algorithms using the area under the receiver operating characteristic curve (AUC-ROC), sensitivity, and negative predictive value (NPV), among other metrics, and interpreted the results using SHapley Additive exPlanations (SHAP).
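The three evaluation metrics named above can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy labels and scores, and the 0.5 classification threshold are all assumptions made for illustration, and the AUC-ROC is computed by direct pairwise comparison (the Mann-Whitney interpretation), which is practical only for small samples.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels coded 1 = died, 0 = survived."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of true deaths that the classifier flags as deaths."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def negative_predictive_value(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of predicted survivors who actually survived."""
    _, _, tn, fn = confusion_counts(y_true, y_pred)
    return tn / (tn + fn)


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 deaths and 5 survivors with predicted risk scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(sensitivity(y_true, y_pred))
print(negative_predictive_value(y_true, y_pred))
print(auc_roc(y_true, scores))
```

Sensitivity and NPV depend on the chosen threshold, whereas AUC-ROC summarizes ranking quality across all thresholds, which is one reason to report them together.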
Results
The AUC-ROC ranged from 0.80 to 0.87. The super learner had the highest AUC-ROC of 0.87 (95% CI, 0.86 to 0.88), sensitivity of 0.86 (95% CI, 0.84 to 0.88), and NPV of 0.98 (95% CI, 0.98 to 0.99). Key risk factors for mortality included advanced age, larger waist circumference, male sex, and systolic blood pressure. Being married, high annual household income, and high education level were linked with low risk of mortality.
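An NPV close to 1 is expected whenever the outcome is uncommon, because NPV depends on prevalence as well as on sensitivity and specificity. The sketch below applies the standard Bayes relation between these quantities. Only the sensitivity of 0.86 comes from the results above; the specificity and the prevalence values are hypothetical placeholders, not figures reported in the abstract, and the function name is an assumption.

```python
def npv_from_rates(sens: float, spec: float, prevalence: float) -> float:
    """NPV = P(no event | negative prediction), derived from Bayes' rule."""
    true_negatives = spec * (1.0 - prevalence)      # P(negative prediction and no event)
    false_negatives = (1.0 - sens) * prevalence     # P(negative prediction and event)
    return true_negatives / (true_negatives + false_negatives)


REPORTED_SENSITIVITY = 0.86   # super learner sensitivity from the results above
ASSUMED_SPECIFICITY = 0.75    # hypothetical, not reported in the abstract

# Hypothetical mortality prevalences, chosen only to show the trend.
for prevalence in (0.05, 0.10, 0.30):
    npv = npv_from_rates(REPORTED_SENSITIVITY, ASSUMED_SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.2f} -> NPV={npv:.3f}")
```

With sensitivity and specificity held fixed, NPV falls as prevalence rises, so an NPV estimated in one population should not be carried over to a population with a markedly different mortality rate.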
Conclusions
Machine learning can be used to identify risk factors of mortality, which is critical for individualized targeted interventions in epidemiologic studies.
About the journal:
The journal emphasizes the application of epidemiologic methods to issues that affect the distribution and determinants of human illness in diverse contexts. Its primary focus is on chronic and acute conditions of diverse etiologies and of major importance to clinical medicine, public health, and health care delivery.