Machine learning approaches for real-time ZIP code and county-level estimation of state-wide infectious disease hospitalizations using local health system data
Authors: Tanvir Ahammed, Md Sakhawat Hossain, Christopher McMahan, Lior Rennert
Journal: Epidemics, volume 51, Article 100823
Published: 2025-04-03
DOI: 10.1016/j.epidem.2025.100823
URL: https://www.sciencedirect.com/science/article/pii/S1755436525000118
Abstract
The lack of conventional methods for estimating real-time infectious disease burden in granular regions inhibits timely and efficient public health response. Comprehensive data sources (e.g., state health department data) typically needed for such estimation are often limited by (1) substantial delays in data reporting and (2) a lack of geographic granularity in the data provided to researchers. Leveraging real-time local health system data presents an opportunity to overcome these challenges. This study evaluates the effectiveness of machine learning and statistical approaches that use local health system data to estimate current and previous COVID-19 hospitalizations in South Carolina. Random Forest models achieved consistently higher average median percent agreement than generalized linear mixed models for current weekly hospitalizations across the 123 ZIP codes (72.29%, IQR: 63.20–75.62%) and 28 counties (76.43%, IQR: 70.33–81.16%) with sufficient health system coverage. To account for populations underrepresented in health systems, we combined Random Forest models with Classification and Regression Trees (CART) for imputation. The average median percent agreement was 61.02% (IQR: 51.17–72.29%) for all ZIP codes and 72.64% (IQR: 66.13–77.69%) for all counties. Median percent agreement for cumulative hospitalizations over the previous 6 months was 80.98% (IQR: 68.99–89.66%) for all ZIP codes and 81.17% (IQR: 68.55–91.33%) for all counties. These findings underscore the effectiveness of using real-time health system data to estimate infectious disease burden. Moreover, the methodologies developed in this study can be adapted to estimate hospitalizations for other diseases, offering public health officials a valuable tool for responding swiftly and effectively to various health crises.
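The abstract reports accuracy as an "average median percent agreement" with an IQR but does not define the metric. The sketch below shows one plausible reading, and every part of it is an assumption rather than the authors' method: percent agreement is taken as 100 × min(estimated, observed) / max(estimated, observed), the median is taken over weeks within each region, and the mean and IQR are then taken across regions. The function names and the toy counts are hypothetical and are not data from the study.

```python
from statistics import mean, median, quantiles


def percent_agreement(estimated: float, observed: float) -> float:
    """Assumed definition (not from the paper): 100 * min / max.

    Two zero counts are treated as perfect agreement.
    """
    if estimated < 0 or observed < 0:
        raise ValueError("hospitalization counts must be non-negative")
    larger = max(estimated, observed)
    if larger == 0:
        return 100.0
    return 100.0 * min(estimated, observed) / larger


def region_median_agreement(estimates, observations) -> float:
    """Median weekly percent agreement for a single region."""
    if len(estimates) != len(observations):
        raise ValueError("estimates and observations must cover the same weeks")
    return median(percent_agreement(e, o) for e, o in zip(estimates, observations))


def summarize(regions: dict) -> dict:
    """Mean and IQR of the per-region medians (needs at least two regions)."""
    medians = [region_median_agreement(est, obs) for est, obs in regions.values()]
    q1, _, q3 = quantiles(medians, n=4, method="inclusive")
    return {"average_median": mean(medians), "iqr": (q1, q3)}


# Hypothetical weekly counts: region -> (estimated, observed)
toy_regions = {
    "ZIP A": ([10, 8, 12], [10, 10, 12]),
    "ZIP B": ([5, 4, 6], [10, 8, 8]),
    "ZIP C": ([3, 9, 6], [4, 12, 8]),
}

if __name__ == "__main__":
    print(summarize(toy_regions))
```

Under this reading, a region-level figure such as 72.29% summarizes how closely weekly estimates track observed counts in a typical week, and the IQR describes how much that typical agreement varies from one ZIP code or county to another.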
About the journal:
Epidemics publishes papers on infectious disease dynamics in the broadest sense. Its scope covers both within-host dynamics of infectious agents and dynamics at the population level, particularly the interaction between the two. Areas of emphasis include: spread, transmission, persistence, implications and population dynamics of infectious diseases; population and public health as well as policy aspects of control and prevention; dynamics at the individual level; interaction with the environment, ecology and evolution of infectious diseases, as well as population genetics of infectious agents.