Leveraging machine learning methods to predict COVID-19 vulnerability in U.S. counties based on socioeconomic factors

STEM Fellowship Journal, Pub Date: 2022-03-31, DOI: 10.17975/sfj-2022-004

Katharine Emily Lee, Cynthia Denise Lo, William Ren Xu, Robert Ye, C. Tomkins-Lane



Abstract

As COVID-19 gained pandemic status, the number of confirmed cases in the US surpassed that of all other countries. Although the virus spread throughout the US, not all areas were affected equally. This retrospective study explores these inequalities through pre-pandemic socioeconomic characteristics by building a predictive model of COVID-19 vulnerability at the county level. A total of 103 socioeconomic features for 2610 US counties (out of a total of 3007) were sourced from online databases including the US Census Bureau, the US Department of Agriculture, and the Association of American Medical Colleges. To quantify each county’s COVID-19 vulnerability, we defined 3 custom measures: incidence, mortality, and case fatality. These measures were calculated from case and death data taken 29 days after each county’s first case. Machine learning classification algorithms (including random forest, a multi-layer perceptron neural network, and XGBoost) were then used to predict the incidence, mortality, and case fatality of US counties. Using solely pre-pandemic socioeconomic factors, we were able to predict a county’s COVID-19 incidence with ~47% accuracy, mortality with ~59% accuracy, and case fatality with ~61% accuracy. For each vulnerability measure, a list of important features was extracted using a built-in XGBoost function. Many of these features are typically associated with pandemic spread (e.g., population density and medical infrastructure), while others were unexpected (e.g., education) and warrant further study to identify their role in disease propagation. Furthermore, the difficulties our model experienced support the notion that region-specific policies play an important role in successfully mitigating this crisis. The moderate success achieved in this study suggests that using classifiers as a pandemic preparedness evaluation tool is feasible.
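The abstract names three custom vulnerability measures but does not give their formulas. The sketch below shows one plausible reading, and it is an assumption rather than the authors' definition: incidence as cases per resident, mortality as deaths per resident, and case fatality as deaths per case, each taken from cumulative counts 29 days after the county's first case. The `CountySeries` container, the function name, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict


@dataclass
class CountySeries:
    """Hypothetical container for one county's cumulative counts."""
    population: int
    cumulative_cases: Dict[date, int]   # date -> cumulative confirmed cases
    cumulative_deaths: Dict[date, int]  # date -> cumulative deaths


def vulnerability_measures(county: CountySeries,
                           days_after_first_case: int = 29) -> Dict[str, float]:
    """Compute assumed incidence, mortality, and case fatality at a snapshot.

    The formulas are assumptions; the abstract only states that the measures
    use case and death data taken 29 days after the first case.
    """
    # The first case is the earliest date with a non-zero cumulative count.
    first_case = min(d for d, n in county.cumulative_cases.items() if n > 0)
    snapshot = first_case + timedelta(days=days_after_first_case)

    cases = county.cumulative_cases[snapshot]
    deaths = county.cumulative_deaths.get(snapshot, 0)

    return {
        "incidence": cases / county.population,
        "mortality": deaths / county.population,
        # Guard against division by zero if the counts were ever revised down.
        "case_fatality": deaths / cases if cases else 0.0,
    }


# Synthetic county: first case on day 2, then 10 new cases per day;
# deaths begin accumulating after day 25.
start = date(2020, 3, 1)
example = CountySeries(
    population=100_000,
    cumulative_cases={start + timedelta(days=i): 0 if i < 2 else 10 * (i - 1)
                      for i in range(40)},
    cumulative_deaths={start + timedelta(days=i): max(0, i - 25)
                       for i in range(40)},
)

measures = vulnerability_measures(example)
print(measures)
```

Because the study frames prediction as classification, continuous values like these would still need to be binned into classes before training; the abstract does not say how that binning was done.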
