Mining Social Media Data to Predict COVID-19 Case Counts.

IEEE International Conference on Healthcare Informatics. Pub Date: 2022-06-01. Epub Date: 2022-09-08. DOI: 10.1109/ichi54592.2022.00027

Maksims Kazijevs, Furkan A Akyelken, Manar D Samad

Pages: 104-111. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9490453/pdf/nihms-1836082.pdf

Citations: 1

Abstract

The unpredictability and unknowns surrounding the ongoing coronavirus disease (COVID-19) pandemic have had unprecedented consequences, taking a heavy toll on the lives and economies of all countries. There have been efforts to predict COVID-19 case counts (CCC) using epidemiological data and numerical tokens online, which may allow early preventive measures to slow the spread of the disease. In this paper, we use state-of-the-art natural language processing (NLP) algorithms to numerically encode COVID-19-related tweets originating from eight cities in the United States and predict city-specific CCC up to eight days into the future. A city-embedding is proposed to obtain a time series representation of the daily tweets posted from a city, which is then used to predict case counts with a custom long short-term memory (LSTM) model. The universal sentence encoder yields the best normalized root mean squared error (NRMSE), 0.090 (0.039), averaged across all cities, in predicting CCC six days into the future. The R² scores in predicting CCC are more than 0.70 and often over 0.8, which suggests a strong correlation between the actual and model-predicted CCC values. Our analyses show that the NRMSE and R² scores are consistently robust across different cities and different numbers of time steps in the time series data. Results show that the LSTM model can learn the mapping between the NLP-encoded tweet semantics and the case counts, which implies that social media text can be directly mined to identify the future course of the pandemic.
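The pipeline described in the abstract can be illustrated with a small sketch. The abstract does not say how the city-embedding is computed or how the RMSE is normalized, so the code below makes two explicit assumptions: a day's tweet vectors are mean-pooled into one city vector, and the RMSE is divided by the range of the actual values. All function names are hypothetical, and the LSTM itself is left out; only the data preparation and the two reported metrics are shown.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Vector = List[float]


def city_embedding(tweet_vectors: Sequence[Vector]) -> Vector:
    """Pool one day's tweet vectors into one city vector.

    ASSUMPTION: mean pooling; the abstract does not specify the pooling rule.
    """
    if not tweet_vectors:
        raise ValueError("need at least one tweet vector for the day")
    n = len(tweet_vectors)
    return [sum(column) / n for column in zip(*tweet_vectors)]


def make_windows(
    daily_embeddings: Sequence[Vector],
    case_counts: Sequence[float],
    time_steps: int,
    horizon: int,
) -> List[Tuple[List[Vector], float]]:
    """Pair each window of `time_steps` daily embeddings with the case
    count observed `horizon` days after the last day in the window."""
    if len(daily_embeddings) != len(case_counts):
        raise ValueError("embeddings and case counts must cover the same days")
    samples = []
    last_start = len(daily_embeddings) - time_steps - horizon
    for start in range(last_start + 1):
        end = start + time_steps  # exclusive index of the window
        target = case_counts[end - 1 + horizon]
        samples.append((list(daily_embeddings[start:end]), target))
    return samples


def nrmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE divided by the range of the actual values.

    ASSUMPTION: range normalization; other conventions divide by the mean.
    """
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    rmse = sqrt(sum(squared) / len(squared))
    spread = max(actual) - min(actual)
    if spread == 0:
        raise ValueError("actual values have zero range")
    return rmse / spread


def r2_score(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

With `horizon=6`, each sample produced by `make_windows` would correspond to the six-days-ahead setting mentioned in the abstract; a sequence model would then be trained on the windows and scored with `nrmse` and `r2_score` on held-out days.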


