The prediction of new Covid-19 cases in Poland with machine learning models
Adam Chwila
Statistics in Transition, published 2023-03-15
DOI: 10.59170/stattrans-2023-020 (https://doi.org/10.59170/stattrans-2023-020)
Citations: 0
Abstract
The COVID-19 pandemic has had a huge impact both on the global economy and on
everyday life in countries all over the world. In this paper, we propose several
possible machine learning approaches to forecasting new confirmed COVID-19 cases,
including LASSO regression, Gradient Boosted (GB) regression trees, Support Vector
Regression (SVR), and a Long Short-Term Memory (LSTM) neural network. These methods
are applied in two variants: to data prepared for the whole of Poland and to data
prepared separately for each of the 16 voivodeships (NUTS 2 regions). All models
are trained in two variants: with 5-fold time-series
cross-validation and with a split into single train and test subsets. The
computations in the study used official statistics from government reports covering the
period from April 2020 to March 2022. We propose a set of 16 model-selection
scenarios to identify the model with the best ex-post prediction accuracy. The
scenarios differ in the following features: the machine learning model,
the hyperparameter selection method, and the data setup. The most accurate
scenario for the LASSO and SVR approaches is the single train/test
dataset split with data for the whole of Poland, while for the LSTM and GB trees it
is cross-validation with data for the whole of Poland. Among the best scenarios for each
model, the lowest ex-post RMSE is obtained for the SVR. For this best-performing
model, the predictions are interpreted using Shapley values, which make it
possible to present the impact of
auxiliary variables in the machine learning model on the predicted value.
Knowledge of the factors that have the strongest impact on the number of new
infections can help companies plan their economic activity during the turbulent times of
a pandemic. We propose to identify and compare the most important variables on
both the train and test datasets of the model.
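
The scenario setup described in the abstract can be illustrated with a minimal Python sketch. It is not the author's code. It assumes that the 16 scenarios arise as 4 models x 2 hyperparameter selection methods x 2 data setups (inferred from the counts given above), and that the 5-fold time-series cross-validation uses expanding training windows, which the abstract does not specify. The names `time_series_folds`, `rmse`, `MODELS`, `TUNING` and `DATA` are hypothetical.

```python
from itertools import product
from math import sqrt


def time_series_folds(n_samples, n_splits=5):
    """Return (train_idx, test_idx) pairs with expanding training windows.

    Assumption: every test block lies strictly after its training block,
    so no future observation is used to fit the model.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of folds")
    folds = []
    for k in range(n_splits):
        # The last n_splits * test_size observations are cut into test blocks.
        test_start = n_samples - (n_splits - k) * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        folds.append((train_idx, test_idx))
    return folds


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels for the three features that distinguish the scenarios.
MODELS = ["LASSO", "GB trees", "SVR", "LSTM"]
TUNING = ["5-fold time-series CV", "single train/test split"]
DATA = ["whole of Poland", "16 voivodeships"]

# Assumed factorization: 4 * 2 * 2 = 16 scenarios.
scenarios = list(product(MODELS, TUNING, DATA))
```

Under these assumptions, each scenario would be scored by its ex-post RMSE, and the scenario with the lowest value for each model would be kept for comparison.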
About the journal:
Statistics in Transition (SiT) is an international journal published jointly by the Polish Statistical Association (PTS) and the Central Statistical Office of Poland (CSO/GUS), which sponsors this publication. Launched in 1993, it was issued twice a year until 2006; since then it has appeared, under a slightly changed title, Statistics in Transition new series, three times a year, and after 2013 as a regular quarterly journal. The journal provides a forum for the exchange of ideas and experience among members of the international community of statisticians, data producers and users, including researchers, teachers, policy makers and the general public. Its initially dominant focus on statistical issues pertinent to the transition from a centrally planned to a market-oriented economy has gradually been extended to embrace statistical problems related to the development and modernization of the system of public (official) statistics in general.