An ensemble approach improves the prediction of the COVID-19 pandemic in South Korea.

IF 4.5 · Zone 3 (Medicine) · JCR Q1 · PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH

Journal of Global Health Pub Date : 2025-03-28 DOI:10.7189/jogh.15.04079

Kyulhee Han, Catherine Apio, Hanbyul Song, Bogyeom Lee, Xuwen Hu, Jiwon Park, Liu Zhe, Taewan Goo, Taesung Park

Journal of Global Health, volume 15, article 04079. Publication type: Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11949510/pdf/

Citations: 0

Abstract

Background: Modelling can contribute to disease prevention and control strategies. Accurate predictions of future cases and mortality rates were essential for establishing appropriate policies during the COVID-19 pandemic. However, no single model yielded definite conclusions, with each having specific strengths and weaknesses. Here we propose an ensemble learning approach which can offset the limitations of each model and improve prediction performances.

Methods: We generated predictions for the transmission and impact of COVID-19 in South Korea using seven individual models, including mathematical, statistical, and machine learning approaches. We integrated these predictions using three ensemble methods: stacking, average, and weighted average ensemble (WAE). We used training and test errors to measure each model's performance and selected the best covariate combinations based on the lowest training error. We then evaluated model performance using five error measures (r², weighted mean absolute percentage error (WMAPE), mean squared error (MSE), root mean squared error (RMSE), and mean absolute percentage error (MAPE)) and selected the optimal covariate combination accordingly. To validate the generalisability of our approach, we applied the same modelling framework to USA data.
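
The average and weighted average ensembles named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the WAE weights were derived, so the inverse-training-error weighting in `inverse_error_weights` is an assumption, and all function names and toy numbers are hypothetical.

```python
from typing import List, Sequence


def average_ensemble(preds: Sequence[Sequence[float]]) -> List[float]:
    """Simple average: mean of the model predictions for each day.

    preds[m][t] is the prediction of model m for day t.
    """
    n_models = len(preds)
    return [sum(day) / n_models for day in zip(*preds)]


def inverse_error_weights(train_errors: Sequence[float]) -> List[float]:
    """ASSUMED weighting scheme: weight proportional to 1 / training error,
    normalised so the weights sum to 1. The paper's scheme may differ."""
    inverse = [1.0 / e for e in train_errors]
    total = sum(inverse)
    return [w / total for w in inverse]


def weighted_average_ensemble(
    preds: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average: each model's prediction is scaled by its weight."""
    return [sum(w * p for w, p in zip(weights, day)) for day in zip(*preds)]


# Toy data (hypothetical): three models, three days of predicted daily cases.
preds = [
    [100.0, 120.0, 140.0],
    [90.0, 110.0, 150.0],
    [110.0, 130.0, 130.0],
]
train_errors = [0.1, 0.2, 0.4]  # hypothetical training WMAPE per model

weights = inverse_error_weights(train_errors)
avg = average_ensemble(preds)
wae = weighted_average_ensemble(preds, weights)
print("average ensemble:", avg)
print("weights:", weights)
print("weighted average ensemble:", wae)
```

Under this assumed weighting, a model with a lower training error pulls the combined forecast towards its own predictions. A stacking ensemble would instead fit a second-stage learner (an SVM in the Results) on the individual models' predictions.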

Results: Booster shot rate + Omicron variant BA.5 rate was the most commonly selected combination of covariates. On raw data evaluated using the WMAPE, the individual models achieved the following: generalised additive modelling (GAM) reached 0.244 for daily confirmed cases, time series Poisson reached 0.172 for daily confirmed deaths, and both autoregressive integrated moving average (ARIMA) and time series Poisson reached 0.022 for daily ICU patients. On smoothed data, the Holt-Winters model achieved 0.058 for daily confirmed cases, while ARIMA attained 0.058 for daily confirmed deaths and 0.013 for daily ICU patients. Among ensemble models, the stacking ensemble based on a support vector machine (SVM) achieved error values of 0.235 for daily confirmed cases, 0.118 for daily deaths, and 0.019 for daily ICU patients on raw data. On smoothed data, the average ensemble and the weighted average ensemble achieved 0.060 for daily confirmed cases and 0.013 for daily ICU patients. The ensemble models also generalised well when applied to data from the USA.

The source record also carries a second version of these results with slightly different values. In that version, on raw data and based on WMAPE, GAM (0.244) predicted daily confirmed cases best, time series Poisson (0.172) predicted daily confirmed deaths best, and both ARIMA and time series Poisson (0.022) predicted daily ICU patients best. On smoothed data, time series Poisson predicted daily confirmed cases best (0.065), while ARIMA best predicted daily confirmed deaths (0.058) and ICU patients (0.013). Among ensemble models, the stacking ensemble using SVM was the best model for predicting daily confirmed cases (0.228), deaths (0.11), and ICU patients (0.02). On smoothed data, the average ensemble and WAE were the best models for predicting daily confirmed cases (0.058) and ICU patients (0.011). The performance of the ensemble models also generalised to another country when evaluated on USA data.
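
The error values in the Results are WMAPE scores. As a rough illustration, the five error measures listed in the Methods can be computed as in the sketch below. It uses common textbook definitions, which are an assumption here: the abstract does not give the exact formulas (for example, whether r² is the coefficient of determination or a squared correlation), and the toy numbers are hypothetical.

```python
import math
from typing import Sequence


def wmape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Weighted MAPE: total absolute error divided by total absolute actual."""
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    return abs_err / sum(abs(a) for a in actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of per-day relative errors (undefined if any actual value is 0)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(mse(actual, predicted))


def r2(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy data (hypothetical): observed and predicted daily counts.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]

print("WMAPE:", wmape(actual, predicted))
print("MAPE: ", mape(actual, predicted))
print("MSE:  ", mse(actual, predicted))
print("RMSE: ", rmse(actual, predicted))
print("r2:   ", r2(actual, predicted))
```

Unlike MAPE, WMAPE divides by the total of the observed values, so days with small counts do not dominate the score. That property matters for series such as daily deaths or ICU patients, which can be close to zero.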

Conclusions: No single model performed consistently. While the ensemble models did not always provide the best predictions, a comparison of the first-best and second-best models showed that they performed considerably better than the single models. When an ensemble model was not the best-performing model, its performance was never far from that of the best single model: the mean and variance of the error measures show that the ensemble models provided stable predictions, with little variation in performance compared to the single models. These results can be used to inform policymaking during future pandemics.

Source journal

Journal of Global Health (PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH)

CiteScore: 6.10

Self-citation rate: 2.80%

Articles published: 240

Review time: 6 weeks

About the journal: Journal of Global Health is a peer-reviewed journal published by the Edinburgh University Global Health Society, a not-for-profit organization registered in the UK. We publish editorials, news, viewpoints, original research and review articles in two issues per year.