Predicting the trend of SARS-CoV-2 mutation frequencies using historical data

Impact factor: 5.4

Bioinformatics (Oxford, England) | Pub Date: 2025-10-02 | DOI: 10.1093/bioinformatics/btaf508

Xinyu Zhou, Yi Yan, Kevin Hu, Haixu Tang, Yijie Wang, Lu Wang, Chi Zhang, Sha Cao


Citations: 0

Abstract


Predicting the trend of SARS-CoV-2 mutation frequencies using historical data.

Motivation: As the SARS-CoV-2 virus rapidly evolves, predicting the trajectory of viral mutations has become a critical yet complex task. A deep understanding of future mutation patterns, in particular the mutations that will prevail in the near future, is vital in steering diagnostics, therapeutics, and vaccine strategies for disease control.

Results: In this study, we developed a model to forecast future SARS-CoV-2 mutation surges in real-time, using historical mutation frequency data from the USA. We transformed the temporal prediction problem into a supervised learning framework using a sliding window approach. This involved breaking the time series of mutation frequencies into very short segments. Considering the time-dependent nature of the data, we focused on modeling the first-order derivative of the mutation frequency. We predicted the final derivative in each segment based on the preceding derivatives, employing various machine learning methods, including random forest, XGBoost, support vector machine, and neural network models. Empowered by the novel transformation strategy and the high capacity of machine learning models, we observed low prediction error that is confined within 0.1% and 1% when making predictions of mutation rates for the future 30 and 80 days, respectively. In addition, the method also led to a notable increase in prediction accuracy compared to traditional time-series models, as evidenced by much lower MAE (Mean Absolute Error) and MSE (Mean Squared Error) for predictions made within different time horizons. To further assess the method's effectiveness and robustness in predicting mutation patterns for unforeseen mutations, we first designed a synthetic case where we categorized all mutations into three major patterns. The model demonstrated its robustness by accurately predicting unseen mutation patterns when training on data from two pattern categories while testing on the third pattern category, showcasing its potential in forecasting a variety of mutation trajectories. We then applied our method to prediction for a recent time frame between 1 January 2025 and 10 June 2025, for both the USA and UK, where the model training was conducted using frequency sequence data collected between 12 December 2019 and 26 January 2023 in the USA. The model demonstrated superior performance for both datasets.
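
The sliding-window transformation described in the Results can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' SWD implementation: the function names, the window length, the clipping of frequencies to [0, 1], and the persistence predictor used as a stand-in for the random forest, XGBoost, SVM, or neural network models are all hypothetical choices made for this example.

```python
from typing import Callable, List, Sequence, Tuple


def first_differences(freq: Sequence[float]) -> List[float]:
    """Discrete first-order derivative: d[t] = freq[t + 1] - freq[t]."""
    return [b - a for a, b in zip(freq, freq[1:])]


def sliding_windows(
    deriv: Sequence[float], window: int
) -> Tuple[List[List[float]], List[float]]:
    """Cut a derivative series into overlapping segments of length `window`.

    Within each segment, the first `window - 1` derivatives are the features
    and the final derivative is the regression target.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    X, y = [], []
    for start in range(len(deriv) - window + 1):
        segment = list(deriv[start:start + window])
        X.append(segment[:-1])
        y.append(segment[-1])
    return X, y


def forecast(
    freq: Sequence[float],
    window: int,
    predict_next: Callable[[List[float]], float],
    steps: int,
) -> List[float]:
    """Roll forward: predict the next derivative, add it to the level, repeat."""
    if window < 2:
        raise ValueError("window must be at least 2")
    history = first_differences(freq)[-(window - 1):]
    level = freq[-1]
    path = []
    for _ in range(steps):
        d_hat = predict_next(history)
        # Assumption for this sketch: clip so the frequency stays in [0, 1].
        level = min(1.0, max(0.0, level + d_hat))
        path.append(level)
        history = history[1:] + [d_hat]
    return path


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Toy frequency series for one mutation (made-up numbers, not real data).
    freq = [0.0, 0.25, 0.5, 0.75]
    X, y = sliding_windows(first_differences(freq), window=3)
    print(X, y)
    # Persistence stand-in: assume the next derivative equals the latest one.
    print(forecast(freq, window=3, predict_next=lambda h: h[-1], steps=2))
```

In practice the `(X, y)` pairs would be pooled across many mutations and passed to a regressor of choice; the persistence rule here only keeps the sketch dependency-free.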

Availability and implementation: To enhance accessibility and utility, we built our methodology into a GitHub package (https://github.com/ZhouXY199502/SWD). Our method is potentially applicable to other infectious diseases and forecasting tasks, extending its relevance beyond the current COVID pandemic.

Supplementary information: Supplementary data are available at Bioinformatics online.


Source journal

Bioinformatics (Oxford, England)

Self-citation rate

0.00%

Number of articles published