Xinyu Zhou, Yi Yan, Kevin Hu, Haixu Tang, Yijie Wang, Lu Wang, Chi Zhang, Sha Cao
Journal: Bioinformatics (Oxford, England). Published: 2025-10-02. DOI: https://doi.org/10.1093/bioinformatics/btaf508. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12502910/pdf/
Predicting the trend of SARS-CoV-2 mutation frequencies using historical data.
Motivation: As the SARS-CoV-2 virus rapidly evolves, predicting the trajectory of viral mutations has become a critical yet complex task. A deep understanding of future mutation patterns, in particular of the mutations that will prevail in the near future, is vital for steering diagnostics, therapeutics, and vaccine strategies for disease control.
Results: We developed a model that forecasts future SARS-CoV-2 mutation surges in real time from historical mutation frequency data collected in the USA. We recast the temporal prediction problem as a supervised learning problem using a sliding window approach, which breaks the time series of mutation frequencies into very short segments. Because the data are time-dependent, we modeled the first-order derivative of the mutation frequency and predicted the final derivative in each segment from the preceding derivatives, using several machine learning methods: random forest, XGBoost, support vector machine, and neural network models. With this transformation strategy and the high capacity of machine learning models, prediction error stayed within 0.1% and 1% when predicting mutation rates 30 and 80 days into the future, respectively. The method was also notably more accurate than traditional time-series models, with much lower mean absolute error (MAE) and mean squared error (MSE) across different time horizons. To assess the method's effectiveness and robustness on unforeseen mutations, we first designed a synthetic case in which all mutations were categorized into three major patterns. Trained on data from two pattern categories and tested on the third, the model accurately predicted the unseen mutation patterns, showing its potential for forecasting a variety of mutation trajectories. We then applied the method to a recent time frame, 1 January 2025 to 10 June 2025, for both the USA and the UK, with the model trained on frequency sequence data collected in the USA between 12 December 2019 and 26 January 2023. The model demonstrated superior performance for both datasets.
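The sliding-window transformation described in the Results can be illustrated with a minimal sketch. It is not the SWD package's code: the function names, the window width of 3, and the toy frequency values (in percent) are all assumptions made for illustration, and the abstract does not state the segment length the authors used. The sketch turns a frequency series into first-order differences, cuts them into short overlapping segments whose last value is the regression target, and scores a naive stand-in predictor with MAE and MSE.

```python
from typing import List, Sequence, Tuple


def first_differences(freqs: Sequence[float]) -> List[float]:
    """Discrete first-order derivative: d[t] = f[t+1] - f[t]."""
    return [b - a for a, b in zip(freqs, freqs[1:])]


def sliding_windows(
    derivs: Sequence[float], width: int
) -> Tuple[List[List[float]], List[float]]:
    """Cut the derivative series into overlapping segments of `width` values.

    In each segment, the first width-1 derivatives are the features and the
    final derivative is the target to predict.
    """
    if width < 2:
        raise ValueError("width must be at least 2")
    features, targets = [], []
    for start in range(len(derivs) - width + 1):
        segment = list(derivs[start:start + width])
        features.append(segment[:-1])
        targets.append(segment[-1])
    return features, targets


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Toy mutation frequencies in percent (hypothetical values).
    freqs = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
    derivs = first_differences(freqs)
    X, y = sliding_windows(derivs, width=3)  # width=3 is an assumption

    # Stand-in predictor (NOT one of the paper's models): repeat the most
    # recent derivative in each window.
    preds = [row[-1] for row in X]
    print("MAE:", mae(y, preds), "MSE:", mse(y, preds))

    # A predicted derivative is converted back to a frequency forecast by
    # adding it to the last observed frequency.
    print("next frequency forecast:", freqs[-1] + derivs[-1])
```

The feature rows and targets produced by `sliding_windows` have the shape that a regressor such as a random forest or XGBoost model expects, so the stand-in predictor could be swapped for a fitted model without changing the transformation.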
Availability and implementation: To make the method accessible, we packaged it as a GitHub repository (https://github.com/ZhouXY199502/SWD). The method is potentially applicable to other infectious diseases and forecasting tasks, extending its relevance beyond the current COVID pandemic.