Yinan Feng, Emma E Goldberg, Michael Kupperman, Xitong Zhang, Youzuo Lin, Ruian Ke
{"title":"CovTransformer:用于SARS-CoV-2谱系频率预测的变压器模型。","authors":"Yinan Feng, Emma E Goldberg, Michael Kupperman, Xitong Zhang, Youzuo Lin, Ruian Ke","doi":"10.1093/ve/veae086","DOIUrl":null,"url":null,"abstract":"<p><p>With hundreds of SARS-CoV-2 lineages circulating in the global population, there is an ongoing need for predicting and forecasting lineage frequencies and thus identifying rapidly expanding lineages. Accurate prediction would allow for more focused experimental efforts to understand pathogenicity of future dominating lineages and characterize the extent of their immune escape. Here, we first show that the inherent noise and biases in lineage frequency data make a commonly-used regression-based approach unreliable. To address this weakness, we constructed a machine learning model for SARS-CoV-2 lineage frequency forecasting, called CovTransformer, based on the transformer architecture. We designed our model to navigate challenges such as a limited amount of data with high levels of noise and bias. We first trained and tested the model using data from the UK and the USA, and then tested the generalization ability of the model to many other countries and US states. Remarkably, the trained model makes accurate predictions two months into the future with high levels of accuracy both globally (in 31 countries with high levels of sequencing effort) and at the US-state level. Our model performed substantially better than a widely used forecasting tool, the multinomial regression model implemented in Nextstrain, demonstrating its utility in SARS-CoV-2 monitoring. Assuming a newly emerged lineage is identified and assigned, our test using retrospective data shows that our model is able to identify the dominating lineages 7 weeks in advance on average before they became dominant. Overall, our work demonstrates that transformer models represent a promising approach for SARS-CoV-2 forecasting and pandemic monitoring.</p>","PeriodicalId":56026,"journal":{"name":"Virus Evolution","volume":"10 1","pages":"veae086"},"PeriodicalIF":5.5000,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11631054/pdf/","citationCount":"0","resultStr":"{\"title\":\"CovTransformer: A transformer model for SARS-CoV-2 lineage frequency forecasting.\",\"authors\":\"Yinan Feng, Emma E Goldberg, Michael Kupperman, Xitong Zhang, Youzuo Lin, Ruian Ke\",\"doi\":\"10.1093/ve/veae086\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>With hundreds of SARS-CoV-2 lineages circulating in the global population, there is an ongoing need for predicting and forecasting lineage frequencies and thus identifying rapidly expanding lineages. Accurate prediction would allow for more focused experimental efforts to understand pathogenicity of future dominating lineages and characterize the extent of their immune escape. Here, we first show that the inherent noise and biases in lineage frequency data make a commonly-used regression-based approach unreliable. To address this weakness, we constructed a machine learning model for SARS-CoV-2 lineage frequency forecasting, called CovTransformer, based on the transformer architecture. We designed our model to navigate challenges such as a limited amount of data with high levels of noise and bias. We first trained and tested the model using data from the UK and the USA, and then tested the generalization ability of the model to many other countries and US states. Remarkably, the trained model makes accurate predictions two months into the future with high levels of accuracy both globally (in 31 countries with high levels of sequencing effort) and at the US-state level. Our model performed substantially better than a widely used forecasting tool, the multinomial regression model implemented in Nextstrain, demonstrating its utility in SARS-CoV-2 monitoring. Assuming a newly emerged lineage is identified and assigned, our test using retrospective data shows that our model is able to identify the dominating lineages 7 weeks in advance on average before they became dominant. Overall, our work demonstrates that transformer models represent a promising approach for SARS-CoV-2 forecasting and pandemic monitoring.</p>\",\"PeriodicalId\":56026,\"journal\":{\"name\":\"Virus Evolution\",\"volume\":\"10 1\",\"pages\":\"veae086\"},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2024-11-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11631054/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Virus Evolution\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1093/ve/veae086\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q1\",\"JCRName\":\"VIROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Virus Evolution","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1093/ve/veae086","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"VIROLOGY","Score":null,"Total":0}
CovTransformer: A transformer model for SARS-CoV-2 lineage frequency forecasting.
With hundreds of SARS-CoV-2 lineages circulating in the global population, there is an ongoing need for predicting and forecasting lineage frequencies and thus identifying rapidly expanding lineages. Accurate prediction would allow for more focused experimental efforts to understand pathogenicity of future dominating lineages and characterize the extent of their immune escape. Here, we first show that the inherent noise and biases in lineage frequency data make a commonly-used regression-based approach unreliable. To address this weakness, we constructed a machine learning model for SARS-CoV-2 lineage frequency forecasting, called CovTransformer, based on the transformer architecture. We designed our model to navigate challenges such as a limited amount of data with high levels of noise and bias. We first trained and tested the model using data from the UK and the USA, and then tested the generalization ability of the model to many other countries and US states. Remarkably, the trained model makes accurate predictions two months into the future with high levels of accuracy both globally (in 31 countries with high levels of sequencing effort) and at the US-state level. Our model performed substantially better than a widely used forecasting tool, the multinomial regression model implemented in Nextstrain, demonstrating its utility in SARS-CoV-2 monitoring. Assuming a newly emerged lineage is identified and assigned, our test using retrospective data shows that our model is able to identify the dominating lineages 7 weeks in advance on average before they became dominant. Overall, our work demonstrates that transformer models represent a promising approach for SARS-CoV-2 forecasting and pandemic monitoring.
期刊介绍:
Virus Evolution is a new Open Access journal focusing on the long-term evolution of viruses, viruses as a model system for studying evolutionary processes, viral molecular epidemiology and environmental virology.
The aim of the journal is to provide a forum for original research papers, reviews, commentaries and a venue for in-depth discussion on the topics relevant to virus evolution.