Forecasting dominance of SARS-CoV-2 lineages by anomaly detection using deep AutoEncoders.

bioRxiv : the preprint server for biology Pub Date : 2024-09-26 DOI:10.1101/2023.10.24.563721

Simone Rancati, Giovanna Nicora, Mattia Prosperi, Riccardo Bellazzi, Marco Salemi, Simone Marini


Citations: 0

Abstract

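The COVID-19 pandemic has shown the need for a rapid, effective genomic-based surveillance system to predict emerging SARS-CoV-2 variants and lineages. Traditional molecular epidemiology methods, which rely on public health surveillance or integrated sequence databases, can characterize the evolutionary history of infection waves and genetic evolution, but fall short in predicting the future outlook of viral genetic change. To bridge this gap, we introduce a novel deep learning, autoencoder-based anomaly detection method for SARS-CoV-2 (DeepAutoCoV). It is trained and updated on the global public SARS-CoV-2 GISAID database. Using the Spike (S) protein, DeepAutoCoV identifies future dominant lineages (FDLs), defined as lineages that make up at least 25% of the SARS-CoV-2 genomes added in a given week. Our algorithm is based on anomaly detection via an unsupervised approach, which is necessary because FDLs can only be known a posteriori (i.e., after they have become dominant). We developed two competing approaches (a linear unsupervised one and a posterior supervised one) to evaluate DeepAutoCoV's performance. DeepAutoCoV identifies FDLs, using the Spike (S) protein, with a median lead time of 31 weeks on global data, achieving a positive predictive value about 7 times better and 23% higher than the other approaches. In addition, it predicts vaccine-related FDLs up to 17 months in advance. Finally, DeepAutoCoV is not only predictive but also interpretable, as it can pinpoint specific mutations within FDLs, generating hypotheses about a potential increase in the virulence or transmissibility of a lineage. By integrating genomic surveillance with artificial intelligence, our work marks a transformative step that may provide valuable insights for optimizing public health prevention and intervention strategies.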

The abstract above was machine-translated; where it differs from the original English text below, the original prevails.
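
The translated abstract describes flagging candidate lineages by unsupervised anomaly detection. As a rough illustration of that idea only, the sketch below scores sequences by reconstruction error. It is not DeepAutoCoV's architecture: a PCA projection (the optimal linear autoencoder) stands in for the deep autoencoder, and the binary mutation vectors, the 30-site length, the 1% flip rate, and the mean-plus-three-standard-deviations threshold are all assumptions made for this example.

```python
import numpy as np


def fit_linear_autoencoder(X, n_components):
    """Fit a PCA projection, used here as a stand-in linear autoencoder."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def reconstruction_error(X, mean, components):
    """Mean squared error between each row and its low-rank reconstruction."""
    centered = np.atleast_2d(X) - mean
    reconstructed = centered @ components.T @ components
    return ((centered - reconstructed) ** 2).mean(axis=1)


# Hypothetical toy data: 200 "circulating" sequences, each a binary vector
# over 30 sites (1 = mutated), drawn from 3 prototype lineages with rare flips.
rng = np.random.default_rng(0)
n_sites = 30
prototypes = rng.integers(0, 2, size=(3, n_sites)).astype(float)
labels = rng.integers(0, 3, size=200)
flips = rng.random((200, n_sites)) < 0.01
base = prototypes[labels]
known = np.where(flips, 1.0 - base, base)

mean, components = fit_linear_autoencoder(known, n_components=2)
train_err = reconstruction_error(known, mean, components)

# Assumed decision rule: anything reconstructed much worse than the
# training data is treated as an anomaly.
threshold = train_err.mean() + 3 * train_err.std()

# A hypothetical novel sequence: prototype 0 with 12 sites flipped.
novel = prototypes[0].copy()
novel[:12] = 1.0 - novel[:12]
novel_err = reconstruction_error(novel, mean, components)[0]
novel_is_anomaly = bool(novel_err > threshold)
```

The design point is that no labels are needed at training time: the model only learns what circulating sequences look like, which matches the abstract's remark that FDLs can only be labeled after they have become dominant.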


Forecasting dominance of SARS-CoV-2 lineages by anomaly detection using deep AutoEncoders.

The coronavirus disease of 2019 (COVID-19) pandemic is characterized by the sequential emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants, lineages, and sublineages, which outcompete previously circulating ones because of, among other factors, increased transmissibility and immune escape. We propose DeepAutoCoV, an unsupervised deep learning anomaly detection system to predict future dominant lineages (FDLs). We define FDLs as viral (sub)lineages that will constitute more than 10% of all the viral sequences added to the GISAID database in a given week. DeepAutoCoV is trained and validated by assembling global and country-specific data sets from over 16 million Spike protein sequences sampled over a period of about 4 years. DeepAutoCoV successfully flags FDLs at very low frequencies (0.01% to 3%), with median lead times of 4-17 weeks, and predicts FDLs ~5 and ~25 times better than a baseline approach. For example, the B.1.617.2 vaccine reference strain was flagged as an FDL when its frequency was only 0.01%, more than a year before it was considered for an updated COVID-19 vaccine. Furthermore, DeepAutoCoV outputs interpretable results by pinpointing specific mutations potentially linked to increased fitness, and may provide significant insights for the optimization of public health pre-emptive intervention strategies.
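
The FDL definition in the abstract (a lineage accounting for more than 10% of the sequences added in a given week) can be made concrete with a small sketch. The function name, the `(week, lineage)` record format, and the counts below are hypothetical and chosen only for illustration; they are not taken from the paper or from GISAID.

```python
from collections import Counter, defaultdict


def weekly_dominant_lineages(records, threshold=0.10):
    """Map each week to the lineages whose share of that week's
    submitted sequences is strictly greater than `threshold`.

    `records` is an iterable of (week, lineage) pairs, one per sequence.
    """
    counts = defaultdict(Counter)
    for week, lineage in records:
        counts[week][lineage] += 1

    dominant = {}
    for week, lineage_counts in counts.items():
        total = sum(lineage_counts.values())
        dominant[week] = {
            lineage
            for lineage, n in lineage_counts.items()
            if n / total > threshold
        }
    return dominant


# Made-up submissions: B.1.617.2 is rare in the first week (5%)
# and above the 10% threshold in the second (40%).
records = (
    [("2021-W01", "other")] * 95
    + [("2021-W01", "B.1.617.2")] * 5
    + [("2021-W20", "other")] * 60
    + [("2021-W20", "B.1.617.2")] * 40
)

dominant = weekly_dominant_lineages(records)
```

Under this definition a lineage is only labeled dominant in hindsight, once its weekly share is known; the lead time reported in the abstract is the gap between the week a lineage is first flagged and the first week in which it crosses the threshold.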
