Forecasting SARS-CoV-2 spike protein evolution from small data by deep learning and regression

IF 2.3

Frontiers in Systems Biology. Pub Date: 2024-04-09. DOI: 10.3389/fsysb.2024.1284668

Samuel King, Xinyi E. Chen, Sarah W. S. Ng, Kimia Rostin, Samuel V. Hahn, Tylo Roberts, Janella C. Schwab, Parneet Sekhon, Madina Kagieva, Taylor Reilly, Ruo Chen Qi, Paarsa Salman, Ryan J. Hong, Eric J. Ma, Steven J. Hallam

Abstract

The emergence of SARS-CoV-2 variants during the COVID-19 pandemic caused frequent global outbreaks that confounded public health efforts across many jurisdictions, highlighting the need for better understanding and prediction of viral evolution. Predictive models have been shown to support disease prevention efforts, such as with the seasonal influenza vaccine, but they require abundant data. For emerging viruses of concern, such models should ideally function with the relatively sparse data typically encountered at the early stages of a viral outbreak. Conventional discrete approaches have proven difficult to develop due to the spurious and reversible nature of amino acid mutations and the overwhelming number of possible protein sequences, which adds computational complexity. We hypothesized that these challenges could be addressed by encoding discrete protein sequences into continuous numbers, effectively reducing the data size while enhancing the resolution of evolutionarily relevant differences. To this end, we developed a viral protein evolution prediction model (VPRE), which reduces amino acid sequences into continuous numbers by using an artificial neural network called a variational autoencoder (VAE) and models their most statistically likely evolutionary trajectories over time using Gaussian process (GP) regression. To demonstrate VPRE, we used a small number of early SARS-CoV-2 spike protein sequences. We show that the VAE can be trained on a synthetic dataset based on these data. To recapitulate evolution along a phylogenetic path, we used only 104 spike protein sequences and trained the GP regression with the numerical variables to project evolution up to 5 months into the future. Our predictions contained novel variants, and the most frequent prediction mapped primarily to a sequence that differed by only a single amino acid from the most reported spike protein within the prediction timeframe. Novel variants in the spike receptor binding domain (RBD) were capable of binding human angiotensin-converting enzyme 2 (ACE2) in silico, with comparable or better binding than previously resolved RBD-ACE2 complexes. Together, these results indicate the utility and tractability of combining deep learning and regression to model viral protein evolution with relatively sparse datasets, toward developing more effective medical interventions.
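The two-stage idea described in the abstract (encode discrete sequences as continuous numbers, then regress those numbers against time) can be illustrated with a minimal sketch. This is not the VPRE implementation: a linear PCA projection stands in for the variational autoencoder, the four-residue toy sequences and monthly time points are invented for illustration, and the RBF kernel with its `length_scale`, `variance`, and `noise` values is an assumed choice rather than one taken from the paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Flatten a protein sequence into a one-hot vector of length 20 * len(seq)."""
    mat = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        mat[pos, AMINO_ACIDS.index(aa)] = 1.0
    return mat.ravel()

def rbf_kernel(a, b, length_scale=2.0, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of time points."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(t_train, y_train, t_test, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP with an RBF kernel."""
    K = rbf_kernel(t_train, t_train) + noise * np.eye(len(t_train))
    K_s = rbf_kernel(t_test, t_train)
    K_ss = rbf_kernel(t_test, t_test)
    mean = K_s @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

def decode(z, components, offset):
    """Map a latent vector back to a sequence by per-position argmax."""
    x_hat = z @ components + offset
    scores = x_hat.reshape(-1, len(AMINO_ACIDS))
    return "".join(AMINO_ACIDS[i] for i in scores.argmax(axis=1))

# Hypothetical toy lineage: one sequence per month along a single path.
toy_seqs = ["ACDE", "ACDE", "ACDF", "ACDF", "ACGF", "ACGF"]
months = np.arange(len(toy_seqs), dtype=float)

# Stage 1: continuous encoding (PCA here, standing in for a trained VAE encoder).
X = np.stack([one_hot(s) for s in toy_seqs])
offset = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - offset, full_matrices=False)
components = Vt[:2]                      # two latent dimensions
Z = (X - offset) @ components.T          # shape (6, 2)

# Stage 2: one independent GP per latent dimension, projected forward in time.
future = np.array([6.0, 7.0])
z_pred = np.zeros((len(future), Z.shape[1]))
z_var = np.zeros_like(z_pred)
for dim in range(Z.shape[1]):
    z_pred[:, dim], z_var[:, dim] = gp_predict(months, Z[:, dim], future)

# Decode each forecast latent point back into sequence space.
decoded = [decode(z, components, offset) for z in z_pred]
for t, seq, var in zip(future, decoded, z_var):
    print(f"month {t:.0f}: {seq} (latent variances {np.round(var, 3)})")
```

In this sketch the posterior variance grows with the forecast horizon, which is the GP's way of signalling that projections further from the observed time points are less certain. A trained VAE would replace the PCA step with a nonlinear encoder and decoder, and sampling from the GP posterior rather than decoding only its mean would yield a distribution of candidate sequences.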
