Learning the Protein Language Model of SARS-CoV-2 Spike Proteins

2023 International Conference on Artificial Intelligence in Information and Communication (ICAIIC). Pub Date: 2023-02-20. DOI: 10.1109/ICAIIC57133.2023.10067040

Paul Vincent Llanes, Geoffrey A. Solano, Marc Jermaine Pontiveros



Abstract

The SARS-CoV-2 virus has continued to evolve, posing an increased risk in terms of infectivity and transmissibility and causing a greater impact on communities worldwide. With the surge of collected SARS-CoV-2 sequences, studies have found that most emerging variants are linked to increased mutations in the spike (S) protein, as observed in the Alpha, Beta, Gamma, and Delta variants. Multiple genomic surveillance approaches have been used to monitor the mutational status and spread of the virus; however, most depend heavily on labels attributed to these sequences. Hence, this study presents a system that learns the protein language model of SARS-CoV-2 spike proteins from sequence data alone, based on a bidirectional long short-term memory (BiLSTM) recurrent neural network. After the sequence embeddings are obtained from the model, clusters are generated using the Leiden clustering algorithm and visualized to monitor similarities between variants in terms of grammatical probability and semantic change. Additionally, the system measures the validity of a user-generated next-generation sequence, capturing potential sequence mutations indicative of viral escape, particularly substitution mutations. Further studies on methods for uncovering the semantic rules that govern spike proteins are recommended to learn more about other viral characteristics relevant to the future of the COVID-19 pandemic.
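The abstract does not state how grammatical probability and semantic change are computed, so the following is only a minimal sketch under stated assumptions. It assumes grammaticality is the model's probability of the substituted residue at its position, and semantic change is the L1 distance between the wild-type and mutant sequence embeddings. The names `toy_embed`, `toy_probs`, and `score_substitutions` are hypothetical, and the two toy functions are simple stand-ins for the outputs of a trained BiLSTM, not the authors' model.

```python
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(seq: str) -> List[float]:
    """Hypothetical stand-in for a learned sequence embedding:
    the amino-acid composition (frequency) vector of the sequence."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def toy_probs(seq: str, pos: int) -> Dict[str, float]:
    """Hypothetical stand-in for the language model output at `pos`:
    add-one-smoothed residue frequencies of the surrounding context."""
    context = seq[:pos] + seq[pos + 1:]
    total = len(context) + len(AMINO_ACIDS)
    return {a: (context.count(a) + 1) / total for a in AMINO_ACIDS}


def score_substitutions(
    wild_type: str,
    embed: Callable[[str], List[float]],
    probs: Callable[[str, int], Dict[str, float]],
) -> List[Tuple[str, float, float]]:
    """Score every single-residue substitution of `wild_type`.

    Returns (mutation, grammaticality, semantic_change) tuples, where
    grammaticality is the assumed model probability of the new residue
    and semantic_change is the L1 distance between embeddings.
    """
    base = embed(wild_type)
    results = []
    for pos, wt in enumerate(wild_type):
        p = probs(wild_type, pos)
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue  # not a substitution
            mutant = wild_type[:pos] + aa + wild_type[pos + 1:]
            change = sum(abs(x - y) for x, y in zip(embed(mutant), base))
            # Mutation label uses 1-based positions, e.g. "M1A".
            results.append((f"{wt}{pos + 1}{aa}", p[aa], change))
    return results


# Short illustrative fragment; a real run would use a full spike sequence.
scores = score_substitutions("MFVFLVLLPLVSSQ", toy_embed, toy_probs)
```

With a trained model in place of the toy functions, substitutions that score high on both quantities would be the candidates to inspect for possible escape; how the two scores are combined is left open here because the abstract does not specify it.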

