Learning the Protein Language Model of SARS-CoV-2 Spike Proteins

Paul Vincent Llanes, Geoffrey A. Solano, Marc Jermaine Pontiveros
{"title":"Learning the Protein Language Model of SARS-CoV-2 Spike Proteins","authors":"Paul Vincent Llanes, Geoffrey A. Solano, Marc Jermaine Pontiveros","doi":"10.1109/ICAIIC57133.2023.10067040","DOIUrl":null,"url":null,"abstract":"Ahstract-SARS-CoV-2 virus has long been evolving posing an increased risk in terms of infectivity and transmissibility which causes greater impact in communities worldwide. With the surge of collected SARS-CoV-2 sequences, studies found out that most of the emerging variants are linked to increased mutations in the spike (S) protein as observed in Alpha, Beta, Gamma, and Delta variants. Multiple approaches on genomic surveillance have been performed to monitor the mutational status and spread of the virus however most are heavily dependent on labels attributed to these sequences. Hence, this study features a system that has the capability to learn the protein language model of SARS-CoV-2 spike proteins, based on a bidirectional long-short term memory (BiLSTM) recurrent neural network, using sequence data alone. Upon obtaining the sequence embedding from the model, observed clusters are generated using the Leiden clustering algorithm and is visualized to monitor similarities between variants in terms of grammatical probability and semantic change. Additionally, the system measures the validity of a user-generated next-generation sequence capturing potential sequence mutations indicative of viral escape, particularly mutations by substitutions. Further studies on methods uncovering semantic rules that govern spike proteins are recommended to learn more about other viral characteristics conclusive of the future of the COVID-19 pandemic.","PeriodicalId":105769,"journal":{"name":"2023 International Conference on Artificial Intelligence in Information and Communication (ICAIIC)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Conference on Artificial Intelligence in Information and Communication (ICAIIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICAIIC57133.2023.10067040","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Ahstract-SARS-CoV-2 virus has long been evolving posing an increased risk in terms of infectivity and transmissibility which causes greater impact in communities worldwide. With the surge of collected SARS-CoV-2 sequences, studies found out that most of the emerging variants are linked to increased mutations in the spike (S) protein as observed in Alpha, Beta, Gamma, and Delta variants. Multiple approaches on genomic surveillance have been performed to monitor the mutational status and spread of the virus however most are heavily dependent on labels attributed to these sequences. Hence, this study features a system that has the capability to learn the protein language model of SARS-CoV-2 spike proteins, based on a bidirectional long-short term memory (BiLSTM) recurrent neural network, using sequence data alone. Upon obtaining the sequence embedding from the model, observed clusters are generated using the Leiden clustering algorithm and is visualized to monitor similarities between variants in terms of grammatical probability and semantic change. Additionally, the system measures the validity of a user-generated next-generation sequence capturing potential sequence mutations indicative of viral escape, particularly mutations by substitutions. Further studies on methods uncovering semantic rules that govern spike proteins are recommended to learn more about other viral characteristics conclusive of the future of the COVID-19 pandemic.
SARS-CoV-2刺突蛋白的蛋白质语言模型研究
摘要- sars - cov -2病毒长期以来一直在进化,在传染性和传播性方面的风险越来越大,对全球社区造成了更大的影响。随着收集到的SARS-CoV-2序列的激增,研究发现,大多数新出现的变体与α、β、γ和δ变体中观察到的刺突(S)蛋白突变增加有关。已经采取了多种基因组监测方法来监测病毒的突变状态和传播,但大多数方法严重依赖于这些序列的标签。因此,本研究的特点是一个能够学习SARS-CoV-2刺突蛋白的蛋白质语言模型的系统,基于双向长短期记忆(BiLSTM)递归神经网络,仅使用序列数据。从模型中获得序列嵌入后,使用Leiden聚类算法生成观察到的聚类,并将其可视化,从语法概率和语义变化方面监测变体之间的相似性。此外,该系统测量用户生成的下一代序列的有效性,捕获指示病毒逃逸的潜在序列突变,特别是由替换引起的突变。建议进一步研究发现控制刺突蛋白的语义规则的方法,以了解更多关于COVID-19大流行未来的其他病毒特征。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信