Learning the Protein Language Model of SARS-CoV-2 Spike Proteins
Paul Vincent Llanes, Geoffrey A. Solano, Marc Jermaine Pontiveros
2023 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), 2023-02-20
DOI: 10.1109/ICAIIC57133.2023.10067040 (https://doi.org/10.1109/ICAIIC57133.2023.10067040)
Abstract
The SARS-CoV-2 virus has continued to evolve, posing an increased risk in terms of infectivity and transmissibility and causing a greater impact on communities worldwide. With the surge of collected SARS-CoV-2 sequences, studies have found that most emerging variants are linked to increased mutations in the spike (S) protein, as observed in the Alpha, Beta, Gamma, and Delta variants. Multiple genomic surveillance approaches have been used to monitor the mutational status and spread of the virus; however, most depend heavily on labels attributed to these sequences. Hence, this study presents a system that learns the protein language model of SARS-CoV-2 spike proteins, based on a bidirectional long short-term memory (BiLSTM) recurrent neural network, using sequence data alone. Once sequence embeddings are obtained from the model, clusters are generated using the Leiden clustering algorithm and visualized to monitor similarities between variants in terms of grammatical probability and semantic change. Additionally, the system measures the validity of a user-generated next-generation sequence, capturing potential sequence mutations indicative of viral escape, particularly substitution mutations. Further studies on methods for uncovering the semantic rules that govern spike proteins are recommended to learn more about other viral characteristics that bear on the future course of the COVID-19 pandemic.
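The abstract does not spell out how grammatical probability and semantic change are computed. One common formulation in the viral language model literature treats grammaticality as the model's probability of the mutant residue at the mutated position, and semantic change as the L1 distance between the embeddings of the original and mutated sequences. The sketch below scores a single substitution under that assumption. It is not the authors' implementation: `fit_profile` (per-position residue frequencies) and `embed` (2-mer counts) are hypothetical toy stand-ins for the trained BiLSTM and its embeddings, and the peptides are made-up strings, not real spike fragments.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fit_profile(aligned_seqs, pseudocount=1.0):
    """Toy stand-in for the language model (assumption, not a BiLSTM):
    smoothed residue frequencies at each position of equal-length sequences."""
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def embed(seq, k=2):
    """Toy stand-in for a learned embedding (assumption):
    counts of overlapping k-mers in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def semantic_change(wild_type, mutant):
    """L1 distance between the two sequence embeddings."""
    a, b = embed(wild_type), embed(mutant)
    return sum(abs(a[key] - b[key]) for key in set(a) | set(b))


def grammaticality(profile, position, residue):
    """Model probability of `residue` at the 0-based `position`."""
    return profile[position][residue]


def score_substitution(wild_type, profile, position, residue):
    """Apply one substitution and report both scores for the mutant."""
    mutant = wild_type[:position] + residue + wild_type[position + 1:]
    return {
        "mutant": mutant,
        "grammaticality": grammaticality(profile, position, residue),
        "semantic_change": semantic_change(wild_type, mutant),
    }


if __name__ == "__main__":
    # Made-up training peptides, for illustration only.
    training = ["MFVFLV", "MFVFLV", "MFVFLA", "MFIFLV"]
    profile = fit_profile(training)
    for residue in "IW":
        print(score_substitution("MFVFLV", profile, 2, residue))
```

Under this formulation, a substitution that scores high on both measures (still plausible to the model, yet far from the original in embedding space) is the kind of candidate one would flag as a potential escape mutation. With a real model, `fit_profile` and `embed` would be replaced by the network's output probabilities and hidden-state embeddings.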