PhosBERT: A self-supervised learning model for identifying phosphorylation sites in SARS-CoV-2-infected human cells

IF 4.2; Zone 3 (Biology); JCR Q1 (Biochemical Research Methods)
Yong Li, Ru Gao, Shan Liu, Hongqi Zhang, Hao Lv, Hongyan Lai
Methods, Volume 230, Pages 140-146. Published 2024-08-22. DOI: 10.1016/j.ymeth.2024.08.004
https://www.sciencedirect.com/science/article/pii/S1046202324001865

Abstract

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a single-stranded RNA virus that mainly causes respiratory and enteric diseases and is responsible for the outbreak of coronavirus disease 2019 (COVID-19). Numerous studies have demonstrated that SARS-CoV-2 infection leads to significant dysregulation of the protein post-translational modification profile in human cells. Accurate recognition of phosphorylation sites in host cells will contribute to a deeper understanding of the pathogenic mechanisms of SARS-CoV-2 and help to screen drugs and compounds with antiviral potential. There is therefore a need for cost-effective, high-precision computational strategies that specifically identify phosphorylation sites in SARS-CoV-2-infected cells. In this work, we first implemented a custom neural network model, named PhosBERT, on the basis of the pre-trained protein language model ProtBert, a self-supervised model built on the Bidirectional Encoder Representations from Transformers (BERT) architecture. PhosBERT was then trained and validated with 5-fold cross-validation on a serine (S) and threonine (T) phosphorylation dataset and on a tyrosine (Y) phosphorylation dataset, separately. Independent validation showed that PhosBERT identified S/T phosphorylation sites with an accuracy of 81.9% and an AUC (area under the receiver operating characteristic curve) of 0.896. For Y phosphorylation sites, the accuracy and AUC reached 87.1% and 0.902, respectively. These results indicate that the proposed model has good predictive ability and stability and provides a new approach for studying phosphorylation sites in SARS-CoV-2-infected cells.
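
The evaluation protocol named in the abstract (5-fold cross-validation, accuracy, AUC) can be illustrated with a minimal sketch. This is not the authors' code: the function names `stratified_kfold`, `accuracy`, and `roc_auc`, the stratified splitting, and the 0.5 decision threshold are all assumptions made for illustration, and the abstract does not say how the folds were built or which threshold was used.

```python
import random
from typing import List, Sequence


def stratified_kfold(labels: Sequence[int], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds with a roughly equal class ratio.

    Hypothetical helper: stratification is an assumption, not a detail
    taken from the abstract.
    """
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win; this pairwise form equals the area under
    the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In use, each fold in turn would serve as the validation set while a classifier is fitted on the other four, and `accuracy` and `roc_auc` would be computed on the held-out predictions. The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch but slow for large datasets.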

Source journal: Methods (Biology: Biochemical Research Methods)
CiteScore: 9.80
Self-citation rate: 2.10%
Articles published: 222
Review time: 11.3 weeks
About the journal: Methods focuses on rapidly developing techniques in the experimental biological and medical sciences. Each topical issue, organized by a guest editor who is an expert in the area covered, consists solely of invited quality articles by specialist authors, many of them reviews. Issues are devoted to specific technical approaches with emphasis on clear detailed descriptions of protocols that allow them to be reproduced easily. The background information provided enables researchers to understand the principles underlying the methods; other helpful sections include comparisons of alternative methods giving the advantages and disadvantages of particular methods, guidance on avoiding potential pitfalls, and suggestions for troubleshooting.