基于进化比例模型的O-GlcNAc蛋白修饰位点特异性预测

IF 2.6 3区综合性期刊 Q1 MULTIDISCIPLINARY SCIENCES

PLoS ONE Pub Date : 2024-12-31 eCollection Date: 2024-01-01 DOI:10.1371/journal.pone.0316215

Ayesha Khalid, Afshan Kaleem, Wajahat Qazi, Roheena Abdullah, Mehwish Iqtedar, Shagufta Naz

{"title":"基于进化比例模型的O-GlcNAc蛋白修饰位点特异性预测","authors":"Ayesha Khalid, Afshan Kaleem, Wajahat Qazi, Roheena Abdullah, Mehwish Iqtedar, Shagufta Naz","doi":"10.1371/journal.pone.0316215","DOIUrl":null,"url":null,"abstract":"Protein glycosylation, a vital post-translational modification, is pivotal in various biological processes and disease pathogenesis. Computational approaches, including protein language models and machine learning algorithms, have emerged as valuable tools for predicting O-GlcNAc sites, reducing experimental costs, and enhancing efficiency. However, the literature has not reported the prediction of O-GlcNAc sites through the evolutionary scale model (ESM). Therefore, this study employed the ESM-2 model for O-GlcNAc site prediction in humans. Approximately 1100 O-linked glycoprotein sequences retrieved from the O-GlcNAc database were utilized for model training. The ESM-2 model exhibited consistent improvement over epochs, achieving an accuracy of 78.30%, recall of 78.30%, precision of 61.31%, and F1-score of 68.74%. However, compared to the traditional models which show an overfitting on the same data up to 99%, ESM-2 model outperforms in terms of optimal training and testing predictions. These findings underscore the effectiveness of the ESM-2 model in accurately predicting O-GlcNAc sites within human proteins. Accurately predicting O-GlcNAc sites within human proteins can significantly advance glycoproteomic research by enhancing our understanding of protein function and disease mechanisms, aiding in developing targeted therapies, and facilitating biomarker discovery for improved diagnosis and treatment. Furthermore, future studies should focus on more diverse data types, longer protein sequence lengths, and higher computational resources to evaluate various parameters. Accurate prediction of O-GlcNAc sites might enhance the investigation of the site-specific functions of proteins in physiology and diseases.","PeriodicalId":20189,"journal":{"name":"PLoS ONE","volume":"19 12","pages":"e0316215"},"PeriodicalIF":2.6000,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11687694/pdf/","citationCount":"0","resultStr":"{\"title\":\"Site-specific prediction of O-GlcNAc modification in proteins using evolutionary scale model.\",\"authors\":\"Ayesha Khalid, Afshan Kaleem, Wajahat Qazi, Roheena Abdullah, Mehwish Iqtedar, Shagufta Naz\",\"doi\":\"10.1371/journal.pone.0316215\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Protein glycosylation, a vital post-translational modification, is pivotal in various biological processes and disease pathogenesis. Computational approaches, including protein language models and machine learning algorithms, have emerged as valuable tools for predicting O-GlcNAc sites, reducing experimental costs, and enhancing efficiency. However, the literature has not reported the prediction of O-GlcNAc sites through the evolutionary scale model (ESM). Therefore, this study employed the ESM-2 model for O-GlcNAc site prediction in humans. Approximately 1100 O-linked glycoprotein sequences retrieved from the O-GlcNAc database were utilized for model training. The ESM-2 model exhibited consistent improvement over epochs, achieving an accuracy of 78.30%, recall of 78.30%, precision of 61.31%, and F1-score of 68.74%. However, compared to the traditional models which show an overfitting on the same data up to 99%, ESM-2 model outperforms in terms of optimal training and testing predictions. These findings underscore the effectiveness of the ESM-2 model in accurately predicting O-GlcNAc sites within human proteins. Accurately predicting O-GlcNAc sites within human proteins can significantly advance glycoproteomic research by enhancing our understanding of protein function and disease mechanisms, aiding in developing targeted therapies, and facilitating biomarker discovery for improved diagnosis and treatment. Furthermore, future studies should focus on more diverse data types, longer protein sequence lengths, and higher computational resources to evaluate various parameters. Accurate prediction of O-GlcNAc sites might enhance the investigation of the site-specific functions of proteins in physiology and diseases.\",\"PeriodicalId\":20189,\"journal\":{\"name\":\"PLoS ONE\",\"volume\":\"19 12\",\"pages\":\"e0316215\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2024-12-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11687694/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"PLoS ONE\",\"FirstCategoryId\":\"103\",\"ListUrlMain\":\"https://doi.org/10.1371/journal.pone.0316215\",\"RegionNum\":3,\"RegionCategory\":\"综合性期刊\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q1\",\"JCRName\":\"MULTIDISCIPLINARY SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"PLoS ONE","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1371/journal.pone.0316215","RegionNum":3,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}

引用次数: 0

摘要

蛋白糖基化是一种重要的翻译后修饰，在各种生物过程和疾病发病机制中起着关键作用。计算方法，包括蛋白质语言模型和机器学习算法，已经成为预测O-GlcNAc位点、降低实验成本和提高效率的有价值的工具。然而，文献尚未报道通过进化尺度模型（ESM）预测O-GlcNAc位点。因此，本研究采用ESM-2模型对人类O-GlcNAc位点进行预测。从O-GlcNAc数据库中检索到大约1100个o -连锁糖蛋白序列用于模型训练。ESM-2模型的准确率为78.30%，召回率为78.30%，准确率为61.31%，f1得分为68.74%。然而，与传统模型在相同数据上的过拟合率高达99%相比，ESM-2模型在最佳训练和测试预测方面表现优于传统模型。这些发现强调了ESM-2模型在准确预测人类蛋白质中O-GlcNAc位点方面的有效性。准确预测人类蛋白质中的O-GlcNAc位点可以通过增强我们对蛋白质功能和疾病机制的理解，帮助开发靶向治疗，促进生物标志物的发现，从而改善诊断和治疗，从而显著推进糖蛋白组学研究。此外，未来的研究应着眼于更多样化的数据类型，更长的蛋白质序列长度，以及更高的计算资源来评估各种参数。准确预测O-GlcNAc位点可能有助于研究蛋白质在生理和疾病中的位点特异性功能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Site-specific prediction of O-GlcNAc modification in proteins using evolutionary scale model.

查看原文本刊更多论文

Site-specific prediction of O-GlcNAc modification in proteins using evolutionary scale model.

Protein glycosylation, a vital post-translational modification, is pivotal in various biological processes and disease pathogenesis. Computational approaches, including protein language models and machine learning algorithms, have emerged as valuable tools for predicting O-GlcNAc sites, reducing experimental costs, and enhancing efficiency. However, the literature has not reported the prediction of O-GlcNAc sites through the evolutionary scale model (ESM). Therefore, this study employed the ESM-2 model for O-GlcNAc site prediction in humans. Approximately 1100 O-linked glycoprotein sequences retrieved from the O-GlcNAc database were utilized for model training. The ESM-2 model exhibited consistent improvement over epochs, achieving an accuracy of 78.30%, recall of 78.30%, precision of 61.31%, and F1-score of 68.74%. However, compared to the traditional models which show an overfitting on the same data up to 99%, ESM-2 model outperforms in terms of optimal training and testing predictions. These findings underscore the effectiveness of the ESM-2 model in accurately predicting O-GlcNAc sites within human proteins. Accurately predicting O-GlcNAc sites within human proteins can significantly advance glycoproteomic research by enhancing our understanding of protein function and disease mechanisms, aiding in developing targeted therapies, and facilitating biomarker discovery for improved diagnosis and treatment. Furthermore, future studies should focus on more diverse data types, longer protein sequence lengths, and higher computational resources to evaluate various parameters. Accurate prediction of O-GlcNAc sites might enhance the investigation of the site-specific functions of proteins in physiology and diseases.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

PLoS ONE 生物-生物学

CiteScore

6.20

自引率

5.40%

发文量

14242

审稿时长

3.7 months

期刊介绍： PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides: * Open-access—freely accessible online, authors retain copyright * Fast publication times * Peer review by expert, practicing researchers * Post-publication tools to indicate quality and impact * Community-based dialogue on articles * Worldwide media coverage