CoNglyPred: Accurate Prediction of N-Linked Glycosylation Sites Using ESM-2 and Structural Features With Graph Network and Co-Attention.

IF 3.4 · Zone 4 (Biology) · Q2 Biochemical Research Methods

Proteomics · Pub Date: 2024-10-03 · DOI: 10.1002/pmic.202400210

Hongmei Wang, Long Zhao, Ziyuan Yu, Ximin Zeng, Shaoping Shi


Citations: 0

Abstract

N-linked glycosylation is crucial for biological processes such as protein folding, immune response, and cellular transport. Traditional experimental methods for determining N-linked glycosylation sites require substantial time and labor, which has led to the development of computational approaches as a more efficient alternative. However, because 3D structural data are scarce, existing prediction methods often struggle to make full use of structural information and fall short in integrating sequence and structural information effectively. Motivated by progress in protein pretrained language models (pLMs) and the breakthrough in protein structure prediction, we introduce a high-accuracy model called CoNglyPred. Having compared various pLMs, we opt for the large-scale pLM ESM-2 to extract sequence embeddings, which mitigates certain limitations of manual feature extraction. Meanwhile, our approach employs a graph transformer network to process the 3D protein structures predicted by AlphaFold2. The final graph output and the ESM-2 embedding are then integrated through a co-attention mechanism. In a series of comprehensive experiments on the independent test dataset, CoNglyPred outperforms state-of-the-art models, and it also performs strongly in a case study. In addition, we are the first to report the uncertainty of N-linked glycosylation predictors using expected calibration error and expected uncertainty calibration error.
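For readers unfamiliar with the calibration metric named in the abstract, the following is a minimal sketch of expected calibration error (ECE) for a binary site predictor. It uses the common formulation: group predictions into confidence bins, then take the bin-size-weighted average of the gap between accuracy and mean confidence. The abstract does not give the authors' binning scheme or implementation, so the function name, the 10 equal-width bins, and the 0.5 decision threshold are all assumptions made for illustration. The expected uncertainty calibration error is not covered here.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """ECE for a binary classifier (illustrative; not the paper's code).

    probs  -- predicted probability of the positive class (glycosylated site)
    labels -- ground-truth labels, 0 or 1
    n_bins -- number of equal-width confidence bins (assumed, not from the paper)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one prediction is required")

    # Per-bin running sums: [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        predicted = 1 if p >= 0.5 else 0  # assumed decision threshold
        # Confidence is the probability assigned to the predicted class.
        confidence = p if predicted == 1 else 1.0 - p
        # Equal-width bins over [0, 1]; confidence == 1.0 goes in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index][0] += 1
        bins[index][1] += confidence
        bins[index][2] += int(predicted == y)

    total = len(probs)
    ece = 0.0
    for count, conf_sum, correct in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        accuracy = correct / count
        mean_confidence = conf_sum / count
        ece += (count / total) * abs(accuracy - mean_confidence)
    return ece
```

A lower value means the predicted probabilities track observed accuracy more closely. A predictor that outputs 0.75 and is right three times out of four in that bin contributes zero error, while one that is always fully confident and always wrong scores 1.0.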




Source journal: Proteomics (Biology; Biochemical Research Methods)

CiteScore: 6.30
Self-citation rate: 5.90%
Articles published: 193
Review time: 3 months

About the journal: PROTEOMICS is the premier international source for information on all aspects of applications and technologies, including software, in proteomics and other "omics". The journal includes but is not limited to proteomics, genomics, transcriptomics, metabolomics and lipidomics, and systems biology approaches. Papers describing novel applications of proteomics and integration of multi-omics data and approaches are especially welcome.