StackGlyEmbed: prediction of N-linked glycosylation sites using protein language models.

Impact factor: 2.8 · JCR Q2 (Mathematical & Computational Biology)

Bioinformatics Advances · Pub Date: 2025-06-28 · eCollection Date: 2025-01-01 · DOI: 10.1093/bioadv/vbaf146

Md Muhaiminul Islam Nafi, M Saifur Rahman



Abstract

Motivation: N-linked glycosylation is one of the most fundamental post-translational modifications (PTMs), in which oligosaccharides are covalently attached to asparagine (N) residues. These sites occur within conserved motifs such as N-X-S or N-X-T, where X can be any residue except proline (P). Predicting N-linked glycosylation sites is important because these PTMs play a vital role in many biological processes and functions. Experimental methods for detecting N-linked glycosylation sites, such as mass spectrometry, are very expensive. Computational prediction of these sites has therefore become an important research field.
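The N-X-S/T motif described above can be located with a simple pattern scan. The following is a minimal sketch, not part of StackGlyEmbed itself; the function name `find_sequons` and the example sequence are hypothetical. It only lists candidate asparagines. A matching motif does not guarantee that the site is actually glycosylated, which is why a predictor is needed.

```python
import re

# N followed by any residue except proline, then serine or threonine.
# The lookahead leaves the following residues unconsumed, so overlapping
# motifs (e.g. "NNTS") are all reported.
SEQUON = re.compile(r"N(?=[^P][ST])")

def find_sequons(sequence):
    """Return 1-based positions of asparagines in N-X-S/T motifs (X != P)."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]

# Hypothetical toy sequence, not taken from the paper's data set.
print(find_sequons("MNGTNPSANNTS"))  # [2, 9, 10]
```

Position 5 is skipped because the residue after that asparagine is proline.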

Results: In this work, we propose StackGlyEmbed, a stacking ensemble machine learning model, to computationally predict N-linked glycosylation sites. We explored embeddings from several protein language models and built the stacking ensemble using Support Vector Machine (SVM), Extreme Gradient Boosting (XGB) and K-nearest Neighbor (KNN) learners in the base layer, with a second SVM model in the meta layer. StackGlyEmbed achieves 98.2% sensitivity, 92.5% balanced accuracy, 89.1% F1-score and 82.6% Matthews correlation coefficient in independent testing, outperforming the existing state-of-the-art methods.
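The two-layer architecture described above can be sketched with scikit-learn's `StackingClassifier`. This is an illustration of the general stacking idea only, not the authors' implementation. The assumptions are: random vectors stand in for protein language model embeddings, scikit-learn's `GradientBoostingClassifier` stands in for XGB to avoid an extra dependency, and all hyperparameters are library defaults rather than the values used in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for per-site embedding vectors (assumption: the real
# features come from protein language models, which are not run here).
n_sites, dim = 200, 16
y = rng.integers(0, 2, size=n_sites)
X = rng.normal(size=(n_sites, dim)) + 1.5 * y[:, None]  # shift positives

# Base layer: SVM, gradient boosting (stand-in for XGB) and KNN.
base_learners = [
    ("svm", SVC(probability=True, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# Meta layer: a second SVM trained on cross-validated base-layer outputs.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=SVC(random_state=0),
    cv=5,
)

stack.fit(X[:150], y[:150])
pred = stack.predict(X[150:])
print("held-out accuracy:", (pred == y[150:]).mean())
```

Training the meta learner on out-of-fold predictions (`cv=5`) keeps it from seeing base-layer outputs produced on the same samples the base learners were fitted on.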

Availability and implementation: StackGlyEmbed is freely available at: https://github.com/nafcoder/StackGlyEmbed.
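The measures quoted in the Results (sensitivity, balanced accuracy, F1-score and Matthews correlation coefficient) are all functions of the binary confusion matrix. The sketch below uses the standard definitions; the helper name `binary_metrics` and the example counts are hypothetical and are not the paper's test-set counts.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present and
    at least one positive prediction was made).
    """
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical counts for illustration only.
print(binary_metrics(tp=8, fp=2, tn=8, fn=2))
```

Balanced accuracy and MCC are useful alongside sensitivity because they also penalize false positives, which sensitivity alone ignores.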

