StackGlyEmbed: prediction of N-linked glycosylation sites using protein language models.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY
Bioinformatics advances Pub Date : 2025-06-28 eCollection Date: 2025-01-01 DOI:10.1093/bioadv/vbaf146
Md Muhaiminul Islam Nafi, M Saifur Rahman
Journal: Bioinformatics advances, volume 5, issue 1, article vbaf146. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12237515/pdf/
Citation count: 0

Abstract

Motivation: N-linked glycosylation is one of the most fundamental post-translational modifications (PTMs), in which oligosaccharides bond covalently to asparagine (N). These sites occur in conserved motifs such as N-X-S or N-X-T, where X can be any residue except proline (P). Predicting N-linked glycosylation sites is important because these PTMs play a vital role in many biological processes and functions. Experimental methods for detecting N-linked glycosylation sites, such as mass spectrometry, are very expensive. Computational prediction of these sites has therefore become an important research field.
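
The N-X-S/T motif described above can be located with a short scan. The following is a minimal sketch, not part of StackGlyEmbed: the function name `find_sequons` and the example sequence are hypothetical. It only lists candidate asparagines; a motif match does not by itself mean the site is glycosylated, which is the question a predictor is meant to answer.

```python
import re

# N, then any residue except proline, then S or T.
# The lookahead lets overlapping motifs (e.g. "NNTS") all be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of N in N-X-[S/T] motifs where X is not P."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Hypothetical toy sequence, used only to illustrate the scan.
example = "MNGSANPTQNNTS"
print(find_sequons(example))  # [2, 10, 11]; N6 is skipped because X is proline
```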

Results: In this work, we propose StackGlyEmbed, a stacking ensemble machine learning model that computationally predicts N-linked glycosylation sites. We explored embeddings from several protein language models and built the stacking ensemble with Support Vector Machine (SVM), Extreme Gradient Boosting (XGB), and K-nearest Neighbor (KNN) learners in the base layer and a second SVM model in the meta layer. In independent testing, StackGlyEmbed achieves 98.2% sensitivity, 92.5% balanced accuracy, an 89.1% F1-score, and an 82.6% Matthews correlation coefficient, outperforming existing state-of-the-art methods.
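
The two-layer design and the four reported metrics can be illustrated with scikit-learn. This is a hedged sketch under stated assumptions, not the authors' implementation: the feature matrix is synthetic (standing in for protein language model embeddings), `GradientBoostingClassifier` is swapped in for XGB to avoid an extra dependency, and all hyperparameters are library defaults or arbitrary choices. The scores it prints have no relation to the figures reported in the abstract.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             matthews_corrcoef, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic vectors stand in for per-site embeddings (not real data).
X, y = make_classification(n_samples=400, n_features=32, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Base layer: SVM, gradient boosting (stand-in for XGB), and KNN.
base_learners = [
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
]

# Meta layer: a second SVM trained on cross-validated base-learner outputs.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=SVC(random_state=0),
                           cv=5)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)

# The four metrics named in the abstract; sensitivity is recall on the
# positive class.
scores = {
    "sensitivity": recall_score(y_test, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "mcc": matthews_corrcoef(y_test, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Fitting the meta learner on cross-validated predictions (`cv=5`) rather than on in-sample outputs keeps it from simply learning which base model overfit the training set.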

Availability and implementation: StackGlyEmbed is freely available at: https://github.com/nafcoder/StackGlyEmbed.


Source journal metrics: CiteScore 1.60; self-citation rate 0.00%; article count 0.