Enhancing Arabidopsis thaliana ubiquitination site prediction through knowledge distillation and natural language processing

IF 4.2 | Region 3 (Biology) | JCR Q1, BIOCHEMICAL RESEARCH METHODS

Methods | Pub Date: 2024-10-22 | DOI: 10.1016/j.ymeth.2024.10.006

Van-Nui Nguyen, Thi-Xuan Tran, Thi-Tuyen Nguyen, Nguyen Quoc Khanh Le

Methods, volume 232, pages 65-71. Article page: https://www.sciencedirect.com/science/article/pii/S1046202324002238

Citations: 0

Abstract

Protein ubiquitination is a critical post-translational modification (PTM) involved in diverse biological processes and plays a pivotal role in regulating physiological mechanisms and disease states. Despite various efforts to develop ubiquitination site prediction tools across species, these tools mainly rely on predefined sequence features and machine learning algorithms, with species-specific variations in ubiquitination patterns remaining poorly understood. This study introduces a novel approach for predicting Arabidopsis thaliana ubiquitination sites using a neural network model based on knowledge distillation and natural language processing (NLP) of protein sequences. Our framework employs a multi-species “Teacher model” to guide a more compact, species-specific “Student model”, with the “Teacher” generating pseudo-labels that enhance the “Student” learning and prediction robustness. Cross-validation results demonstrate that our model achieves superior performance, with an accuracy of 86.3 % and an area under the curve (AUC) of 0.926, while independent testing confirmed these results with an accuracy of 86.3 % and an AUC of 0.923. Comparative analysis with established predictors further highlights the model’s superiority, emphasizing the effectiveness of integrating knowledge distillation and NLP in ubiquitination prediction tasks. This study presents a promising and efficient approach for ubiquitination site prediction, offering valuable insights for researchers in related fields. The code and resources are available on GitHub: https://github.com/nuinvtnu/KD_ArapUbi.
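The teacher-student setup described in the abstract can be illustrated with a small sketch. The abstract does not give the loss function, so everything below is an assumption made for illustration: the blended binary cross-entropy, the function names (`sigmoid`, `binary_cross_entropy`, `distillation_loss`), and the mixing weight `alpha` are hypothetical and are not taken from the KD_ArapUbi repository. The idea shown is a common distillation formulation in which the student is trained against both the experimental label for a candidate site and the teacher's predicted probability (the pseudo-label).

```python
import math


def sigmoid(z: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(p: float, target: float, eps: float = 1e-12) -> float:
    """Cross-entropy between a predicted probability p and a target in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def distillation_loss(
    student_logit: float,
    teacher_prob: float,
    hard_label: float,
    alpha: float = 0.5,
) -> float:
    """Blend the loss on the true label with the loss on the teacher's pseudo-label.

    alpha is a hypothetical mixing weight: alpha = 1.0 ignores the teacher,
    alpha = 0.0 trains the student on the teacher's pseudo-labels only.
    """
    p_student = sigmoid(student_logit)
    hard_term = binary_cross_entropy(p_student, hard_label)    # ground-truth site label
    soft_term = binary_cross_entropy(p_student, teacher_prob)  # teacher pseudo-label
    return alpha * hard_term + (1.0 - alpha) * soft_term


# A student that agrees with the teacher and the label is penalized less
# than one that disagrees with both.
agree = distillation_loss(student_logit=2.0, teacher_prob=0.9, hard_label=1.0)
disagree = distillation_loss(student_logit=-2.0, teacher_prob=0.9, hard_label=1.0)
print(agree < disagree)
```

In a setup like this, the soft term lets the species-specific student borrow what the multi-species teacher has learned, while the hard term keeps it anchored to the Arabidopsis thaliana labels. How the authors actually weight or combine the two signals would need to be checked against their code.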


Source journal

Methods (Biology: Biochemical Research Methods)

CiteScore: 9.80
Self-citation rate: 2.10%
Articles published: 222
Review time: 11.3 weeks

Journal description: Methods focuses on rapidly developing techniques in the experimental biological and medical sciences. Each topical issue, organized by a guest editor who is an expert in the area covered, consists solely of invited quality articles by specialist authors, many of them reviews. Issues are devoted to specific technical approaches with emphasis on clear detailed descriptions of protocols that allow them to be reproduced easily. The background information provided enables researchers to understand the principles underlying the methods; other helpful sections include comparisons of alternative methods giving the advantages and disadvantages of particular methods, guidance on avoiding potential pitfalls, and suggestions for troubleshooting.