DeepPTM: Protein Post-translational Modification Prediction from Protein Sequences by Combining Deep Protein Language Model with Vision Transformers

Impact Factor 2.4 · CAS Tier 3 (Biology) · JCR Q3 · Biochemical Research Methods
Necla Nisa Soylu, Emre Sefer
{"title":"DeepPTM: Protein Post-translational Modification Prediction from Protein Sequences by Combining Deep Protein Language Model with Vision Transformers","authors":"Necla Nisa Soylu, Emre Sefer","doi":"10.2174/0115748936283134240109054157","DOIUrl":null,"url":null,"abstract":"Introduction:: More recent self-supervised deep language models, such as Bidirectional Encoder Representations from Transformers (BERT), have performed the best on some language tasks by contextualizing word embeddings for a better dynamic representation. Their proteinspecific versions, such as ProtBERT, generated dynamic protein sequence embeddings, which resulted in better performance for several bioinformatics tasks. Besides, a number of different protein post-translational modifications are prominent in cellular tasks such as development and differentiation. The current biological experiments can detect these modifications, but within a longer duration and with a significant cost. Methods:: In this paper, to comprehend the accompanying biological processes concisely and more rapidly, we propose DEEPPTM to predict protein post-translational modification (PTM) sites from protein sequences more efficiently. Different than the current methods, DEEPPTM enhances the modification prediction performance by integrating specialized ProtBERT-based protein embeddings with attention-based vision transformers (ViT), and reveals the associations between different modification types and protein sequence content. Additionally, it can infer several different modifications over different species. Results:: Human and mouse ROC AUCs for predicting Succinylation modifications were 0.988 and 0.965 respectively, once 10-fold cross-validation is applied. Similarly, we have obtained 0.982, 0.955, and 0.953 ROC AUC scores on inferring ubiquitination, crotonylation, and glycation sites, respectively. According to detailed computational experiments, DEEPPTM lessens the time spent in laboratory experiments while outperforming the competing methods as well as baselines on inferring all 4 modification sites. In our case, attention-based deep learning methods such as vision transformers look more favorable to learning from ProtBERT features than more traditional deep learning and machine learning techniques. Conclusion:: Additionally, the protein-specific ProtBERT model is more effective than the original BERT embeddings for PTM prediction tasks. Our code and datasets can be found at https://github.com/seferlab/deepptm.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2024-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Current Bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.2174/0115748936283134240109054157","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
Citations: 0

Abstract

Introduction: Recent self-supervised deep language models, such as Bidirectional Encoder Representations from Transformers (BERT), have achieved the best results on several language tasks by contextualizing word embeddings into dynamic representations. Their protein-specific counterparts, such as ProtBERT, generate dynamic protein sequence embeddings that improve performance on several bioinformatics tasks. In addition, a number of protein post-translational modifications play prominent roles in cellular processes such as development and differentiation. Current biological experiments can detect these modifications, but only over long durations and at significant cost.

Methods: To characterize the accompanying biological processes more concisely and rapidly, we propose DEEPPTM, which predicts protein post-translational modification (PTM) sites from protein sequences more efficiently. Unlike current methods, DEEPPTM improves modification prediction performance by integrating specialized ProtBERT-based protein embeddings with attention-based vision transformers (ViT), and it reveals associations between different modification types and protein sequence content. Additionally, it can infer several different modifications across different species.

Results: Under 10-fold cross-validation, the ROC AUCs for predicting succinylation modifications in human and mouse were 0.988 and 0.965, respectively. Similarly, we obtained ROC AUC scores of 0.982, 0.955, and 0.953 for inferring ubiquitination, crotonylation, and glycation sites, respectively. According to detailed computational experiments, DEEPPTM reduces the time spent on laboratory experiments while outperforming competing methods and baselines on inferring all four modification sites. In our setting, attention-based deep learning methods such as vision transformers appear better suited to learning from ProtBERT features than more traditional deep learning and machine learning techniques.

Conclusion: The protein-specific ProtBERT model is also more effective than the original BERT embeddings for PTM prediction tasks. Our code and datasets can be found at https://github.com/seferlab/deepptm.
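The Methods section above describes a two-stage pipeline: per-residue embeddings from ProtBERT, followed by an attention-based (ViT-style) classifier over a sequence window around each candidate site. The sketch below only illustrates that idea and is not the authors' released implementation; the window radius, the plain PyTorch transformer-encoder head standing in for the ViT component, and the mean-pooling classifier are all assumptions, and the actual DeepPTM architecture should be taken from https://github.com/seferlab/deepptm.

```python
# Illustrative sketch: ProtBERT site-window embeddings + an attention-based head.
# NOT the DeepPTM reference implementation; window size, head, and pooling are assumptions.
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
protbert = BertModel.from_pretrained("Rostlab/prot_bert").eval()

def embed_window(sequence: str, site: int, radius: int = 16) -> torch.Tensor:
    """Return per-residue ProtBERT embeddings for a window around `site` (0-based)."""
    window = sequence[max(0, site - radius): site + radius + 1]
    # ProtBERT expects residues separated by spaces, e.g. "M K T A ..."
    tokens = tokenizer(" ".join(window), return_tensors="pt")
    with torch.no_grad():
        hidden = protbert(**tokens).last_hidden_state  # (1, len+2, 1024) incl. [CLS]/[SEP]
    return hidden[0, 1:-1]                             # drop special tokens -> (len, 1024)

class AttentionPTMHead(nn.Module):
    """Transformer-encoder head standing in for the ViT component of the paper."""
    def __init__(self, dim: int = 1024, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.classifier = nn.Linear(dim, 1)  # single logit: site modified or not

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, window_len, 1024)
        encoded = self.encoder(x)
        return self.classifier(encoded.mean(dim=1)).squeeze(-1)

# Example: score one candidate site (untrained head, so the score itself is meaningless).
emb = embed_window("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", site=7).unsqueeze(0)
print(torch.sigmoid(AttentionPTMHead()(emb)))
```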
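The Results section reports ROC AUCs under 10-fold cross-validation for each modification type. The snippet below is a minimal sketch of that evaluation protocol only, assuming precomputed site embeddings and binary labels; the logistic-regression classifier is a placeholder for the ProtBERT + ViT model, not part of DEEPPTM.

```python
# Sketch of 10-fold cross-validated ROC AUC evaluation; data and classifier are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 1024))   # placeholder site embeddings
labels = rng.integers(0, 2, size=200)     # placeholder modified / unmodified labels

aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(features, labels):
    clf = LogisticRegression(max_iter=1000).fit(features[train_idx], labels[train_idx])
    scores = clf.predict_proba(features[test_idx])[:, 1]
    aucs.append(roc_auc_score(labels[test_idx], scores))

print(f"mean ROC AUC over 10 folds: {np.mean(aucs):.3f}")
```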
Source Journal

Current Bioinformatics (Biology – Biochemical Research Methods)

CiteScore: 6.60
Self-citation rate: 2.50%
Articles published: 77
Review time: >12 weeks

Aims & Scope: Current Bioinformatics aims to publish all the latest and outstanding developments in bioinformatics. Each issue contains a series of timely, in-depth/mini-reviews, research papers, and guest-edited thematic issues written by leaders in the field, covering a wide range of the integration of biology with computer and information science. The journal focuses on advances in computational molecular/structural biology, encompassing areas such as computing in biomedicine and genomics, computational proteomics and systems biology, and metabolic pathway engineering. Developments in these fields have direct implications for key issues related to health care, medicine, genetic disorders, development of agricultural products, renewable energy, environmental protection, etc.