Protein-peptide binding residue prediction based on protein language models and cross-attention mechanism

IF 2.6 | Zone 4, Biology | JCR Q2, Biochemical Research Methods

Analytical Biochemistry, vol. 694, Article 115637 | Pub Date: 2024-08-08 | DOI: 10.1016/j.ab.2024.115637

Jun Hu, Kai-Xin Chen, Bing Rao, Jing-Yuan Ni, Maha A. Thafar, Somayah Albaradei, Muhammad Arif

Citations: 0

Abstract

Accurate identification of protein-peptide binding residues is essential for understanding protein-peptide interactions and advancing drug discovery. To address this problem, extensive research effort has gone into designing more discriminative feature representations. However, extracting these explicit features usually depends on third-party tools, which results in low computational efficiency and limited predictive performance. In this study, we design an end-to-end deep learning-based method, E2EPep, for protein-peptide binding residue prediction using protein sequence only. E2EPep first employs and fine-tunes two state-of-the-art pre-trained protein language models, which extract two different latent feature representations, relevant to protein structure and function, from protein sequences. A novel feature fusion module is then designed in E2EPep to fuse and optimize these two feature representations of binding residues. In addition, we also design E2EPep+, which integrates the E2EPep and PepBCL models, to improve prediction performance. Experimental results on two independent test data sets show that E2EPep and E2EPep+ achieve average AUC values of 0.846 and 0.842, respectively, with an average Matthews correlation coefficient that is significantly higher than that of most existing sequence-based methods and comparable to that of state-of-the-art structure-based predictors. Detailed analysis shows that the primary strength of E2EPep lies in its feature representation, which uses a cross-attention mechanism to fuse the embeddings generated by the two fine-tuned protein language models. The standalone package of E2EPep and E2EPep+ is available at https://github.com/ckx259/E2EPep.git for academic use only.
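The abstract does not spell out the fusion module, so the following is only a minimal sketch of the general idea of cross-attention fusion, not the authors' implementation. It assumes two per-residue embedding matrices of the same sequence length (stand-ins for the outputs of the two language models), single-head scaled dot-product attention, concatenation of the two attended views, and a logistic output layer. All names, dimensions, and random weights are hypothetical. Plain Python lists keep the sketch dependency-free; a real model would use a tensor library and learned weights.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query_emb, context_emb, w_q, w_k, w_v):
    """Single-head scaled dot-product attention.

    Queries come from one embedding; keys and values come from the other.
    Returns the attended features and the attention weights.
    """
    q = matmul(query_emb, w_q)
    k = matmul(context_emb, w_k)
    v = matmul(context_emb, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def fuse(emb_a, emb_b, params_a, params_b):
    """Attend in both directions and concatenate the results per residue."""
    a_attends_b, _ = cross_attention(emb_a, emb_b, *params_a)
    b_attends_a, _ = cross_attention(emb_b, emb_a, *params_b)
    return [ra + rb for ra, rb in zip(a_attends_b, b_attends_a)]

def random_matrix(rng, rows, cols, scale=1.0):
    """Random Gaussian matrix; stands in for embeddings or learned weights."""
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, dim_a, dim_b, dim = 6, 8, 10, 4  # hypothetical toy sizes

# Hypothetical per-residue embeddings from two different language models.
emb_a = random_matrix(rng, seq_len, dim_a)
emb_b = random_matrix(rng, seq_len, dim_b)

# Projection weights (query, key, value) for each attention direction.
params_a = (
    random_matrix(rng, dim_a, dim, 1 / math.sqrt(dim_a)),
    random_matrix(rng, dim_b, dim, 1 / math.sqrt(dim_b)),
    random_matrix(rng, dim_b, dim, 1 / math.sqrt(dim_b)),
)
params_b = (
    random_matrix(rng, dim_b, dim, 1 / math.sqrt(dim_b)),
    random_matrix(rng, dim_a, dim, 1 / math.sqrt(dim_a)),
    random_matrix(rng, dim_a, dim, 1 / math.sqrt(dim_a)),
)

fused = fuse(emb_a, emb_b, params_a, params_b)

# Hypothetical logistic output layer: one binding probability per residue.
w_out = [rng.gauss(0.0, 1.0) / math.sqrt(2 * dim) for _ in range(2 * dim)]
probs = [1.0 / (1.0 + math.exp(-sum(f * w for f, w in zip(row, w_out)))) for row in fused]
```

Here `fused` holds one vector of length `2 * dim` per residue, and `probs` holds one score per residue that a threshold could turn into binding or non-binding labels. With random weights the scores are meaningless; the sketch only illustrates the data flow.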

Source journal

Analytical Biochemistry (Biology: Analytical Chemistry)

CiteScore: 5.70
Self-citation rate: 0.00%
Articles published: 283
Review time: 44 days

About the journal: The journal's title, Analytical Biochemistry: Methods in the Biological Sciences, declares its broad scope: methods for the basic biological sciences, including biochemistry, molecular genetics, cell biology, proteomics, immunology, bioinformatics, and wherever the frontiers of research take the field. The emphasis is on methods ranging from the strictly analytical to the more preparative, including novel approaches to protein purification as well as improvements in cell and organ culture. The techniques covered are equally inclusive, ranging from aptamers to zymology. The journal has been particularly active in: analytical techniques for biological molecules; aptamer selection and utilization; biosensors; chromatography; cloning, sequencing and mutagenesis; electrochemical methods; electrophoresis; enzyme characterization methods; immunological approaches; mass spectrometry of proteins and nucleic acids; metabolomics; nano-level techniques; and optical spectroscopy in all its forms. The journal is reluctant to include most drug and strictly clinical studies, as there are more suitable publication platforms for these types of papers.