Improving protein-protein interaction prediction using protein language model and protein network features

IF 2.6, Zone 4 (Biology), Q2 (Biochemical Research Methods)

Analytical Biochemistry, Pub Date: 2024-04-26, DOI: 10.1016/j.ab.2024.115550

Jun Hu, Zhe Li, Bing Rao, Maha A. Thafar, Muhammad Arif

Analytical Biochemistry, volume 693, Article 115550. Publisher page: https://www.sciencedirect.com/science/article/pii/S0003269724000940

Citations: 0

Abstract

Interactions between proteins are ubiquitous in a wide variety of biological processes. Accurately identifying protein-protein interactions (PPIs) is important for understanding the mechanisms of protein function and for facilitating drug discovery. Although wet-lab methods are the best way to identify PPIs, they are time-consuming, costly, and labor-intensive. Hence, much effort has gone into developing computational methods that improve the performance of PPI prediction. In this study, we propose a novel hybrid computational method, called KSGPPI, that aims to improve PPI prediction by extracting discriminative information from protein sequences and interaction networks. The KSGPPI model comprises two feature extraction modules. In the first module, a large protein language model, ESM-2, is employed to exploit the global complex patterns concealed within protein sequences. Feature representations are then further extracted through CKSAAP, and a two-dimensional convolutional neural network (CNN) is used to capture local information. In the second module, a protein similar to the query protein is retrieved from the STRING database via the sequence alignment tool NW-align, and the Node2vec algorithm is then used to capture a graph embedding feature for the query protein from the protein interaction network of the similar protein. Finally, the features from the two modules are fused, and the fused features are fed into a multilayer perceptron to predict PPIs. Five-fold cross-validation on the benchmark datasets shows that KSGPPI achieves an average prediction accuracy of 88.96%. Additionally, the average Matthews correlation coefficient of KSGPPI (0.781) is significantly higher than that of state-of-the-art PPI prediction methods. The standalone package of KSGPPI is freely available for download at https://github.com/rickleezhe/KSGPPI.
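The abstract names CKSAAP (composition of k-spaced amino acid pairs) without describing it. As a rough illustration only, the sketch below computes the standard CKSAAP encoding directly from a raw amino acid sequence: for each gap k, it counts how often each ordered residue pair occurs k positions apart and normalizes by the number of windows. This is not taken from the KSGPPI package. The function name `cksaap`, the constants `AMINO_ACIDS` and `PAIRS`, and the default `k_max=3` are assumptions made for this example, and the abstract does not say what gap range the authors use or how CKSAAP is combined with the ESM-2 output.

```python
from __future__ import annotations

from itertools import product

# The 20 standard amino acids; names here are illustrative, not from KSGPPI.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order: AA, AC, AD, ...
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence: str, k_max: int = 3) -> list[float]:
    """Return the CKSAAP vector of `sequence` for gaps k = 0..k_max.

    For each gap k, count the pairs (s[i], s[i + k + 1]) and divide by the
    number of windows, len(sequence) - k - 1. Pairs containing a
    non-standard residue are skipped. The result has 400 * (k_max + 1)
    entries. k_max = 3 is an assumed default, not a value from the paper.
    """
    seq = sequence.upper()
    features: list[float] = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_windows = len(seq) - k - 1
        for i in range(max(n_windows, 0)):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        # Avoid division by zero for sequences shorter than k + 2 residues.
        denom = n_windows if n_windows > 0 else 1
        features.extend(counts[p] / denom for p in PAIRS)
    return features
```

Calling `cksaap("MKVLAAGIVGL")` with the assumed default returns a vector of length 1600 (400 pairs for each of the four gaps k = 0, 1, 2, 3). A fixed-length vector like this can be reshaped into a (k_max + 1) x 400 grid, which is one plausible way to feed it to a two-dimensional CNN, though the abstract does not specify the layout the authors use.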

Source journal

Analytical Biochemistry (Biology, Analytical Chemistry)

CiteScore: 5.70

Self-citation rate: 0.00%

Articles published: 283

Review time: 44 days

Journal description: The journal's title, Analytical Biochemistry: Methods in the Biological Sciences, declares its broad scope: methods for the basic biological sciences, including biochemistry, molecular genetics, cell biology, proteomics, immunology, bioinformatics, and wherever the frontiers of research take the field. The emphasis is on methods, from the strictly analytical to the more preparative, including novel approaches to protein purification as well as improvements in cell and organ culture. The actual techniques are equally inclusive, ranging from aptamers to zymology. The journal has been particularly active in:

- Analytical techniques for biological molecules
- Aptamer selection and utilization
- Biosensors
- Chromatography
- Cloning, sequencing and mutagenesis
- Electrochemical methods
- Electrophoresis
- Enzyme characterization methods
- Immunological approaches
- Mass spectrometry of proteins and nucleic acids
- Metabolomics
- Nano level techniques
- Optical spectroscopy in all its forms

The journal is reluctant to include most drug and strictly clinical studies, as there are more suitable publication platforms for these types of papers.