An Extended Feature Representation Technique for Predicting Sequenced-based Host-pathogen Protein-protein Interaction

IF 2.4 · Zone 3 (Biology) · JCR Q3 · BIOCHEMICAL RESEARCH METHODS

Current Bioinformatics Pub Date : 2024-03-11 DOI:10.2174/0115748936286848240108074303

Jerry Emmanuel, Itunuoluwa Isewon, Grace Olasehinde, Jelili Oyelade

Citations: 0

Abstract

Background: Machine learning models for sequence-based protein-protein interaction prediction typically require amino acid sequences to be converted into feature vectors. Two approaches to this conversion appear in the literature: the Independent Protein Feature (IPF) and Merged Protein Feature (MPF) extraction methods. Most studies have adopted the IPF approach, while others have preferred the MPF method, in which host and pathogen sequences are concatenated before feature encoding.

Objective: This raises the question of which approach should be adopted to improve host-pathogen protein-protein interaction (HPPPI) prediction. This work therefore introduces the Extended Protein Feature (EPF) method.

Methods: The proposed method combines the predictive capabilities of IPF and MPF by extracting essential features, handling multicollinearity, and removing features with zero importance. EPF, IPF, and MPF were tested on bacteria, parasite, virus, and plant HPPPI datasets with six machine learning models: Random Forest (RF), Support Vector Machine (SVM), Multilayer Perceptron (MLP), Naïve Bayes (NB), Logistic Regression (LR), and Deep Forest (DF).

Results: MPF exhibited the lowest performance overall, whereas IPF performed better with decision tree-based models such as RF and DF. In contrast, EPF demonstrated improved performance with SVM, LR, NB, and MLP and also yielded competitive results with DF and RF.

Conclusion: The EPF approach developed in this study exhibits substantial improvements in four of the six models evaluated. This suggests that EPF is competitive with IPF and is particularly well suited to traditional machine learning models.
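The difference between the feature extraction strategies described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: amino acid composition is assumed as the descriptor (the abstract does not name one), IPF is assumed to mean encoding each protein separately and joining the two vectors, and the function names (`aac`, `ipf`, `mpf`, `drop_correlated`, `epf_candidates`) and the 0.95 correlation threshold are hypothetical. Only the multicollinearity step is shown, as a simple greedy Pearson filter; removing zero-importance features would require importances from a trained model and is not reproduced here.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: fraction of each standard residue (assumed descriptor)."""
    seq = seq.upper()
    n = len(seq) or 1  # avoid division by zero for empty sequences
    return [seq.count(a) / n for a in AMINO_ACIDS]


def ipf(host, pathogen):
    """Independent features: encode each protein separately, then join (40 values)."""
    return aac(host) + aac(pathogen)


def mpf(host, pathogen):
    """Merged features: concatenate the sequences first, then encode (20 values)."""
    return aac(host + pathogen)


def pearson(x, y):
    """Pearson correlation of two equal-length columns; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def drop_correlated(rows, threshold=0.95):
    """Return indices of columns to keep.

    Constant columns are skipped; a column is kept only if its absolute
    correlation with every previously kept column is at most `threshold`.
    This is a simplified stand-in for the paper's multicollinearity handling.
    """
    cols = list(zip(*rows))
    kept = []
    for j, col in enumerate(cols):
        if len(set(col)) == 1:
            continue
        if all(abs(pearson(col, cols[k])) <= threshold for k in kept):
            kept.append(j)
    return kept


def epf_candidates(pairs, threshold=0.95):
    """Join IPF and MPF vectors for each (host, pathogen) pair, then filter columns."""
    rows = [ipf(h, p) + mpf(h, p) for h, p in pairs]
    kept = drop_correlated(rows, threshold)
    return [[row[j] for j in kept] for row in rows], kept
```

Calling `epf_candidates` on a list of `(host_sequence, pathogen_sequence)` string pairs returns the filtered feature rows together with the indices of the retained columns, which could then be passed to any of the classifiers listed in the abstract.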

Source journal

Current Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 6.60

Self-citation rate: 2.50%

Articles published: (not listed)

Review time: >12 weeks

About the journal: Current Bioinformatics aims to publish the latest and most outstanding developments in bioinformatics. Each issue contains a series of timely in-depth reviews and mini-reviews, research papers, and guest-edited thematic issues written by leaders in the field, covering a wide range of work on the integration of biology with computer and information science. The journal focuses on advances in computational molecular and structural biology, encompassing areas such as computing in biomedicine and genomics, computational proteomics and systems biology, and metabolic pathway engineering. Developments in these fields have direct implications for key issues in health care, medicine, genetic disorders, the development of agricultural products, renewable energy, environmental protection, and related areas.