{"title":"Enhancing Drug Peptide Sequence Prediction Using Multi-view Feature Fusion Learning","authors":"Junyu Zhang, Ronglin Lu, Hongmei Zhou, Xinbo Jiang","doi":"10.2174/0115748936294345240510112941","DOIUrl":"https://doi.org/10.2174/0115748936294345240510112941","url":null,"abstract":"Background: Currently, various types of peptides have broad implications for human health and disease. Some drug peptides play significant roles in sensory science, drug research, and cancer biology. The prediction and classification of peptide sequences are of significant importance to various industries. However, predicting peptide sequences through biological experiments is a time-consuming and expensive process. Moreover, the task of protein sequence classification and prediction faces challenges due to the high dimensionality, nonlinearity, and irregularity of protein sequence data, along with the presence of numerous unknown or unlabeled protein sequences. Therefore, an accurate and efficient method for predicting peptide classification is necessary. Methods: In our work, we used two pre-trained models to extract sequence features, TextCNN (Convolutional Neural Networks for Text Classification) and Transformer. We extracted the overall semantic information of the sequences using Transformer Encoder and extracted the local semantic information between sequences using TextCNN and concatenated them into a new feature. Finally, we used the concatenated feature for classification prediction. To validate this approach, we conducted experiments on the BP dataset, THP dataset and DPP-IV dataset and compared them with some pre-trained models. Results: Since TextCNN and Transformer Encoder extract features from different perspectives, the concatenated feature contains multi-view information, which improves the accuracy of the peptide predictor. 
Conclusion: Ultimately, our model demonstrated superior metrics, highlighting its efficacy in peptide sequence prediction and classification.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"23 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141168940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Validating the Distinctiveness of the Omicron Lineage within the SARSCov-2 based on Protein Language Models","authors":"Ke Dong, Jingyang Gao","doi":"10.2174/0115748936291075240409080924","DOIUrl":"https://doi.org/10.2174/0115748936291075240409080924","url":null,"abstract":"Introduction: Variants of concern were identified in severe acute respiratory syndrome coronavirus 2, namely Alpha, Beta, Gamma, Delta, and Omicron. This study explores the mutations of the Omicron lineage and its differences from other lineages through a protein language model. Objective: Analyze the differences in the number of Omicron amino acid mutations compared to the other four VOC mutations using statistical methods, and use the protein language model ESM-1v to analyze the specificity of Omicron amino acid mutations. Methods: By inputting the severe acute respiratory syndrome coronavirus 2 wild-type sequence into the protein language model ESM-1v, this study obtained the score for each position mutating to other amino acids and calculated the overall trend of the mutation scores of a new variant of concern. Results: It was found that when the proportion of unobserved mutations to observed mutations is 4:15, Omicron still generates a large number of newly emerging mutations. It was also found that both the overall score and the overall ranking for the Omicron family are low. Conclusion: Mutations in the Omicron lineage are different from amino acid mutations in other lineages. The findings of this paper deepen the understanding of the spatial distribution of spike protein amino acid mutations and overall trends of newly emerging mutations corresponding to different variants of concern. 
This also provides insights into simulating the evolution of the Omicron lineage.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"63 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparative Analysis of Deep Generative Model for Industrial Enzyme Design","authors":"Beibei Zhang, Qiaozhen Meng, Chengwei Ai, Guihua Duan, Ercheng Wang, Fei Guo","doi":"10.2174/0115748936303223240404043202","DOIUrl":"https://doi.org/10.2174/0115748936303223240404043202","url":null,"abstract":"Although enzymes have the advantage of efficient catalysis, natural enzymes lack stability in industrial environments and may not even catalyze the required reactions. This creates an urgent need for the de novo design of new enzymes. Computational design is a powerful tool, allowing rapid and efficient exploration of sequence space and facilitating the design of novel enzymes tailored to specific conditions and requirements. It is beneficial to de novo design industrial enzymes using computational methods. Currently, the only tool explicitly designed for enzyme-only generation performs unsatisfactorily. We therefore selected several general protein sequence design tools and systematically evaluated their effectiveness when applied to specific industrial enzymes. We investigated the literature related to protein generation. We summarized the computational methods used for sequence generation into three categories: structure-conditional sequence generation, sequence generation without structural constraints, and co-generation of sequence and structure. To effectively evaluate the ability of six computational tools to generate enzyme sequences, we first constructed a luciferase dataset named Luc_64. Then we assessed the quality of enzyme sequences generated by these methods on this dataset, including amino acid distribution, EC number validation, etc. We also assessed sequences generated by structure-based methods on existing public datasets using sequence recovery rates and root-mean-square deviation (RMSD) from a sequence and structure perspective. 
On the functionality dataset Luc_64, ABACUS-R and ProteinMPNN stood out for producing sequences with amino acid distributions and functionalities closely matching those of naturally occurring luciferase enzymes, suggesting their effectiveness in preserving essential enzymatic characteristics. Across both benchmark datasets, ABACUS-R and ProteinMPNN also exhibited the highest sequence recovery rates, indicating their superior ability to generate sequences closely resembling the original enzyme structures. Our study provides a crucial reference for researchers selecting appropriate enzyme sequence design tools, highlighting the strengths and limitations of each tool in generating accurate and functional enzyme sequences. ProteinMPNN and ABACUS-R emerged as the most effective tools in our evaluation, offering high accuracy in sequence recovery and RMSD and maintaining the functional integrity of enzymes through accurate amino acid distribution. Meanwhile, the performance of general protein design tools when transferred to specific industrial enzymes was fairly evaluated on our industrial enzyme benchmark.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"35 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140582835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Effective Method to Identify Cooperation Driver Gene Sets","authors":"Wei Zhang, Yifu Zeng, Bihai Zhao, Jie Xiong, Tuanfei Zhu, Jingjing Wang, Guiji Li, Lei Wang","doi":"10.2174/0115748936293238240313081211","DOIUrl":"https://doi.org/10.2174/0115748936293238240313081211","url":null,"abstract":"Background: In cancer genomics research, identifying driver genes is a challenging task. Detecting cancer-driver genes can further our understanding of cancer risk factors and promote the development of personalized treatments. Gene mutations exhibit both mutual exclusivity and co-occurrence, and most existing methods focus on identifying driver pathways or driver gene sets through the study of mutual exclusivity, that is, functionally redundant gene sets. In contrast, less research has been conducted on cooperation genes with co-occurring mutations. Objective: We propose an effective method that combines two characteristics of genes, co-occurring mutations and the coordinated regulation of proliferation genes, to explore cooperation driver genes. Methods: This study is divided into three stages: (1) constructing a binary gene mutation matrix; (2) combining mutation co-occurrence characteristics to identify the candidate cooperation gene sets; and (3) constructing a gene regulation network to screen for the cooperation gene sets that synergistically regulate proliferation. Results: The performance of the method was evaluated on three TCGA cancer datasets, and the experiments showed that it can detect effective cooperation driver gene sets. In further investigations, it was determined that the discovered set of co-driver genes could be used to generate prognostic classifications, which could be biologically significant and provide complementary information to the cancer genome. 
Conclusion: Our approach is effective in identifying sets of cancer cooperation driver genes, and the results can be used as clinical markers to stratify patients.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140582893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrated Somatic Mutation Network Diffusion Model for Stratification of Breast Cancer into Different Metabolic Mutation Subtypes","authors":"Dongqing Su, Honghao Li, Tao Wang, Min Zou, Haodong Wei, Yuqiang Xiong, Hongmei Sun, Shiyuan Wang, Qilemuge Xi, Yongchun Zuo, Lei Yang","doi":"10.2174/0115748936298012240322091111","DOIUrl":"https://doi.org/10.2174/0115748936298012240322091111","url":null,"abstract":"Background: Mutations in metabolism-related genes in somatic cells potentially lead to disruption of metabolic pathways, which results in patients exhibiting different molecular and pathological features. Objective: In this study, we focused on somatic mutation data to investigate the significance of metabolic mutation typing in guiding the prognosis and treatment of breast cancer patients. Methods: The somatic mutation profile of breast cancer patients was analyzed and smoothed by utilizing a network diffusion model within the protein-protein interaction network to construct a comprehensive somatic mutation network diffusion profile. Subsequently, a deep clustering approach was employed to explore metabolic mutation typing in breast cancer based on integrated metabolic pathway information and the somatic mutation network diffusion profile. In addition, we employed deep neural networks and machine learning prediction models to assess the feasibility of predicting drug responses through somatic mutation network diffusion profiles. 
Results: Significant differences in prognosis and metabolic heterogeneity were observed among the different metabolic mutation subtypes, characterized by distinct alterations in metabolic pathways and genetic mutations, and these mutational features offered potential targets for subtype-specific therapies. Furthermore, there was a strong consistency between the results of the drug response prediction model constructed on the somatic mutation network diffusion profile and the actual observed drug responses. Conclusion: Metabolic mutation typing of cancer assists in guiding patient prognosis and treatment.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"33 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140582916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GB5mCPred: Cross-species 5mc Site Predictor Based on Bootstrap-based Stochastic Gradient Boosting Method for Poaceae","authors":"Dipro Sinha, Tanwy Dasmandal, Md Yeasin, D.C Mishra, Anil Rai, Sunil Archak","doi":"10.2174/0115748936285544231221113226","DOIUrl":"https://doi.org/10.2174/0115748936285544231221113226","url":null,"abstract":"Background: One of the most prevalent epigenetic alterations in all three kingdoms of life is 5mC, which plays a part in a wide range of biological functions. Although in-vitro techniques are more effective in detecting epigenetic alterations, they are time and money-intensive. Artificial intelligence-based in silico approaches have been used to overcome these obstacles. Aim: This study aimed to develop an ML-based predictor for the detection of 5mC sites in Poaceae. Objective: The objective of this study was the evaluation of machine learning and deep learning models for the prediction of 5mC sites in rice. Method: In this study, the vectorization of DNA sequences has been performed using three distinct feature sets- Oligo Nucleotide Frequencies (k = 2), Mono-nucleotide Binary Encoding, and Chemical Properties of Nucleotides. Two deep learning models, long short-term memory (LSTM) and Bidirectional LSTM (Bi-LSTM), as well as nine machine learning models, including random forest, gradient boosting, naïve bayes, regression tree, k-Nearest neighbour, support vector machine, adaboost, multiple logistic regression, and artificial neural network, were investigated. 
Also, bootstrap resampling was used to build more efficient models along with a hybrid feature selection module for dimensional reduction and removal of irrelevant features of the vector space. Result: Random Forest gains the maximum accuracy, specificity and MCC, i.e., 92.6%, 86.41% and 0.84. Gradient Boosting obtained the maximum sensitivity, i.e., 96.85%. The Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) technique showed that the best three models were Random Forest, Gradient Boosting, and Support Vector Machine in terms of accurate prediction of 5mC sites in rice. We developed an R-package, ‘GB5mCPred,’ and it is available in CRAN (https://cran.r-project.org/web/packages/GB5mcPred/index.html). Also, a user-friendly prediction server was made based on this algorithm (http://cabgrid.res.in:5474/). Conclusion: With nearly equal TOPSIS scores, Random Forest, Gradient Boosting, and Support Vector Machine ended up being the best three models. The major rationale may be found in their architectural design since they are gradual learning models that can capture the 5mC sites more correctly than other learning models.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"17 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140583275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DiffSeqMol: A Non-Autoregressive Diffusion-Based Approach for Molecular Sequence Generation and Optimization","authors":"Zixu Wang, Yangyang Chen, Xiulan Guo, Yayang Li, Pengyong Li, Chunyan Li, Xiucai Ye, Tetsuya Sakurai","doi":"10.2174/0115748936285493240307071916","DOIUrl":"https://doi.org/10.2174/0115748936285493240307071916","url":null,"abstract":"Background: The application of deep generative models for molecular discovery has witnessed a significant surge in recent years. Currently, the field of molecular generation and molecular optimization is predominantly governed by autoregressive models regardless of how molecular data is represented. However, an emerging paradigm in the generation domain is diffusion models, which treat data non-autoregressively and have achieved significant breakthroughs in areas such as image generation. Methods: The potential and capability of diffusion models in molecular generation and optimization tasks remain largely unexplored. In order to investigate the potential applicability of diffusion models in the domain of molecular exploration, we proposed DiffSeqMol, a molecular sequence generation model underpinned by a diffusion process. Results & Discussion: DiffSeqMol distinguishes itself from traditional autoregressive methods by its capacity to draw samples from random noise and directly generate the entire molecule. Through experimental evaluations, we demonstrated that DiffSeqMol can match, and even surpass, the performance of established state-of-the-art models on unconditional generation tasks and molecular optimization tasks. Conclusion: Taken together, our results show that DiffSeqMol can be considered a promising molecular generation method. 
It opens new pathways to traverse the expansive chemical space and to discover novel molecules.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"509 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140582953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Research on the Mechanism of Traditional Chinese Medicine Treatment for Diseases caused by Human Coronavirus COVID-19","authors":"Xian-Fang Wang, Chong-Yang Ma, Zhi-Yong Du, Yi-Feng Liu, Shao-Hui Ma, Sang Yu, Rui-xia Jin, Dong-qing Wei","doi":"10.2174/0115748936292599240308102616","DOIUrl":"https://doi.org/10.2174/0115748936292599240308102616","url":null,"abstract":"Background: Human coronaviruses are a large group of viruses that exist widely in nature and multiply through self-replication. Due to their suddenness and variability, they pose a great threat to global human health and are a major problem currently faced by the medical and health fields. Objective: COVID-19 is the seventh known coronavirus that can infect humans. The main purpose of this paper is to analyze the effective components and action targets of the Longyi-Zhengqi Formula and Lianhua-Qingwen Formula, study their mechanism of action in the treatment of new coronavirus pneumonia, compare the similarities and differences of their pharmacological effects, and obtain the pharmacodynamic mechanism of the two traditional Chinese medicine compounds. Method: The effective ingredients and targets of Longyi-Zhengqi Formula and Lianhua-Qingwen Formula were obtained from ETCM (Encyclopedia of Traditional Chinese Medicine) and other traditional Chinese medicine databases, the GeneCards database was used to obtain the relevant targets of COVID-19, and Cytoscape software was used to build the component COVID-19 target network of Longyi-Zhengqi Formula and the component COVID-19 target network of Lianhua-Qingwen Formula. STRING was used to construct a protein interaction network and screen key targets. 
GO (Gene Ontology) was used for enrichment analysis and KEGG (Kyoto Encyclopedia of Genes and Genomes) was used for pathway analysis to find the targets and pathways related to the treatment of COVID-19. Results: In the GO enrichment analysis results, there are 106 biological processes, 31 cell localization and 28 molecular functions of the intersection PPI network targets of Longyi-Zhengqi Formula-COVID-19, and 224 biological processes, 51 cell localization and 55 molecular functions of the intersection PPI network targets of Lianhua-Qingwen Formula-COVID-19. In the KEGG pathway analysis results, the number of targets of Longyi-Zhengqi Formula on the COVID-19 pathway is 7, and the number of targets of Lianhua-Qingwen Formula on the COVID-19 pathway is 19. In the regulation analysis results, Longyi-Zhengqi Formula achieves the effect of treating COVID-19 by regulating IL-6, and Lianhua-Qingwen Formula achieves the effect of treating pneumonia by regulating TLR4. Conclusion: This paper explores the mechanism of action of Longyi-Zhengqi Formula and Lianhua-Qingwen Formula in treating COVID-19 based on the method of network pharmacology, and provides a theoretical basis for traditional Chinese medicine to treat sudden diseases caused by human coronavirus in terms of drug targets and disease interactions. It has certain practical significance.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"15 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140582836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Machine-learning Model to Classify Schizophrenia Using Methylation Data Based on Gene Expression","authors":"Karthikeyan A. Vijayakumar, Gwang-Won Cho","doi":"10.2174/0115748936293407240222113019","DOIUrl":"https://doi.org/10.2174/0115748936293407240222113019","url":null,"abstract":"Introduction: The recent advancement in artificial intelligence has compelled medical research to adopt these technologies. The abundance of molecular data and AI technology has helped in explaining various diseases, even cancers. Schizophrenia is a complex neuropsychological disease whose etiology is unknown. Several gene-wide association studies attempted to narrow down the cause of the disease but did not successfully point out the mechanism behind the disease. There are studies regarding the epigenetic changes in the schizophrenia disease condition, and a classification machine-learning model has been trained using the blood methylation data. Method: In this study, we have demonstrated a novel approach to elucidating the molecular cause of the disease. We used a two-step machine-learning approach to determine the causal molecular markers. By doing so, we developed classification models using both gene expression microarray and methylation microarray data. Result: Our models, because of our novel approach, achieved good classification accuracy with the available data size. We analyzed the important features, and they add up as evidence for the glutamate hypothesis of schizophrenia. 
Conclusion: In this way, we have demonstrated explaining a disease through machine learning models.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"29 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2024-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140105243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Extended Feature Representation Technique for Predicting Sequenced-based Host-pathogen Protein-protein Interaction","authors":"Jerry Emmanuel, Itunuoluwa Isewon, Grace Olasehinde, Jelili Oyelade","doi":"10.2174/0115748936286848240108074303","DOIUrl":"https://doi.org/10.2174/0115748936286848240108074303","url":null,"abstract":"Background: The use of machine learning models in sequence-based Protein-Protein Interaction prediction typically requires the conversion of amino acid sequences into feature vectors. From the literature, two approaches have been used to achieve this transformation. These are referred to as the Independent Protein Feature (IPF) and Merged Protein Feature (MPF) extraction methods. As observed, studies have predominantly adopted the IPF approach, while others preferred the MPF method, in which host and pathogen sequences are concatenated before feature encoding. Objective: This presents the challenge of determining which approach should be adopted for improved HPPPI prediction. Therefore, this work introduces the Extended Protein Feature (EPF) method. Methods: The proposed method combines the predictive capabilities of IPF and MPF, extracting essential features, handling multicollinearity, and removing features with zero importance. EPF, IPF, and MPF were tested using bacteria, parasite, virus, and plant HPPPI datasets and were deployed to machine learning models, including Random Forest (RF), Support Vector Machine (SVM), Multilayer Perceptron (MLP), Naïve Bayes (NB), Logistic Regression (LR), and Deep Forest (DF). Results: The results indicated that MPF exhibited the lowest performance overall, whereas IPF performed better with decision tree-based models, such as RF and DF. In contrast, EPF demonstrated improved performance with SVM, LR, NB, and MLP and also yielded competitive results with DF and RF. 
Conclusion: In conclusion, the EPF approach developed in this study exhibits substantial improvements in four out of the six models evaluated. This suggests that EPF offers competitiveness with IPF and is particularly well-suited for traditional machine learning models.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"285 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2024-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140105363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}