Alba Nogueira-Rodríguez, Daniel Glez-Peña, Cristina P Vieira, Jorge Vieira, Hugo López-Fernández
{"title":"Towards a more accurate and reliable evaluation of machine learning protein-protein interaction prediction model performance in the presence of unavoidable dataset biases.","authors":"Alba Nogueira-Rodríguez, Daniel Glez-Peña, Cristina P Vieira, Jorge Vieira, Hugo López-Fernández","doi":"10.1515/jib-2024-0054","DOIUrl":"https://doi.org/10.1515/jib-2024-0054","url":null,"abstract":"<p><p>The characterization of protein-protein interactions (PPIs) is fundamental to understanding cellular functions. Although machine learning methods for this task have historically reported prediction accuracies up to 95 %, including those only using raw protein sequences, it has been highlighted that this could be overestimated due to the use of random splits and metrics that do not take into account potential biases in the datasets. Here, we propose a per-protein utility metric, pp_MCC, able to show a drop in the performance in both random and unseen-protein split scenarios. We tested ML models based on sequence embeddings. The pp_MCC metric evidences a reduced performance even in a random split, reaching levels similar to those shown by the raw MCC metric computed over an unseen protein split, and drops even further when the pp_MCC is used in an unseen protein split scenario. Thus, the metric is able to give a more realistic performance estimation while allowing the use of random splits, which could be interesting for more protein-centric studies. 
Given the low adjusted performance obtained, there seems to be room for improvement when using only primary sequence information, suggesting the need to include complementary protein data, accompanied by the use of the pp_MCC metric.</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143754930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daniel Pérez-Rodríguez, Roberto C Agís-Balboa, Hugo López-Fernández
{"title":"Fcodes update: a kinship encoding framework with F-Tree GUI & LLM inference.","authors":"Daniel Pérez-Rodríguez, Roberto C Agís-Balboa, Hugo López-Fernández","doi":"10.1515/jib-2024-0046","DOIUrl":"https://doi.org/10.1515/jib-2024-0046","url":null,"abstract":"<p><p>Family structures play a crucial role in personal development, social dynamics, and mental health. Traditional systems for encoding genealogical data, such as Ahnentafel and the Register System, offer methods to document lineage but face limitations, particularly in accommodating horizontal relationships or handling changes in family datasets. Modern computational systems like LINKAGE and PED, while powerful for genetic analysis, lack human readability and are challenging to apply in fields where unstructured, narrative data is common, such as sociology or psychiatry. This paper aims to bridge this gap by enhancing Fcodes, a flexible and intuitive algorithm for encoding kinship relationships that is suited for both manual and computational use. Building on our previous work, we present improvements to the Fcodes core algorithm and command-line interface (CLI), as well as the development of F-Tree, a new graphical user interface (GUI) to streamline the encoding process. Additionally, we introduce a method for estimating the coefficient of inbreeding using Fcodes and explore the application of artificial intelligence, namely large language models (LLMs), to automatically infer family relationships from narrative text. 
These advancements highlight the potential of Fcodes in a wide range of research contexts, from social studies to genetics and mental health research.</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143744460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
João Capela, João Cheixo, Dick de Ridder, Oscar Dias, Miguel Rocha
{"title":"Predicting precursors of plant specialized metabolites using DeepMol automated machine learning.","authors":"João Capela, João Cheixo, Dick de Ridder, Oscar Dias, Miguel Rocha","doi":"10.1515/jib-2024-0050","DOIUrl":"https://doi.org/10.1515/jib-2024-0050","url":null,"abstract":"<p><p>Plants produce specialized metabolites, which play critical roles in defending against biotic and abiotic stresses. Due to their chemical diversity and bioactivity, these compounds have significant economic implications, particularly in the pharmaceutical and agrotechnology sectors. Despite their importance, the biosynthetic pathways of these metabolites remain largely unresolved. Automating the prediction of their precursors, derived from primary metabolism, is essential for accelerating pathway discovery. Using DeepMol's automated machine learning engine, we found that regularized linear classifiers offer optimal, accurate, and interpretable models for this task, outperforming state-of-the-art models while providing chemical insights into their predictions. The pipeline and models are available at the repository: https://github.com/jcapels/SMPrecursorPredictor.</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143658772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zahra Mosalanejad, Seyed Nooreddin Faraji, Mohammad Reza Rahbar, Ahmad Gholami
{"title":"Designing an optimized theta-defensin peptide for HIV therapy using in-silico approaches.","authors":"Zahra Mosalanejad, Seyed Nooreddin Faraji, Mohammad Reza Rahbar, Ahmad Gholami","doi":"10.1515/jib-2023-0053","DOIUrl":"https://doi.org/10.1515/jib-2023-0053","url":null,"abstract":"<p><p>Glycoprotein 41 (gp41) of human immunodeficiency virus (HIV), located on the virus's external surface, forms six-helix bundles that facilitate viral entry into the host cell. Theta defensins, cyclic peptides, inhibit the formation of these bundles by binding to the gp41 CHR region. RC101, a synthetic analog of theta-defensin molecules, exhibits activity against various HIV subtypes. Molecular docking of the CHR and RC101 was done using MDockPeP and Hawdock server. The type of bonds and the essential amino acids in binding were identified using AlphaFold3, CHIMERA, RING, and CYTOSCAPE. Mutable amino acids within the peptide were determined using CUPSAT and Duet. Thirty-two new peptides were designed, and their interaction with the CHR of gp41 was analyzed. The physicochemical properties, toxicity, allergenicity, and antigenicity of peptides were also investigated. Most of the designed peptides exhibited higher binding affinities to the target compared to RC101; notably, peptides 1 and 4 had the highest binding affinity and demonstrated a greater percentage of interactions with critical amino acids of CHR. Peptides A and E displayed the best physicochemical properties among designed peptides. 
The designed peptides may present a new generation of anti-HIV drugs, which may reduce the likelihood of drug resistance.</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143651943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Irvan Faizal, Darrian Chandra, Tarwadi, Sabar Pambudi, Astutiati Nurhasanah, Rizky Priambodo, Muhammad Yusuf
{"title":"Immunoinformatics-guided design of a multiepitope peptide vaccine targeting the receptor-binding domain of SARS-CoV-2 spike glycoprotein: insights from Indonesian samples.","authors":"Irvan Faizal, Darrian Chandra, Tarwadi, Sabar Pambudi, Astutiati Nurhasanah, Rizky Priambodo, Muhammad Yusuf","doi":"10.1515/jib-2024-0025","DOIUrl":"https://doi.org/10.1515/jib-2024-0025","url":null,"abstract":"<p><p>The emergence of new variants of SARS-CoV-2, including Alpha, Beta, Gamma, Delta, Omicron variants, and XBB sub-variants, contributes to the number of coronavirus cases worldwide. SARS-CoV-2 is a positive RNA virus with a genome of 29.9 kb that encodes four structural proteins: spike glycoprotein (S), envelope glycoprotein (E), membrane glycoprotein (M), and nucleocapsid glycoprotein (N). These proteins are vital for viral activity, with the S protein facilitating attachment and membrane fusion through the receptor-binding domain (RBD) located in the S1 subunit. The RBD recognizes and binds to the human angiotensin-converting enzyme 2 (ACE-2) protein. An immunoinformatic-aided design of a peptide-based multiepitope vaccine candidate targeting the RBD glycoprotein is constructed from the SARS-CoV-2 sequence database from various regions of Indonesia (Jakarta, West Java, and Bali). The results show that the RBD region of the isolate with accession ID EPI_ISL_15982641 from West Java had the highest antigenicity of 0.5904. This isolate is non-toxic and non-allergenic and shows a total of 18 LBL epitopes, 72 CTL epitopes, and 98 HTL epitopes. The epitopes with the best overall binding affinity were GCHNKCAY for MHC-I and GGCVFSYVGCHNKCAYWV for MHC-II, which show binding affinities of -13.6 and -15.5 kcal/mol, respectively. 
Therefore, this study aims to design an epitope vaccine candidate based on samples from Indonesia that has good characteristics and may have the potential to stimulate an immune response against SARS-CoV-2.</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142973272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohammad Javad Bazyari, Seyed Hamid Aghaee-Bakhtiari
{"title":"MiRNA target enrichment analysis of co-expression network modules reveals important miRNAs and their roles in breast cancer progression.","authors":"Mohammad Javad Bazyari, Seyed Hamid Aghaee-Bakhtiari","doi":"10.1515/jib-2022-0036","DOIUrl":"10.1515/jib-2022-0036","url":null,"abstract":"<p><p>Breast cancer has the highest incidence and is the fifth leading cause of death among cancers. Progression is one of the important features of breast cancer which makes it a life-threatening cancer. MicroRNAs are small RNA molecules that have pivotal roles in the regulation of gene expression and they control different properties in breast cancer such as progression. Recently, systems biology has offered novel approaches to study complicated biological systems like miRNAs to find their regulatory roles. One of these approaches is the analysis of weighted co-expression networks, in which genes with similar expression patterns are considered as a single module. Because the genes in one module have similar expression, it is rational to think the same regulatory elements such as miRNAs control their expression. Herein, we use WGCNA to find important modules related to breast cancer progression and use the hypergeometric test to perform miRNA target enrichment analysis and find important miRNAs. Also, we use negative correlation between miRNA expression and modules as a second filter to ensure that the right candidate miRNAs are chosen for the important modules. 
We found hsa-mir-23b, hsa-let-7b and hsa-mir-30a are important miRNAs in breast cancer and also investigated their roles in breast cancer progression.</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11698623/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142883636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the therapeutic potential of <i>Asparagus africanus</i> in polycystic ovarian syndrome: a computational analysis.","authors":"Sania Riaz, Fatima Haider, Rizwan- Ur-Rehman, Aqsa Zafar","doi":"10.1515/jib-2024-0019","DOIUrl":"10.1515/jib-2024-0019","url":null,"abstract":"<p><p>PCOS is a multifaceted condition characterized by ovarian abnormalities, metabolic disorders, anovulation, and hormonal imbalances. In response to the growing demand for treatments with fewer side effects, the exploration of herbal-origin drugs has gained prominence. <i>Asparagus africanus</i>, a traditional medicinal plant that exhibits anti-inflammatory, antioxidant, and anti-androgenic properties, may hold a cure for PCOS. The plant's rich biochemical profile has prompted its exploration as a potential source for drug development. The aim of this study is to investigate the potential therapeutic efficacy of <i>A. africanus</i> in the management of PCOS through molecular docking studies with Luteinizing Hormone Receptor and Follicle-Stimulating Hormone Receptor proteins. The identified compounds underwent molecular docking against key proteins associated with PCOS, namely Luteinizing Hormone Receptor and Follicle-Stimulating Hormone Receptor. The results underscored the lead compound's superiority, demonstrating favorable pharmacokinetics, ADME characteristics, and strong molecular binding without any observed toxicity in comparison to the standard drug. This study, by leveraging natural compounds sourced from <i>A. africanus</i>, provides valuable insights and advances towards developing more effective and safer treatments for PCOS. 
The findings contribute to the evolving landscape of PCOS therapeutics, emphasizing the potential of herbal-origin drugs in mitigating the complexities of this syndrome.</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11698622/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142808595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gabor Balogh, Natasha Jorge, Célia Dupain, Maud Kamal, Nicolas Servant, Christophe Le Tourneau, Peter F Stadler, Stephan H Bernhart
{"title":"TREMSUCS-TCGA - an integrated workflow for the identification of biomarkers for treatment success.","authors":"Gabor Balogh, Natasha Jorge, Célia Dupain, Maud Kamal, Nicolas Servant, Christophe Le Tourneau, Peter F Stadler, Stephan H Bernhart","doi":"10.1515/jib-2024-0031","DOIUrl":"10.1515/jib-2024-0031","url":null,"abstract":"<p><p>Many publicly available databases provide disease-related data, which makes it possible to link genomic data to medical and meta-data. The Cancer Genome Atlas (TCGA), for example, compiles tens of thousands of datasets covering a wide array of cancer types. Here we introduce an interactive and highly automatized TCGA-based workflow that links and analyses epigenomic and transcriptomic data with treatment and survival data in order to identify possible biomarkers that indicate treatment success. TREMSUCS-TCGA is flexible with respect to type of cancer and treatment and provides standard methods for differential expression analysis or DMR detection. Furthermore, it makes it possible to examine several cancer types together in a pan-cancer type approach. Parallelisation and reproducibility of all steps are ensured with the workflow management system Snakemake. TREMSUCS-TCGA produces a comprehensive single report file which holds all relevant results in descriptive and tabular form that can be explored in an interactive manner. As a showcase application we describe a comprehensive analysis of the available data for the combination of patients with squamous cell carcinomas of head and neck, cervix and lung treated with cisplatin, carboplatin and the combination of carboplatin and paclitaxel. 
The best ranked biomarker candidates are discussed in the light of the existing literature, indicating plausible causal relationships to the relevant cancer entities.</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11698617/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142803005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ion channel classification through machine learning and protein language model embeddings.","authors":"Hamed Ghazikhani, Gregory Butler","doi":"10.1515/jib-2023-0047","DOIUrl":"10.1515/jib-2023-0047","url":null,"abstract":"<p><p>Ion channels are critical membrane proteins that regulate ion flux across cellular membranes, influencing numerous biological functions. The resource-intensive nature of traditional wet lab experiments for ion channel identification has led to an increasing emphasis on computational techniques. This study extends our previous work on protein language models for ion channel prediction, significantly advancing the methodology and performance. We employ a comprehensive array of machine learning algorithms, including k-Nearest Neighbors, Random Forest, Support Vector Machines, and Feed-Forward Neural Networks, alongside a novel Convolutional Neural Network (CNN) approach. These methods leverage fine-tuned embeddings from ProtBERT, ProtBERT-BFD, and MembraneBERT to differentiate ion channels from non-ion channels. Our empirical findings demonstrate that TooT-BERT-CNN-C, which combines features from ProtBERT-BFD and a CNN, substantially surpasses existing benchmarks. On our original dataset, it achieves a Matthews Correlation Coefficient (MCC) of 0.8584 and an accuracy of 98.35 %. More impressively, on a newly curated, larger dataset (DS-Cv2), it attains an MCC of 0.9492 and an ROC AUC of 0.9968 on the independent test set. These results not only highlight the power of integrating protein language models with deep learning for ion channel classification but also underscore the importance of using up-to-date, comprehensive datasets in bioinformatics tasks. 
Our approach represents a significant advancement in computational methods for ion channel identification, with potential implications for accelerating research in ion channel biology and aiding drug discovery efforts.</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11698620/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142689725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jorge García Brizuela, Carsten Scharfenberg, Carmen Scheuner, Florian Hoedt, Patrick König, Angela Kranz, Antonia Leidel, Daniel Martini, Gabriel Schneider, Julian Schneider, Lea Sophie Singson, Harald von Waldow, Nils Wehrmeyer, Björn Usadel, Stephan Lesch, Xenia Specka, Matthias Lange, Daniel Arend
{"title":"A roadmap for a middleware as a federation service for integrative data retrieval of agricultural data.","authors":"Jorge García Brizuela, Carsten Scharfenberg, Carmen Scheuner, Florian Hoedt, Patrick König, Angela Kranz, Antonia Leidel, Daniel Martini, Gabriel Schneider, Julian Schneider, Lea Sophie Singson, Harald von Waldow, Nils Wehrmeyer, Björn Usadel, Stephan Lesch, Xenia Specka, Matthias Lange, Daniel Arend","doi":"10.1515/jib-2024-0027","DOIUrl":"10.1515/jib-2024-0027","url":null,"abstract":"<p><p>Agriculture is confronted with several challenges such as climate change, the loss of biodiversity and stagnating productivity. The massively increasing amount of data and new digital technologies promise to overcome them, but they necessitate careful data integration and data management to make them usable. The FAIRagro consortium is part of the National Research Data Infrastructure (NFDI) in Germany and will develop FAIR-compliant infrastructure services for the agrosystems science community, which will be integrated in the existing research data infrastructure service landscape. Here we present the initial steps of designing and implementing the FAIRagro middleware infrastructure to connect existing data infrastructures. The middleware will feature services for seamless data integration across diverse infrastructures. 
Data and metadata are streamlined for research in agrosystems science by downstream processing in the central FAIRagro Search and Inventory Portal and the data integration and analysis workflow system \"SciWIn\".</p>","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11602230/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142585141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}