Bioinformatics advances; Pub Date: 2024-10-21; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae159
Mark Ziemann, Barry Schroeter, Anusuiya Bora
{"title":"Two subtle problems with overrepresentation analysis.","authors":"Mark Ziemann, Barry Schroeter, Anusuiya Bora","doi":"10.1093/bioadv/vbae159","DOIUrl":"10.1093/bioadv/vbae159","url":null,"abstract":"<p><strong>Motivation: </strong>Overrepresentation analysis (ORA) is used widely to assess the enrichment of functional categories in a gene list compared to a background list. ORA is therefore a critical method in the interpretation of 'omics data, relating gene lists to biological functions and themes. Although ORA is hugely popular, we and others have noticed two potentially undesired behaviours of some ORA tools. The first one we call the 'background problem', because it involves the software eliminating large numbers of genes from the background list if they are not annotated as belonging to any category. The second one we call the 'false discovery rate problem', because some tools underestimate the true number of parallel tests conducted.</p><p><strong>Results: </strong>Here, we demonstrate the impact of these issues on several real RNA-seq datasets and use simulated RNA-seq data to quantify the impact of these problems. We show that the severity of these problems depends on the gene set library, the number of genes in the list, and the degree of noise in the dataset. 
These problems can be mitigated by changing packages/websites for ORA or by changing to another approach such as functional class scoring.</p><p><strong>Availability and implementation: </strong>An R/Shiny tool has been provided at https://oratool.ziemann-lab.net/ and the supporting materials are available from Zenodo (https://zenodo.org/records/13823301).</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae159"},"PeriodicalIF":2.4,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11557902/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142633762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
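The two issues named in the abstract above can be illustrated with a small numerical sketch. It is not the authors' code and does not reproduce any particular ORA tool: the function names, gene counts, and p-values are hypothetical, chosen only to show how the choice of background list and the number of tests assumed in the correction change the result. The Benjamini-Hochberg helper assumes that any test not reported has p = 1.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """Return P(X >= k) when N genes are drawn from M, of which n are in the set."""
    total = comb(M, N)
    upper = min(n, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def bh_adjust(pvalues, n_tests=None):
    """Benjamini-Hochberg adjusted p-values.

    n_tests is the number of tests corrected for; it defaults to the number
    of p-values supplied. Unreported tests are assumed to have p = 1.
    """
    m = len(pvalues) if n_tests is None else n_tests
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * len(pvalues)
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = len(pvalues) - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical experiment: 15000 detected genes, 10000 of them annotated to
# at least one category; a list of 300 genes, 250 of them annotated; one
# gene set of 100 genes that shares 6 genes with the list.
SET_SIZE = 100
OVERLAP = 6
p_full = hypergeom_sf(OVERLAP, 15000, SET_SIZE, 300)     # all detected genes
p_reduced = hypergeom_sf(OVERLAP, 10000, SET_SIZE, 250)  # annotated genes only

# Hypothetical raw p-values for three reported sets out of 1000 sets tested.
reported = [0.002, 0.01, 0.04]
adj_reported_only = bh_adjust(reported)            # corrects for 3 tests
adj_all_tests = bh_adjust(reported, n_tests=1000)  # corrects for 1000 tests

print("background:", p_full, p_reduced)
print("FDR:", adj_reported_only, adj_all_tests)
```

In this made-up example, dropping unannotated genes changes the enrichment p-value for the same overlap, and correcting for only the reported sets yields much smaller adjusted p-values than correcting for every set tested. The direction and size of both effects depend on the dataset and the gene set library.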
Bioinformatics advances; Pub Date: 2024-10-18; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae156
Na Zhao, David L Bennett, Georgios Baskozos, Allison M Barry
{"title":"Predicting 'pain genes': multi-modal data integration using probabilistic classifiers and interaction networks.","authors":"Na Zhao, David L Bennett, Georgios Baskozos, Allison M Barry","doi":"10.1093/bioadv/vbae156","DOIUrl":"https://doi.org/10.1093/bioadv/vbae156","url":null,"abstract":"<p><strong>Motivation: </strong>Accurate identification of pain-related genes remains challenging due to the complex nature of pain pathophysiology and the subjective nature of pain reporting in humans. Here, we use machine learning to identify possible 'pain genes'. Labelling was based on a gold-standard list with validated involvement across pain conditions, and models were trained on a selection of -omics, protein-protein interaction network features, and biological function readouts for each gene.</p><p><strong>Results: </strong>The top-performing model was selected to predict a 'pain score' per gene. The top-ranked genes were then validated against pain-related human SNPs. Functional analysis revealed the JAK2/STAT3, ErbB, and Rap1 signalling pathways as promising targets for further exploration, while network topological features contribute significantly to the identification of 'pain' genes. As such, a network based on top-ranked genes was constructed to reveal previously uncharacterized pain-related genes. 
Together, these novel insights into pain pathogenesis can indicate promising directions for future experimental research.</p><p><strong>Availability and implementation: </strong>These analyses can be further explored using the linked open-source database at https://livedataoxford.shinyapps.io/drg-directory/, which is accompanied by a freely accessible code template and user guide for wider adoption across disciplines.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae156"},"PeriodicalIF":2.4,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549022/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142633759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances; Pub Date: 2024-10-14; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae154
Dea Gogishvili, Emmanuel Minois-Genin, Jan van Eck, Sanne Abeln
{"title":"PatchProt: hydrophobic patch prediction using protein foundation models.","authors":"Dea Gogishvili, Emmanuel Minois-Genin, Jan van Eck, Sanne Abeln","doi":"10.1093/bioadv/vbae154","DOIUrl":"10.1093/bioadv/vbae154","url":null,"abstract":"<p><strong>Motivation: </strong>Hydrophobic patches on protein surfaces play important functional roles in protein-protein and protein-ligand interactions. Large hydrophobic surfaces are also involved in the progression of aggregation diseases. Predicting exposed hydrophobic patches from a protein sequence has proven to be a difficult task. Fine-tuning foundation models allows for adapting a model to the specific nuances of a new task using a much smaller dataset. Additionally, multitask deep learning offers a promising solution for addressing data gaps, simultaneously outperforming single-task methods.</p><p><strong>Results: </strong>In this study, we harnessed a recently released leading large language model, Evolutionary Scale Models (ESM-2). Efficient fine-tuning of ESM-2 was achieved by leveraging a recently developed parameter-efficient fine-tuning method. This approach enabled comprehensive training of model layers without excessive parameters and without the need to include a computationally expensive multiple sequence analysis. We explored several related tasks, at local (residue) and global (protein) levels, to improve the representation of the model. As a result, our model, PatchProt, can not only predict hydrophobic patch areas but also outperforms existing methods at predicting primary tasks, including secondary structure and surface accessibility predictions. Importantly, our analysis shows that including related local tasks can improve predictions on more difficult global tasks. 
This research sets a new standard for sequence-based protein property prediction and highlights the remarkable potential of fine-tuning foundation models and enriching the model representation by training over related tasks.</p><p><strong>Availability and implementation: </strong>https://github.com/Deagogishvili/chapter-multi-task.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae154"},"PeriodicalIF":2.4,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11525051/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142559614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances; Pub Date: 2024-10-11; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae153
Greta Bellinzona, Davide Sassera, Alexandre M J J Bonvin
{"title":"Accelerating protein-protein interaction screens with reduced AlphaFold-Multimer sampling.","authors":"Greta Bellinzona, Davide Sassera, Alexandre M J J Bonvin","doi":"10.1093/bioadv/vbae153","DOIUrl":"10.1093/bioadv/vbae153","url":null,"abstract":"<p><strong>Motivation: </strong>Discovering new protein-protein interactions (PPIs) across entire proteomes offers vast potential for understanding novel protein functions and elucidating system properties within or between organisms. While recent advances in computational structural biology, particularly AlphaFold-Multimer, have facilitated this task, scaling to large-scale screens remains a challenge, requiring significant computational resources.</p><p><strong>Results: </strong>We evaluated the impact of reducing the number of models generated by AlphaFold-Multimer from five to one on the method's ability to distinguish true PPIs from false ones. Our evaluation was conducted on a dataset containing both intra- and inter-species PPIs, which included proteins from bacterial and eukaryotic sources. We demonstrate that reducing the sampling does not compromise the accuracy of the method, offering a faster, more efficient, and environmentally friendly solution for PPI predictions.</p><p><strong>Availability and implementation: </strong>The code used in this article is available at https://github.com/MIDIfactory/AlphaFastPPi. 
Note that the same can be achieved using the latest version of AlphaPulldown available at https://github.com/KosinskiLab/AlphaPulldown.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae153"},"PeriodicalIF":2.4,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11513016/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142513907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances; Pub Date: 2024-10-08; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae151
Heming Zhang, Dekang Cao, Zirui Chen, Xiuyuan Zhang, Yixin Chen, Cole Sessions, Carlos Cruchaga, Philip Payne, Guangfu Li, Michael Province, Fuhai Li
{"title":"mosGraphGen: a novel tool to generate multi-omics signaling graphs to facilitate integrative and interpretable graph AI model development.","authors":"Heming Zhang, Dekang Cao, Zirui Chen, Xiuyuan Zhang, Yixin Chen, Cole Sessions, Carlos Cruchaga, Philip Payne, Guangfu Li, Michael Province, Fuhai Li","doi":"10.1093/bioadv/vbae151","DOIUrl":"10.1093/bioadv/vbae151","url":null,"abstract":"<p><strong>Motivation: </strong>Multi-omics data (i.e. genomics, epigenomics, transcriptomics, and proteomics) characterize complex cellular signaling systems at multiple levels and from multiple views, and provide a holistic view of complex cellular signaling pathways. However, it remains challenging to integrate and interpret multi-omics data for mining critical biomarkers. Graph AI models have been widely used to analyze graph-structured datasets, and are ideal for integrative multi-omics data analysis because they can naturally integrate and represent multi-omics data as a biologically meaningful multi-level signaling graph and interpret multi-omics data via graph node and edge ranking analysis. Nevertheless, it is nontrivial for graph-AI model developers to pre-analyze multi-omics data and convert the data into biologically meaningful graphs, which can be directly fed into graph-AI models.</p><p><strong>Results: </strong>To resolve this challenge, we developed mosGraphGen (multi-omics signaling graph generator), which generates multi-omics signaling graphs (mos-graphs) of individual samples by mapping multi-omics data onto a biologically meaningful multi-level background signaling network, with data normalization by aggregating measurements and aligning to the reference genome. With mosGraphGen, AI model developers can directly apply and evaluate their models using these mos-graphs. 
mosGraphGen is demonstrated on two widely used multi-omics datasets of The Cancer Genome Atlas (TCGA) and Alzheimer's disease (AD) samples.</p><p><strong>Availability and implementation: </strong>The code of mosGraphGen is open-source and publicly available via GitHub: https://github.com/FuhaiLiAiLab/mosGraphGen.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae151"},"PeriodicalIF":2.4,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11540438/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142592400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances; Pub Date: 2024-10-08; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae150
Maria Tarradas-Alemany, Sandra Martínez-Puchol, Cristina Mejías-Molina, Marta Itarte, Marta Rusiñol, Sílvia Bofill-Mas, Josep F Abril
{"title":"CAPTVRED: an automated pipeline for viral tracking and discovery from capture-based metagenomics samples.","authors":"Maria Tarradas-Alemany, Sandra Martínez-Puchol, Cristina Mejías-Molina, Marta Itarte, Marta Rusiñol, Sílvia Bofill-Mas, Josep F Abril","doi":"10.1093/bioadv/vbae150","DOIUrl":"https://doi.org/10.1093/bioadv/vbae150","url":null,"abstract":"<p><strong>Summary: </strong>Target Enrichment Sequencing or Capture-based metagenomics has emerged as an approach of interest for viral metagenomics in complex samples. However, these datasets are usually analyzed with standard downstream Bioinformatics analyses. CAPTVRED (<i>Capture-based metagenomics Analysis Pipeline for tracking ViRal species from Environmental Datasets</i>) has been designed to assess the virome present in complex samples, with a particular focus on those obtained by the Target Enrichment Sequencing approach. This work aims to provide a user-friendly tool that complements this sequencing approach for the total or partial virome description, especially from environmental matrices. It includes a setup module which allows preparation and adjustment of the pipeline to any capture panel directed to a set of species of interest. 
The tool also aims to reduce time and computational cost, as well as to provide comprehensive, reproducible, and accessible results while being easy to customize, set up, and install.</p><p><strong>Availability and implementation: </strong>Source code and test datasets are freely available in the GitHub repository: https://github.com/CompGenLabUB/CAPTVRED.git.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae150"},"PeriodicalIF":2.4,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11495672/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142513908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances; Pub Date: 2024-10-07; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae145
Masato Tsutsui, Mariko Okada
{"title":"DynProfiler: a Python package for comprehensive analysis and interpretation of signaling dynamics leveraged by deep learning techniques.","authors":"Masato Tsutsui, Mariko Okada","doi":"10.1093/bioadv/vbae145","DOIUrl":"10.1093/bioadv/vbae145","url":null,"abstract":"<p><strong>Summary: </strong>Signaling dynamics encode important features and regulatory mechanisms of biological systems, and recent studies have reported the use of simulated signaling dynamics with mechanistic modeling as biomarkers for human diseases. Given the success of deep learning techniques, it is expected that they can extract informative patterns from simulation results more effectively than traditional approaches involving manual feature selection, which can be used for subsequent analyses, such as patient stratification and survival prediction. Here, we propose DynProfiler, which utilizes the entire signaling dynamics, including intermediate variables, as input and leverages deep learning techniques to extract informative features without requiring any labels. Furthermore, DynProfiler incorporates a modern explainable AI solution to provide quantitative time-dependent importance scores for each dynamics. Using simulated dynamics of patients with breast cancer as an example, we demonstrate DynProfiler's ability to extract high-quality features that can predict mortality risk and identify important dynamics, highlighting upregulated phosphorylated GSK3β as a biomarker for poor prognosis. 
Overall, this tool can be useful for clinical application, as well as for elucidating biological system dynamics.</p><p><strong>Availability and implementation: </strong>The DynProfiler Python library is available in GitHub at https://github.com/okadalabipr/DynProfiler.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae145"},"PeriodicalIF":2.4,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11464416/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142402170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances; Pub Date: 2024-10-04; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae140
{"title":"Correction to: Utilizing biological experimental data and molecular dynamics for the classification of mutational hotspots through machine learning.","authors":"","doi":"10.1093/bioadv/vbae140","DOIUrl":"https://doi.org/10.1093/bioadv/vbae140","url":null,"abstract":"<p><p>[This corrects the article DOI: 10.1093/bioadv/vbae125.].</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae140"},"PeriodicalIF":2.4,"publicationDate":"2024-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11453097/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142382608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances; Pub Date: 2024-10-03; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae148
Veronica Paparozzi, Christine Nardini
{"title":"tidysbml: R/Bioconductor package for SBML extraction into dataframes.","authors":"Veronica Paparozzi, Christine Nardini","doi":"10.1093/bioadv/vbae148","DOIUrl":"https://doi.org/10.1093/bioadv/vbae148","url":null,"abstract":"<p><strong>Summary: </strong>We present <i>tidysbml</i>, an R package able to perform <i>compartments</i>, <i>species</i>, and <i>reactions</i> data extraction from Systems Biology Markup Language (SBML) documents (up to Level 3) into tabular data structures (i.e. R dataframes) to easily access and handle the richness of the biological information. Thanks to its output format, the package facilitates data manipulation, enabling manageable construction, and therefore analysis, of custom networks, as well as data retrieval, by means of R packages such as <i>igraph</i>, <i>RCy3</i>, and <i>biomaRt</i>. Exemplar data (i.e. SBML files) are extracted from Reactome.</p><p><strong>Availability and implementation: </strong>The <i>tidysbml</i> R package is distributed under CC BY 4.0 License and is publicly available in Bioconductor (https://bioconductor.org/packages/tidysbml) and on GitHub (https://github.com/veronicapaparozzi/tidysbml).</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae148"},"PeriodicalIF":2.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11479578/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142482213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances; Pub Date: 2024-10-03; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae147
Rodolfo S Allendes Osorio, Yuji Kosugi, Johan T Nyström-Persson, Kenji Mizuguchi, Yayoi Natsume-Kitatani
{"title":"A modern multi-omics data exploration experience with Panomicon.","authors":"Rodolfo S Allendes Osorio, Yuji Kosugi, Johan T Nyström-Persson, Kenji Mizuguchi, Yayoi Natsume-Kitatani","doi":"10.1093/bioadv/vbae147","DOIUrl":"https://doi.org/10.1093/bioadv/vbae147","url":null,"abstract":"<p><strong>Summary: </strong>To address the challenges of the storage, sharing, and analysis of multi-omics data, here we introduce the newest version of Panomicon, which includes an improved underlying data model, a new registration and access control service, and seamless integration with other services (like TargetMine for data enrichment analysis), all integrated in a completely new, more user-friendly web application.</p><p><strong>Availability and implementation: </strong>Panomicon is available online at https://panomicon.nibiohn.go.jp. Unregistered users can access the publicly available data uploaded to Panomicon using the following account: user: guest, password: anonymous. Source code for the application is also freely available under a GNU license at https://github.com/Toxygates/Panomicon/. A brief user guide for the new features of Panomicon is provided as supplementary material online.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"4 1","pages":"vbae147"},"PeriodicalIF":2.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11520228/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142549261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}