Giulio Formenti, Bonhwang Koo, Marco Sollitto, Jennifer Balacco, Nadolina Brajuka, Richard Burhans, Erick Duarte, Alice M Giani, Kirsty McCaffrey, Jack A Medico, Eugene W Myers, Patrik Smeds, Anton Nekrutenko, Erich D Jarvis
{"title":"Evaluation of sequencing reads at scale using rdeval.","authors":"Giulio Formenti, Bonhwang Koo, Marco Sollitto, Jennifer Balacco, Nadolina Brajuka, Richard Burhans, Erick Duarte, Alice M Giani, Kirsty McCaffrey, Jack A Medico, Eugene W Myers, Patrik Smeds, Anton Nekrutenko, Erich D Jarvis","doi":"10.1093/bioinformatics/btaf416","DOIUrl":"10.1093/bioinformatics/btaf416","url":null,"abstract":"<p><strong>Motivation: </strong>Large sequencing datasets are being produced and deposited into public archives at unprecedented rates. The availability of tools that can reliably and efficiently generate and store sequencing read summary statistics has become critical.</p><p><strong>Results: </strong>As part of the effort by the Vertebrate Genomes Project (VGP) to generate high-quality reference genomes at scale, we sought to address the community's need for efficient sequence data evaluation by developing rdeval, a standalone tool to quickly compute and interactively display sequencing read metrics. Rdeval can either run on the fly or store key sequence data metrics in tiny read 'snapshot' files. Statistics can then be efficiently recalled from snapshots for additional processing. Rdeval can convert fa*[.gz] files to and from other popular formats including BAM and CRAM for better compression. Overall, while CRAM achieves the best compression, the gain compared to BAM is marginal, and BAM achieves the best compromise between data compression and access speed. Rdeval also generates a detailed visual report with multiple data analytics that can be exported in various formats. We showcase rdeval's functionalities using long-read data from different sequencing platforms and species, including human. 
For PacBio long-read sequencing, our analysis shows dramatic improvements in both read length and quality over time, as well as the benefit of increased coverage for genome assembly, though the magnitude varies by taxon.</p><p><strong>Availability and implementation: </strong>Rdeval is implemented in C++ for data processing and in R for data visualization. Precompiled releases (Linux, macOS, Windows) and commented source code for rdeval are available under the MIT license at https://github.com/vgl-hub/rdeval. Documentation is available on ReadTheDocs (https://rdeval-documentation.readthedocs.io). Rdeval is also available in Bioconda and in Galaxy (https://usegalaxy.org). An automated test workflow ensures the consistency of software updates.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12401588/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144692750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leonie J Lorenz, Antoine Andréoletti, Tung V N Nguyen, Henning Hermjakob, Richard G FitzJohn, Rahuman S Malik Sheriff, John A Lees
{"title":"SBMLtoOdin and Menelmacar: interactive visualisation of systems biology models for expert and non-expert audiences.","authors":"Leonie J Lorenz, Antoine Andréoletti, Tung V N Nguyen, Henning Hermjakob, Richard G FitzJohn, Rahuman S Malik Sheriff, John A Lees","doi":"10.1093/bioinformatics/btaf484","DOIUrl":"10.1093/bioinformatics/btaf484","url":null,"abstract":"<p><strong>Summary: </strong>Computational models in biology can increase our understanding of biological systems, be used to answer research questions, and make predictions. The accessibility and reusability of computational models are limited, and their use is often restricted to experts in programming and mathematics. This is because entire models and solvers must be implemented from the mathematical notation in which models are normally presented. Here, we present SBMLtoOdin, an R package that translates differential equation models in SBML format from the BioModels database into executable R code using the R package odin, allowing researchers to easily reuse models. We also present Menelmacar, a web-based application that provides interactive visualisations of these models by solving their differential equations in the browser. This platform allows non-experts to simulate and investigate models using an easy-to-use interface.</p><p><strong>Availability and implementation: </strong>SBMLtoOdin is published under the open source Apache 2.0 licence at https://github.com/bacpop/SBMLtoOdin and can be installed as an R package. 
The code for the Menelmacar website is published under the MIT License at https://github.com/bacpop/odinviewer, and the website can be found at https://biomodels.bacpop.org/.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12472120/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144982415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Camille Aucouturier, Nicolas Goardon, Laurent Castéra, Alexandre Atkinson, Thibaut Lavolé, Angélina Legros, Agathe Ricou, Flavie Boulouard, Sophie Krieger, Raphaël Leman
{"title":"Decipher RNA isoform combinations from minigene splicing assays and massive parallel sequencing with MAGIC.","authors":"Camille Aucouturier, Nicolas Goardon, Laurent Castéra, Alexandre Atkinson, Thibaut Lavolé, Angélina Legros, Agathe Ricou, Flavie Boulouard, Sophie Krieger, Raphaël Leman","doi":"10.1093/bioinformatics/btaf525","DOIUrl":"10.1093/bioinformatics/btaf525","url":null,"abstract":"<p><strong>Summary: </strong>Functional testing of RNA using minigene splicing assays is increasingly being realized to demonstrate the effects of variants on splicing. In complex cases, variant pathogenicity is assessed by Sanger sequencing, which can be time consuming and may be replaced by short read sequencing. Moreover, strategies based on long read sequencing of the amplified minigene construct are promising and allow the isoforms to be fully characterized. We introduce MAGIC, a user-friendly tool that first generates the artificial construction genome files required to then perform alignment, assembly and annotation of the isoforms obtained by either short or long read minigene splicing assay sequencing.</p><p><strong>Availability and implementation: </strong>MAGIC is available at https://github.com/LBGC-CFB/MAGIC. Zenodo DOI: 10.5281/zenodo.17052752.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12479391/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145093141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structure-guided sequence representation learning for generalizable protein function prediction.","authors":"SeokJun On, Yujin Jeong, Eun-Sol Kim","doi":"10.1093/bioinformatics/btaf511","DOIUrl":"10.1093/bioinformatics/btaf511","url":null,"abstract":"<p><strong>Motivation: </strong>Accurately predicting protein function from sequence remains a fundamental yet challenging goal in computational biology. Although recent advances have enabled the reliable prediction of protein 3D structures from sequences, utilizing structural information alone for functional inference has shown limited success. To address this gap, previous work has explored the integration of sequence and structural data by representing proteins as graphs, where residues are modeled as nodes, and spatial proximity defines edges. However, since the number of amino acids can vary significantly between proteins, the resulting graphs, constructed based on amino acids, also differ greatly in size. This large variation poses a challenge, as it becomes extremely difficult to extract generalizable information from graphs of such differing scales accurately. In this work, we propose Structure-guided Sequence Representation Learning, a novel framework that incorporates structural knowledge to extract informative, multiscale features directly from protein sequences. By embedding structural information into a sequence-based learning paradigm, our method captures functionally meaningful representations more effectively. Furthermore, we present a generalizable model architecture designed for multitask learning and inference, offering improved performance and flexibility over traditional task-specific approaches to protein function prediction.</p><p><strong>Results: </strong>In this article, we demonstrate that the proposed novel attention pooling method on protein graphs effectively integrates global structural features and local chemical properties of amino acids in various-length proteins. 
Through this approach, we improve performance in tasks related to predicting protein functions, functional expression sites, and their relationships with structure and sequence. By effectively extracting the information needed to predict multiple protein functions simultaneously, we improve efficiency by eliminating the need for separate learning.</p><p><strong>Availability and implementation: </strong>The code implementation is available at https://github.com/vanha9/S2RL_protein and has also been archived on Zenodo: https://doi.org/10.5281/zenodo.16441001.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12478692/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145093263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sophie-Marie Wind, Thea Reinkens, Yvonne Lisa Behrens, Sarah Sandmann
{"title":"scafari: exploring scDNA-seq data.","authors":"Sophie-Marie Wind, Thea Reinkens, Yvonne Lisa Behrens, Sarah Sandmann","doi":"10.1093/bioinformatics/btaf477","DOIUrl":"10.1093/bioinformatics/btaf477","url":null,"abstract":"<p><strong>Summary: </strong>Recent advances in single-cell sequencing made it possible to not just analyze a cell's individual expression pattern, but to gain insights into a single cell's genome using the cutting-edge technology single-cell DNA sequencing. Mission Bio is, with the Tapestri platform, one of the few providers of this technology. So far, however, there is only little open-source software available for user-friendly processing and quality analysis of this data type. With scafari, we present a tool that offers easy-to-use data quality control as well as explorative variant analyses and visualization.</p><p><strong>Availability and implementation: </strong>scafari is implemented as an R Bioconductor package featuring a shiny application and is available at https://bioconductor.org/packages/scafari.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12449247/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144982355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mallek Mziou-Sallami, Pierrick Roger, Arnaud Gloaguen, Claire Dandine-Roulland, Thierry Jiogho Ngaho, Solène Brohard, Kévin Muret, Florian Sandron, Eric Bonnet, Jean-Francois Deleuze, Edith Le Floch, Vincent Meyer
{"title":"GNNenrich: a novel method for pathway enrichment analysis based on graph neural network.","authors":"Mallek Mziou-Sallami, Pierrick Roger, Arnaud Gloaguen, Claire Dandine-Roulland, Thierry Jiogho Ngaho, Solène Brohard, Kévin Muret, Florian Sandron, Eric Bonnet, Jean-Francois Deleuze, Edith Le Floch, Vincent Meyer","doi":"10.1093/bioinformatics/btaf478","DOIUrl":"10.1093/bioinformatics/btaf478","url":null,"abstract":"<p><strong>Motivation: </strong>Graph neural network (GNN) models have emerged in many fields, notably for biological networks constituted by genes or proteins and their interactions. The majority of enrichment study methods apply over-representation analysis and gene/protein set scores according to the existing overlap between pathways. Such methods neglect the knowledge that comes from the interactions between the gene/protein sets. Here, we introduce a novel GNN-based enrichment analysis method called GNNenrich. GNNenrich, through multiple levels of embedding that integrate protein sequence properties and the interaction network, establishes functional relationships to support biological interpretation.</p><p><strong>Results: </strong>GNNenrich has been tested and compared to an over-representation analysis technique (g:Profiler) and a graph-based method (EnrichNet). 
It demonstrates the capacity to reproduce results provided by other approaches and offers new perspectives for interpretation, returning relevant results supported by protein-protein interactions (PPIs).</p><p><strong>Availability and implementation: </strong>Source code is available at https://gitlab.com/cnrgh/gnn-enrich/gnn-enrich-article-demo.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448840/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145016953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jeffrey J Czajka, Joonhoon Kim, Yinjie J Tang, Kyle R Pomraning, Aindrila Mukhopadhyay, Hector Garcia Martin
{"title":"FluxRETAP: a REaction TArget Prioritization genome-scale modeling technique for selecting genetic targets.","authors":"Jeffrey J Czajka, Joonhoon Kim, Yinjie J Tang, Kyle R Pomraning, Aindrila Mukhopadhyay, Hector Garcia Martin","doi":"10.1093/bioinformatics/btaf471","DOIUrl":"10.1093/bioinformatics/btaf471","url":null,"abstract":"<p><strong>Motivation: </strong>Metabolic engineering is rapidly evolving as a result of new advances in synthetic biology tools and automation platforms that enable high throughput strain construction, as well as the development of machine learning (ML) tools for biology. However, selecting genetic engineering targets that effectively guide the metabolic engineering process is still challenging. ML can provide predictive power for synthetic biology, but current technical limitations prevent the independent use of ML approaches without previous biological knowledge.</p><p><strong>Results: </strong>Here, we present FluxRETAP, a simple and computationally inexpensive method that leverages the prior mechanistic knowledge embedded in genome-scale models for suggesting targets for genetic overexpression, downregulation or deletion, with the final goal of increasing the production of a desired metabolite. This method can provide a list of desirable engineering targets that can be combined with current ML pipelines. FluxRETAP captured 100% of reaction targets experimentally verified to improve Escherichia coli isoprenol production, 50% of targets that experimentally improved taxadiene production in E. coli and ∼60% of genetic targets from a verified minimal constrained cut-set in Pseudomonas putida, while providing additional high priority targets that could be tested. Overall, FluxRETAP is an efficient algorithm for identifying a prioritized list of testable genetic and reaction targets.</p><p><strong>Availability and implementation: </strong>FluxRETAP is implemented in Python and released under a Creative Commons license. 
The implementation and code are freely available at: https://github.com/JBEI/FluxRETAP.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12417073/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144982261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Panaln: indexing pangenome for read alignment.","authors":"Lilu Guo, Zongtao He, Hongwei Huo","doi":"10.1093/bioinformatics/btaf476","DOIUrl":"10.1093/bioinformatics/btaf476","url":null,"abstract":"<p><strong>Motivation: </strong>Pangenome indexing is a critical supporting technology in biological sequence analysis, for example in read alignment applications. The need to accurately identify billions of small sequencing fragments carrying sequencing errors and genomic variants drives the development of scalable and efficient pangenome indexing approaches.</p><p><strong>Results: </strong>We propose a new wavelet tree-based approach, called Panaln, for indexing pangenomes and introduce a batch computation approach for fast count queries over Panaln. We present a simple and effective seeding strategy and develop a pangenome program that uses the seed-and-extend paradigm for read alignment. Experimental results on simulated and real data demonstrate that Panaln uses significantly less space than the compared pangenome methods, with generally higher accuracy. We provide scalable index construction by representing the pangenome with a linear model. 
Additionally, Panaln achieves higher accuracy than the popular single-reference methods.</p><p><strong>Availability and implementation: </strong>Package: https://anaconda.org/bioconda/panaln and source code: https://github.com/Lilu-guo/Panaln.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448906/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144982409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accurate prediction of toxicity peptide and its function using multi-view tensor learning and latent semantic learning framework.","authors":"Ke Yan, Shutao Chen, Bin Liu, Hao Wu","doi":"10.1093/bioinformatics/btaf489","DOIUrl":"10.1093/bioinformatics/btaf489","url":null,"abstract":"<p><strong>Motivation: </strong>Therapeutic peptide is an important ingredient in the treatment of various diseases and drug discovery. The toxicity of peptides is one of the major challenges in peptide drug therapy. With the abundance of therapeutic peptides generated in the post-genomics era, it is a challenge to promptly identify toxicity peptides using computational methods. Although several efforts have been made, few algorithms are designed to identify whether a query peptide exhibits toxicity. Considering the varied levels of biological activities, the toxicity peptides should be further classified into multi-functional peptides.</p><p><strong>Results: </strong>This study introduces a two-level predictor, ToxPre-2L, developed using the multi-view tensor learning and latent semantic learning framework. The proposed method utilized multi-label learning with feature induced labels to avoid the redundancy of information from each view. Then the multi-view tensor learning was employed to establish the latent semantic information among different views, while low-rank constraint learning was leveraged to exploit the correlation information among multi-labels. Finally, we constructed an updated toxicity peptide benchmark dataset to assess the effectiveness of the proposed method. 
Experimental results demonstrated that ToxPre-2L achieves a better performance than alternative computational methods in the prediction of toxicity peptides and their multi-functional types.</p><p><strong>Availability and implementation: </strong>The source code and data of ToxPre-2L can be accessed at http://bliulab.net/ToxPre-2L.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12457739/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144994587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gianna Serafina Monti, Meritxell Pujolassos, Malu Calle Rosingana, Peter Filzmoser
{"title":"Robust multivariate regression controlling false discoveries for microbiome data.","authors":"Gianna Serafina Monti, Meritxell Pujolassos, Malu Calle Rosingana, Peter Filzmoser","doi":"10.1093/bioinformatics/btaf506","DOIUrl":"10.1093/bioinformatics/btaf506","url":null,"abstract":"<p><strong>Motivation: </strong>Understanding how bacterial species relate to clinical health indicators can reveal microbiome signatures of disease, offering insights into conditions such as obesity or liver disease. However, analyzing such data requires methods that address compositionality, high dimensionality, sparsity, and outliers.</p><p><strong>Results: </strong>We tackle the challenge of identifying microbiome components linked to health indicators through a robust multivariate compositional regression model. Our method addresses the high dimensionality, sparsity, and compositional nature of microbiome data while maintaining control of the false discovery rate (FDR). By incorporating outlier robustness and a derandomization step, we enhance the stability and reproducibility of results, surpassing current techniques like the Multi-Response Knockoff Filter (MRKF). In simulation studies, our method outperforms MRKF in terms of FDR control, power, and robustness. 
In real data applications, it leads to valuable biological insights, such as identifying microbial species associated with specific clinical parameters.</p><p><strong>Availability and implementation: </strong>Software in R code format, along with synthetic data example illustrations and comprehensive documentation, is available at https://github.com/giannamonti/RobMReg.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12479396/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145093249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}