{"title":"KADAIF: An Anomaly Detection Method for Complex Microbiome Data.","authors":"Omri Peleg, Maya Raytan, Elhanan Borenstein","doi":"10.1093/bioinformatics/btaf520","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf520","url":null,"abstract":"<p><strong>Motivation: </strong>The gut microbiome plays an important role in human health and disease, prompting large-scale studies that generate extensive datasets. A critical preprocessing step in analyzing such datasets is anomaly detection, which aims to identify erroneous samples and prevent misleading statistical outcomes. Microbiome data, however, pose unique challenges such as compositionality, sparsity, interdependencies, and high dimensionality, limiting the effectiveness of conventional methods and highlighting the need for specifically-tailored approaches for anomaly detection in microbiome data.</p><p><strong>Implementation: </strong>To address this challenge, we introduce KADAIF, a microbiome-specific anomaly detection method that generalizes the common Isolation Forest approach. As in Isolation Forest, KADAIF builds an ensemble of trees, each recursively partitioning the data along randomly selected features, and measures the average depth at which samples are isolated, assuming that anomalous samples will be isolated closer to the root. Unlike Isolation Forest, however, KADAIF partitions samples based on subsets of features (coupled with dimensionality reduction), addressing microbiome-specific properties such as sparsity and species interactions.</p><p><strong>Results: </strong>We evaluate KADAIF by simulating common scenarios that introduce anomalous behavior, demonstrating that KADAIF outperforms alternative methods across various settings and datasets. Furthermore, we show that KADAIF outperforms Isolation Forest in detecting anomalies also in other types of high dimensional sparse biological data. 
Finally, we show KADAIF's application for identifying disease onset in longitudinal microbiome data and for partitioning cases vs controls based on the Anna Karenina principle. Combined, our work highlights KADAIF's potential to enhance microbiome data processing and downstream analyses, with beneficial implications for precision medicine studies.</p><p><strong>Availability: </strong>An implementation of KADAIF, as well as all the code used for the analysis, is available on GitHub (https://github.com/borenstein-lab/KADAIF).</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145093126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pre-Meta: Priors-augmented Retrieval for LLM-based Metadata Generation.","authors":"Phil Tinn, Sondre Sørbø, Shanshan Jiang, Konstantinos Voutetakis, Sotiris Moudouris Giounis, Eleftherios Pilalis, Olga Papadodima, Dumitru Roman","doi":"10.1093/bioinformatics/btaf519","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf519","url":null,"abstract":"<p><strong>Motivation: </strong>While high-throughput sequencing technologies have dramatically accelerated genomic data generation, the manual processes required for dataset annotation and metadata creation impede the efficient discovery and publication of these resources across disparate public repositories. Large Language Models (LLMs) have the potential to streamline dataset profiling and discovery. However, their current limitations in generalizing across specialized knowledge domains, particularly in fields such as biomedical genomics, prevent them from fully realizing this potential. This paper presents Pre-Meta, an LLM-agnostic and domain-independent data annotation pipeline with an enriched retrieval procedure that leverages related priors-such as pre-generated metadata tags and ontologies-as auxiliary information to improve the accuracy of automated metadata generation.</p><p><strong>Results: </strong>Validated using five selected metadata fields sampled across 1500 papers, the Pre-Meta assisted annotation experiment-without finetuning and prompt optimization-demonstrates a systemic improvement in the annotation task: shown through a 23%, 72%, and 75% accuracy gain from conventional RAG adoptions of GPT-4o mini, Llama 8B, and Mistral 7B respectively.</p><p><strong>Availability: </strong>The code, data access, and scripts are available at: https://github.com/SINTEF-SE/LLMDap.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-18","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145093241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PatchWorkPlot: simultaneous visualization of local alignments across multiple sequences.","authors":"Mariia Pospelova, Yana Safonova","doi":"10.1093/bioinformatics/btaf504","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf504","url":null,"abstract":"<p><strong>Motivation: </strong>Revealing structural variations within and across populations is crucial for understanding their diversification mechanisms and roles. Existing tools for visualization of structural variations often require labor-intensive figure preparation and are limited in their ability to integrate annotations.</p><p><strong>Results: </strong>We developed PatchWorkPlot, a tool for automated visualization of pairwise alignments of multiple annotated sequences as dot plots combined into a single matrix. PatchWorkPlot enables exploration of positions, breakpoints, and architectures of structural variations across two or more sequences. The tool supports customization of visualization parameters and produces high-resolution, publication-ready figures. 
PatchWorkPlot significantly reduces manual work and simplifies the generation of complex plots for various cases, from individual loci to large-scale comparative projects.</p><p><strong>Availability: </strong>PatchWorkPlot is implemented using Python 3 and is publicly available at GitHub: github.com/yana-safonova/PatchWorkPlot.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145093175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Misdetection of frameshifts in SARS-CoV-2 genomes: need for additional harmonisation and efficient monitoring of data workflows.","authors":"Rok Kogoj, Mauro Petrillo, Samo Zakotnik, Alen Suljič, Miša Korva, Gabriele Leoni","doi":"10.1093/bioinformatics/btaf516","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf516","url":null,"abstract":"<p>Five years after the outbreak of the SARS-CoV-2 pandemic in 2020, diagnostic laboratories have moved from massive sequencing of thousands of samples to routine surveillance of SARS-CoV-2 cases as with all other respiratory viruses. Surveillance remains of paramount importance to prevent a further SARS-CoV-2 surge, as the virus has been shown to mutate rapidly and can render available drugs and vaccines ineffective. During the pandemic, several bioinformatics pipelines and workflows have been developed to streamline analysis, shorten turn-around time and ensure reproducibility. As the number of samples decreases, laboratories are moving towards more flexible sequencing strategies and optimising the cost per sample. However, workflow redesigns, even if individual steps have proven successful time and time again, can lead to challenges when changes in a bioinformatics pipeline are introduced (e.g., version updates, implementation of new features, etc.), a new combination of viral mutations emerges, or a change in wet-lab procedures leads to unpredictable results. Here we present a report of misidentified frameshift mutations in the consensus sequence of SARS-CoV-2, which led to an incorrect assumption of mutations in the spike and nucleocapsid viral proteins with the potential to affect PCR detection or even antigen testing. 
This investigation exemplifies the need for better awareness of the challenges that can occur even when using routinely applied protocols and analytical workflows and highlights the need for cooperation between experts of NGS, bioinformaticians and decision-makers towards more harmonised data workflows.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145093136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"InterVelo: A Mutually Enhancing Model for Estimating Pseudotime and RNA Velocity in Multi-Omic Single-Cell Data.","authors":"Yurou Wang, Zhixiang Lin, Tao Wang","doi":"10.1093/bioinformatics/btaf500","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf500","url":null,"abstract":"<p><strong>Motivation: </strong>RNA velocity has become a powerful tool for uncovering transcriptional dynamics in snapshot single-cell data. However, current RNA velocity approaches often assume constant transcriptional rates and treat genes independently with gene-specific times, which may introduce biases and deviate from biological realities. Here, we present InterVelo, a novel deep learning framework that simultaneously learns cellular pseudotime and RNA velocity.</p><p><strong>Results: </strong>InterVelo leverages an unsupervised cellular time to guide RNA velocity estimation, while the estimated RNA velocity in turn refines the direction of pseudotime. By benchmarking InterVelo against existing methods on both simulated and real datasets, we demonstrate its superior performance in recovering pseudotime and RNA velocity. InterVelo yields more precise velocity estimations in terms of both direction and magnitude, with outstanding robustness across diverse scenarios. Furthermore, it successfully identifies driver genes and enables reliable gene activity enrichment analysis. 
The flexible architecture of InterVelo also allows for the integration of multi-omic data, enhancing its applicability to complex biological systems.</p><p><strong>Availability: </strong>InterVelo is implemented in Python; the code is available on GitHub (https://github.com/yurouwang-rosie/InterVelo) and has been archived with a DOI (https://doi.org/10.5281/zenodo.16158798) for reproducibility.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145034736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PLSKO: a robust knockoff generator to control false discovery rate in omics variable selection.","authors":"Guannan Yang, Ellen Menkhorst, Evdokia Dimitriadis, Kim-Anh Lê Cao","doi":"10.1093/bioinformatics/btaf475","DOIUrl":"10.1093/bioinformatics/btaf475","url":null,"abstract":"<p><strong>Motivation: </strong>Integrating the knockoff framework with any variable-selection method delivers stringent false discovery rate (FDR) control without recourse to p-values, offering a powerful alternative for differential expression analysis of high-throughput omics datasets. However, existing knockoff generators rely on restrictive modelling assumptions or coarse approximations that often inflate the FDR when applied to real-world data.</p><p><strong>Results: </strong>We introduce Partial Least Squares Knockoff (PLSKO), an efficient, assumption-free generator that remains robust across diverse omics platforms. Our extensive simulations show that PLSKO is the only method to maintain FDR control with sufficient power in complex non-linear settings. Our semi-simulation studies drawn from RNA-seq, proteomics, metabolomics, and microbiome experiments confirm PLSKO generates valid knockoff variables. 
In pre-eclampsia multi-omics case studies, we combine PLSKO with Aggregation Knockoff to address the randomness of knockoffs and improve power, and demonstrate the method's ability to recover biologically meaningful features.</p><p><strong>Availability and implementation: </strong>Our proposed algorithm is available on Github (https://github.com/guannan-yang/PLSKO) and Zenodo (https://doi.org/10.5281/zenodo.16879594).</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12449248/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144982387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FIBOS: R and python packages for analyzing protein packing and structure.","authors":"Herson H M Soares, João P R Romanelli, Patrick J Fleming, Carlos H da Silveira","doi":"10.1093/bioinformatics/btaf434","DOIUrl":"10.1093/bioinformatics/btaf434","url":null,"abstract":"<p><strong>Motivation: </strong>Advances in the prediction of the 3D structures of most known proteins through machine learning have achieved unprecedented accuracies. However, although these computed models are remarkably good, their accuracy at the atomic level remains a challenge. The Occluded Surface (OS) algorithm is widely used for atomic packing analysis, but it lacks implementations in high-level languages.</p><p><strong>Results: </strong>We introduce FIBOS, an R and Python package incorporating the OS methodology with enhancements. We show how FIBOS can be used to compare experimental structures and AlphaFold predictions at the atomic level. Although the average packing was similar, AlphaFold models exhibited slightly greater variability, revealing a specific pattern of outliers.</p><p><strong>Availability and implementation: </strong>FIBOS can be installed locally as a Python package from PyPI or an R package from CRAN, and it is also available at https://github.com/insilico-unifei/fibos-R and https://github.com/insilico-unifei/fibos-py.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12449057/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144786122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PF-AGCN: an adaptive graph convolutional network for protein-protein interaction-based function prediction.","authors":"Shumin Yang, Yuhan Su, Yuchen Lin, Qin Lin, Zhong Chen","doi":"10.1093/bioinformatics/btaf473","DOIUrl":"10.1093/bioinformatics/btaf473","url":null,"abstract":"<p><strong>Motivation: </strong>Proteins carry out most biological processes via interactions with other proteins, known as protein-protein interactions (PPIs). Accurately predicting PPIs is crucial for understanding protein function, yet existing methods often fall short in capturing their complex and hierarchical nature.</p><p><strong>Results: </strong>We propose PF-AGCN, an adaptive graph convolutional network that leverages two distinct graph structures: a function graph representing hierarchical Gene Ontology term relationships and a protein graph modeling direct interactions between proteins. Unlike traditional graph attention networks, PF-AGCN preserves the original biological structures while dynamically learning new relationships, ensuring the retention of essential biological information. Additionally, our framework integrates a protein language model with stacked dilated causal convolutional neural networks, enabling the synergistic fusion of global sequence semantics and local structural patterns. 
Extensive experiments on a comprehensive protein dataset across three evaluation facets demonstrate PF-AGCN's superior prediction accuracy.</p><p><strong>Availability and implementation: </strong>The source code is publicly available at https://github.com/smyang107/PFAGCN.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448829/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144982426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BISON: bi-clustering of spatial omics data with feature selection.","authors":"Bencong Zhu, Alberto Cassese, Marina Vannucci, Michele Guindani, Qiwei Li","doi":"10.1093/bioinformatics/btaf495","DOIUrl":"10.1093/bioinformatics/btaf495","url":null,"abstract":"<p><strong>Motivation: </strong>The advent of next-generation sequencing-based spatially resolved transcriptomics (SRT) techniques has reshaped genomic studies by enabling high-throughput gene expression profiling while preserving spatial and morphological context. Understanding gene functions and interactions in different spatial domains is crucial, as it can enhance our comprehension of biological mechanisms, such as cancer-immune interactions and cell differentiation in various regions. It is necessary to cluster tissue regions into distinct spatial domains and identify discriminating genes (DGs) that elucidate the clustering result, referred to as spatial domain-specific DGs. Existing methods for identifying these genes typically rely on a two-stage approach, which can lead to the phenomenon known as double-dipping.</p><p><strong>Results: </strong>To address the challenge, we propose a unified Bayesian latent block model that simultaneously detects a list of DGs contributing to spatial domain identification while clustering these DGs and spatial locations. 
The efficacy of our proposed method is validated through a series of simulation experiments, and its capability to identify DGs is demonstrated through applications to benchmark SRT datasets.</p><p><strong>Availability and implementation: </strong>The R/C++ implementation of BISON is available at https://github.com/new-zbc/BISON.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12463466/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145031437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"getDNB: identifying dynamic network biomarkers of hepatocellular carcinoma from time-varying gene regulations utilizing graph embedding techniques for anomaly detection.","authors":"Tong Wang, Zhi-Ping Liu","doi":"10.1093/bioinformatics/btaf518","DOIUrl":"10.1093/bioinformatics/btaf518","url":null,"abstract":"<p><strong>Motivation: </strong>Early detection and timely intervention of hepatocellular carcinoma (HCC) are pivotal for improving patient prognosis. Current diagnostic approaches often detect HCC at later stages, thereby diminishing treatment efficacy. Recent advancements in high-throughput sequencing technology have vastly improved the identification of molecular markers via biological networks. However, existing methodologies frequently overlook the intricate gene interaction information in temporal gene regulatory networks. Therefore, our study proposes an algorithm model, getDNB, leveraging a graph embedding technique (get) for anomaly detection in time-varying dynamic networks. The model aims to facilitate early HCC detection and propel precision medicine by recognizing dynamic network biomarkers (DNBs).</p><p><strong>Results: </strong>We proposed the getDNB model, which utilizes graph convolutional networks for graph embedding, mapping high-dimensional gene regulatory networks to low-dimensional feature vector spaces. By calculating gene anomaly degrees through an outlier score, and using the minimum dominant set algorithm alongside the shortest path algorithm, we discovered DNBs and their associated networks in HCC. The getDNB model successfully pinpointed 33 HCC DNBs, effectively differentiating various temporal stages of HCC progression, and demonstrated robustness across numerous real HCC datasets. 
Functional enrichment analysis unveiled that these DNBs play critical roles in HCC occurrence and development, outperforming widely used feature selection algorithms.</p><p><strong>Availability and implementation: </strong>The source code and data can be found at https://github.com/zpliulab/getDNB.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12461858/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145093173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}