{"title":"Minimum uncertainty as Bayesian network model selection principle.","authors":"Grigoriy Gogoshin, Andrei S Rodin","doi":"10.1186/s12859-025-06104-5","DOIUrl":"10.1186/s12859-025-06104-5","url":null,"abstract":"<p><strong>Background: </strong>Bayesian Network (BN) modeling is a prominent methodology in computational systems biology. However, the incommensurability of datasets frequently encountered in life science domains gives rise to contextual dependence and numerical irregularities in the behavior of model selection criteria (such as MDL, Minimum Description Length) used in BN reconstruction. This renders model features, first and foremost dependency strengths, incomparable and difficult to interpret. In this study, we derive and evaluate a model selection principle that addresses these problems.</p><p><strong>Results: </strong>The objective of the study is attained by (i) approaching model evaluation as a misspecification problem, (ii) estimating the effect that sampling error has on the satisfiability of conditional independence criterion, as reflected by Mutual Information, and (iii) utilizing this error estimate to penalize uncertainty with the novel Minimum Uncertainty (MU) model selection principle. We validate our findings numerically and demonstrate the performance advantages of the MU criterion. 
Finally, we illustrate the advantages of the new model evaluation framework on real data examples.</p><p><strong>Conclusions: </strong>The new BN model selection principle successfully overcomes performance irregularities observed with MDL, offers a superior average convergence rate in BN reconstruction, and improves the interpretability and universality of resulting BNs, thus enabling direct inter-BN comparisons and evaluations.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"100"},"PeriodicalIF":2.9,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11980298/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143810356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepMethyGene: a deep-learning model to predict gene expression using DNA methylations.","authors":"Yuyao Yan, Xinyi Chai, Jiajun Liu, Sijia Wang, Wenran Li, Tao Huang","doi":"10.1186/s12859-025-06115-2","DOIUrl":"10.1186/s12859-025-06115-2","url":null,"abstract":"<p><p>Gene expression is the basis for cells to achieve various functions, while DNA methylation constitutes a critical epigenetic mechanism governing gene expression regulation. Here we propose DeepMethyGene, an adaptive recursive convolutional neural network model based on ResNet that predicts gene expression using DNA methylation information. Our model transforms methylation Beta values to M values for Gaussian distributed data optimization, dynamically adjusts the output channels according to input dimension, and implements residual blocks to mitigate the problem of gradient vanishing when training very deep networks. Benchmarking against the state-of-the-art geneEXPLORE model (R<sup>2</sup> = 0.449), DeepMethyGene (R<sup>2</sup> = 0.640) demonstrated superior predictive performance. Further analysis revealed that the number of methylation sites and the average distance between these sites and gene transcription start sites (TSS) significantly affected the prediction accuracy. By exploring the complex relationship between methylation and gene expression, this study provides theoretical support for disease progression prediction and clinical intervention. 
Relevant data and code are available at https://github.com/yaoyao-11/DeepMethyGene .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"99"},"PeriodicalIF":2.9,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11977931/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143810270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian compositional generalized linear mixed models for disease prediction using microbiome data.","authors":"Li Zhang, Xinyan Zhang, Justin M Leach, A K M F Rahman, Carrie R Howell, Nengjun Yi","doi":"10.1186/s12859-025-06114-3","DOIUrl":"10.1186/s12859-025-06114-3","url":null,"abstract":"<p><p>The primary goal of predictive modeling for compositional microbiome data is to better understand and predict disease susceptibility based on the relative abundance of microbial species. Current approaches in this area often assume a high-dimensional sparse setting, where only a small subset of microbiome features is considered relevant to the outcome. However, in real-world data, both large and small effects frequently coexist, and acknowledging the contribution of smaller effects can significantly enhance predictive performance. To address this challenge, we developed Bayesian Compositional Generalized Linear Mixed Models for Analyzing Microbiome Data (BCGLMM). BCGLMM is capable of identifying both moderate taxa effects and the cumulative impact of numerous minor taxa, which are often overlooked in conventional models. With a sparsity-inducing prior, the structured regularized horseshoe prior, BCGLMM effectively collaborates phylogenetically related moderate effects. The random effect term efficiently captures sample-related minor effects by incorporating sample similarities within its variance-covariance matrix. We fitted the proposed models using Markov Chain Monte Carlo (MCMC) algorithms with rstan. The performance of the proposed method was evaluated through extensive simulation studies, demonstrating its superiority with higher prediction accuracy compared to existing methods. We then applied the proposed method on American Gut Data to predict inflammatory bowel disease (IBD). 
To ensure reproducibility, the code and data used in this paper are available at https://github.com/Li-Zhang28/BCGLMM .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"98"},"PeriodicalIF":2.9,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11971746/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143787694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Goistrat: gene-of-interest-based sample stratification for the evaluation of functional differences.","authors":"Carlos Uziel Pérez Malla, Jessica Kalla, Andreas Tiefenbacher, Gabriel Wasinger, Kilian Kluge, Gerda Egger, Raheleh Sheibani-Tezerji","doi":"10.1186/s12859-025-06109-0","DOIUrl":"10.1186/s12859-025-06109-0","url":null,"abstract":"<p><strong>Purpose: </strong>Understanding the impact of gene expression in pathological processes, such as carcinogenesis, is crucial for understanding the biology of cancer and advancing personalised medicine. Yet, current methods lack biologically-informed-omics approaches to stratify cancer patients effectively, limiting our ability to dissect the underlying molecular mechanisms.</p><p><strong>Results: </strong>To address this gap, we present a novel workflow for the stratification and further analysis of multi-omics samples with matched RNA-Seq data that relies on MSigDB curated gene sets, graph machine learning and ensemble clustering. We compared the performance of our workflow in the top 8 TCGA datasets and showed its clear superiority in separating samples for the study of biological differences. We also applied our workflow to analyse nearly a thousand prostate cancer samples, focusing on the varying expression of the FOLH1 gene, and identified specific pathways such as the PI3K-AKT-mTOR gene sets as well as signatures linked to prostate tumour aggressiveness.</p><p><strong>Conclusion: </strong>Our comprehensive approach provides a novel tool to identify disease-relevant functions of genes of interest (GOI) in large datasets. 
This integrated approach offers a valuable framework for understanding the role of the expression variation of a GOI in complex diseases and for informing on targeted therapeutic strategies.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"97"},"PeriodicalIF":2.9,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11971790/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143787695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DconnLoop: a deep learning model for predicting chromatin loops based on multi-source data integration.","authors":"Junfeng Wang, Kuikui Cheng, Chaokun Yan, Huimin Luo, Junwei Luo","doi":"10.1186/s12859-025-06092-6","DOIUrl":"10.1186/s12859-025-06092-6","url":null,"abstract":"<p><strong>Background: </strong>Chromatin loops are critical for the three-dimensional organization of the genome and gene regulation. Accurate identification of chromatin loops is essential for understanding the regulatory mechanisms in disease. However, current mainstream detection methods rely primarily on single-source data, such as Hi-C, which limits these methods' ability to capture the diverse features of chromatin loop structures. In contrast, multi-source data integration and deep learning approaches, though not yet widely applied, hold significant potential.</p><p><strong>Results: </strong>In this study, we developed a method called DconnLoop to integrate Hi-C, ChIP-seq, and ATAC-seq data to predict chromatin loops. This method achieves feature extraction and fusion of multi-source data by integrating residual mechanisms, directional connectivity excitation modules, and interactive feature space decoders. Finally, we apply density estimation and density clustering to the genome-wide prediction results to identify more representative loops. The code is available from https://github.com/kuikui-C/DconnLoop .</p><p><strong>Conclusions: </strong>The results demonstrate that DconnLoop outperforms existing methods in both precision and recall. In various experiments, including Aggregate Peak Analysis and peak enrichment comparisons, DconnLoop consistently shows advantages. 
Extensive ablation studies and validation across different sequencing depths further confirm DconnLoop's robustness and generalizability.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"96"},"PeriodicalIF":2.9,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11959853/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143762863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"optRF: Optimising random forest stability by determining the optimal number of trees.","authors":"Thomas M Lange, Mehmet Gültas, Armin O Schmitt, Felix Heinrich","doi":"10.1186/s12859-025-06097-1","DOIUrl":"10.1186/s12859-025-06097-1","url":null,"abstract":"<p><p>Machine learning is frequently used to make decisions based on big data. Among these techniques, random forest is particularly prominent. Although random forest is known to have many advantages, one aspect that is often overlooked is that it is a non-deterministic method that can produce different models using the same input data. This can have severe consequences for decision-making processes. In this study, we introduce a method to quantify the impact of non-determinism on predictions, variable importance estimates, and decisions based on the predictions or variable importance estimates. Our findings demonstrate that increasing the number of trees in random forests enhances the stability in a non-linear way while computation time increases linearly. Consequently, we conclude that there exists an optimal number of trees for any given data set that maximises the stability without unnecessarily increasing the computation time. 
Based on these findings, we have developed the R package optRF which models the relationship between the number of trees and the stability of random forest, providing recommendations for the optimal number of trees for any given data set.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"95"},"PeriodicalIF":2.9,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11959736/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143750885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Amogel: a multi-omics classification framework using associative graph neural networks with prior knowledge for biomarker identification.","authors":"Chia Yan Tan, Huey Fang Ong, Chern Hong Lim, Mei Sze Tan, Ean Hin Ooi, KokSheik Wong","doi":"10.1186/s12859-025-06111-6","DOIUrl":"10.1186/s12859-025-06111-6","url":null,"abstract":"<p><p>The advent of high-throughput sequencing technologies, such as DNA microarray and DNA sequencing, has enabled effective analysis of cancer subtypes and targeted treatment. Furthermore, numerous studies have highlighted the capability of graph neural networks (GNN) to model complex biological systems and capture non-linear interactions in high-throughput data. GNN has proven to be useful in leveraging multiple types of omics data, including prior biological knowledge from various sources, such as transcriptomics, genomics, proteomics, and metabolomics, to improve cancer classification. However, current works do not fully utilize the non-linear learning potential of GNN and lack the integration ability to analyse high-throughput multi-omics data simultaneously with prior biological knowledge. Nevertheless, relying on limited prior knowledge in generating gene graphs might lead to less accurate classification due to undiscovered significant gene-gene interactions, which may require expert intervention and can be time-consuming. Hence, this study proposes a graph classification model called associative multi-omics graph embedding learning (AMOGEL) to effectively integrate multi-omics datasets and prior knowledge through GNN coupled with association rule mining (ARM). AMOGEL employs an early fusion technique using ARM to mine intra-omics and inter-omics relationships, forming a multi-omics synthetic information graph before the model training. Moreover, AMOGEL introduces multi-dimensional edges, with multi-omics gene associations or edges as the main contributors and prior knowledge edges as auxiliary contributors. 
Additionally, it uses a gene ranking technique based on attention scores, considering the relationships between neighbouring genes. Several experiments were performed on BRCA and KIPAN cancer subtypes to demonstrate the integration of multi-omics datasets (miRNA, mRNA, and DNA methylation) with prior biological knowledge of protein-protein interactions, KEGG pathways and Gene Ontology. The experimental results showed that the AMOGEL outperformed the current state-of-the-art models in terms of classification accuracy, F1 score and AUC score. The findings of this study represent a crucial step forward in advancing the effective integration of multi-omics data and prior knowledge to improve cancer subtype classification.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"94"},"PeriodicalIF":2.9,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11954243/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143741886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCITUNA: single-cell data integration tool using network alignment.","authors":"Aissa Houdjedj, Yacine Marouf, Mekan Myradov, Süleyman Onur Doğan, Burak Onur Erten, Oznur Tastan, Cesim Erten, Hilal Kazan","doi":"10.1186/s12859-025-06087-3","DOIUrl":"10.1186/s12859-025-06087-3","url":null,"abstract":"<p><strong>Background: </strong>As single-cell genomics experiments increase in complexity and scale, the need to integrate multiple datasets has grown. Such integration enhances cellular feature identification by leveraging larger data volumes. However, batch effects (technical variations arising from differences in labs, times, or protocols) pose a significant challenge. Despite numerous proposed batch correction methods, many still have limitations, such as outputting only dimension-reduced data, relying on computationally intensive models, or resulting in overcorrection for batches with diverse cell type composition.</p><p><strong>Results: </strong>We introduce a novel method for batch effect correction named SCITUNA, a Single-Cell data Integration Tool Using Network Alignment. We perform evaluations on 39 individual batches from four real datasets and a simulated dataset, which include both scRNA-seq and scATAC-seq datasets, spanning multiple organisms and tissues. A thorough comparison of existing batch correction methods using 13 metrics reveals that SCITUNA outperforms current approaches and is successful at preserving biological signals present in the original data. In particular, SCITUNA shows a better performance than the current methods in all the comparisons except for the multiple batch integration of the lung dataset where the difference is 0.004.</p><p><strong>Conclusion: </strong>SCITUNA effectively removes batch effects while retaining the biological signals present in the data. 
Our extensive experiments reveal that SCITUNA will be a valuable tool for diverse integration tasks.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"92"},"PeriodicalIF":2.9,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951583/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143728349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DMoVGPE: predicting gut microbial associated metabolites profiles with deep mixture of variational Gaussian Process experts.","authors":"Qinghui Weng, Mingyi Hu, Guohao Peng, Jinlin Zhu","doi":"10.1186/s12859-025-06110-7","DOIUrl":"10.1186/s12859-025-06110-7","url":null,"abstract":"<p><strong>Background: </strong>Understanding the metabolic activities of the gut microbiome is vital for deciphering its impact on human health. While direct measurement of these metabolites through metabolomics is effective, it is often expensive and time-consuming. In contrast, microbial composition data obtained through sequencing is more accessible, making it a promising resource for predicting metabolite profiles. However, current computational models frequently face challenges related to limited prediction accuracy, generalizability, and interpretability.</p><p><strong>Method: </strong>Here, we present the Deep Mixture of Variational Gaussian Process Experts (DMoVGPE) model, designed to overcome these issues. DMoVGPE utilizes a dynamic gating mechanism, implemented through a neural network with fully connected layers and dropout for regularization, to select the most relevant Gaussian Process experts. During training, the gating network refines expert selection, dynamically adjusting their contribution based on the input features. The model also incorporates an Automatic Relevance Determination (ARD) mechanism, which assigns relevance scores to microbial features by evaluating their predictive power. Features linked to metabolite profiles are given smaller length scales to increase their influence, while irrelevant features are down-weighted through larger length scales, improving both prediction accuracy and interpretability.</p><p><strong>Conclusions: </strong>Through extensive evaluations on various datasets, DMoVGPE consistently achieves higher prediction performance than existing models. 
Furthermore, our model reveals significant associations between specific microbial taxa and metabolites, aligning well with findings from existing studies. These results highlight DMoVGPE's potential to provide accurate predictions and to uncover biologically meaningful relationships, paving the way for its application in disease research and personalized healthcare strategies.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"93"},"PeriodicalIF":2.9,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951675/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143728347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Methylmap: visualization of modified nucleotides for large cohort sizes.","authors":"Elise Coopman, Svenn D'Hert, Rosa Rademakers, Wouter De Coster","doi":"10.1186/s12859-025-06106-3","DOIUrl":"10.1186/s12859-025-06106-3","url":null,"abstract":"<p><strong>Background: </strong>Over the years, there has been growing interest in epigenetics, where nucleotide modifications are increasingly recognized for their roles in health and disease. Understanding methylation patterns at the nucleotide level has become pivotal for advancing this field. However, visualizing these modifications, particularly in cohorts of more than a few individuals, remains a challenge.</p><p><strong>Results: </strong>Here, we present methylmap, a tool developed to visualize modified nucleotide frequencies for regions of interest, specifically optimized for cohort sizes with more than a few individuals. Furthermore, methylmap features the visualization of the haplotype-specific methylation status of 226 individuals of the 1000 Genomes Project ONT Sequencing Consortium, sequenced using the Oxford Nanopore Technologies PromethION. This resource provides the research community with a comprehensive and complete overview of genome-wide methylation patterns.</p><p><strong>Conclusions: </strong>Methylmap offers an easy-to-use platform to facilitate epigenetic research. It is available both as a web application at https://methylmap.bioinf.be and as a command-line tool through Bioconda and PyPI. 
As such, we provide a valuable resource for advancing the understanding of epigenetic modifications in health and disease.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"91"},"PeriodicalIF":2.9,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11948879/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143717904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}