{"title":"Towards the genome-scale discovery of bivariate monotonic classifiers.","authors":"Océane Fourquet, Martin S Krejca, Carola Doerr, Benno Schwikowski","doi":"10.1186/s12859-025-06253-7","DOIUrl":"10.1186/s12859-025-06253-7","url":null,"abstract":"<p><strong>Background: </strong>Bivariate monotonic classifiers (BMCs) are based on pairs of input features. Like many other models used for machine learning, they can capture nonlinear patterns in high-dimensional data. At the same time, they are simple and easy to interpret. Until now, the use of BMCs on a genome scale was hampered by the high computational complexity of the search for pairs of features with a high leave-one-out performance estimate.</p><p><strong>Results: </strong>We introduce the fastBMC algorithm, which drastically speeds up the identification of BMCs. The algorithm is based on a mathematical bound for the BMC performance estimate while maintaining optimality. We show empirically that fastBMC speeds up the computation by a factor of at least 15 already for a small number of features, compared to the traditional approach. For two of the three smaller biomedical datasets that we consider here, the resulting possibility of considering much larger sets of features translates into significantly improved classification performance. As an example of the high degree of interpretability of BMCs, we discuss a straightforward interpretation of a BMC glioblastoma survival predictor, an immediate novel biomedical hypothesis, options for biomedical validation, and treatment implications. 
In addition, we study the performance of fastBMC on a larger and well-known breast cancer dataset, validating the benefits of the BMCs for biomarker identification and biomedical hypothesis generation.</p><p><strong>Conclusion: </strong>fastBMC enables the rapid construction of robust and interpretable ensemble models using BMC, facilitating the discovery of gene pairs predictive of relevant phenotypes and their interaction in that context.</p><p><strong>Availability: </strong>We provide the first open-source implementation for learning BMCs, a Python implementation of fastBMC in particular, and Python code to reproduce the fastBMC results on real and simulated data in this paper, at https://github.com/oceanefrqt/fastBMC .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"228"},"PeriodicalIF":3.3,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12403431/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SANS ambages: phylogenomics with abundance-filter, multi-threading, and bootstrapping on amino-acid or genomic sequences.","authors":"Fabian Kolesch, Marco Sohn, Andreas Rempel, Pia Hippel, Roland Wittler","doi":"10.1186/s12859-025-06204-2","DOIUrl":"10.1186/s12859-025-06204-2","url":null,"abstract":"<p><strong>Background: </strong>The increasing amount of available genome sequence data enables large-scale comparative studies. A common task is the inference of phylogenies, which is challenging if close reference sequences are not available, genome sequences are incompletely assembled, or the high number of genomes precludes multiple sequence alignment in reasonable time. SANS is an alignment-free, whole-genome based approach for phylogeny estimation.</p><p><strong>Results: </strong>Here we present a new implementation, SANS ambages, with a significantly increased application spectrum. It offers additional types of input data, parallelized processing, and bootstrapping. The source code (C++), documentation, and example data are freely available for download at: https://github.com/gi-bielefeld/sans . SANS can also be launched via the web-interface of the CloWM platform, free of charge, with a standard Life Science account: https://clowm.bi.denbi.de/workflows/0194b78f-9696-7402-a2b8-858508733618/ .</p><p><strong>Conclusions: </strong>The new version immensely shortens processing time on large datasets through parallelization. Support for amino acid sequences and a filter for low-abundance DNA read segments also enable new application cases. 
Bootstrapping and integrated visualization ease and enrich the interpretation of the resulting phylogenies.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"227"},"PeriodicalIF":3.3,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12403963/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reduction of supervision for biomedical knowledge discovery.","authors":"Christos Theodoropoulos, Andrei Catalin Coman, James Henderson, Marie-Francine Moens","doi":"10.1186/s12859-025-06187-0","DOIUrl":"10.1186/s12859-025-06187-0","url":null,"abstract":"<p><strong>Background: </strong>Knowledge discovery in scientific literature is hindered by the increasing volume of publications and the scarcity of extensive annotated data. To tackle the challenge of information overload, it is essential to employ automated methods for knowledge extraction and processing. Finding the right balance between the level of supervision and the effectiveness of models poses a significant challenge. While supervised techniques generally result in better performance, they have the major drawback of demanding labeled data. This requirement is labor-intensive, time-consuming, and hinders scalability when exploring new domains.</p><p><strong>Methods and results: </strong>In this context, our study addresses the challenge of identifying semantic relationships between biomedical entities (e.g., diseases, proteins, medications) in unstructured text while minimizing dependency on supervision. We introduce a suite of unsupervised algorithms based on dependency trees and attention mechanisms and employ a range of pointwise binary classification methods. Transitioning from weakly supervised to fully unsupervised settings, we assess the methods' ability to learn from data with noisy labels. The evaluation on four biomedical benchmark datasets explores the effectiveness of the methods, demonstrating their potential to enable scalable knowledge discovery systems less reliant on annotated datasets.</p><p><strong>Conclusion: </strong>Our approach tackles a central issue in knowledge discovery: balancing performance with minimal supervision which is crucial to adapting models to varied and changing domains. 
This study also investigates the use of pointwise binary classification techniques within a weakly supervised framework for knowledge discovery. By gradually decreasing supervision, we assess the robustness of these techniques in handling noisy labels, revealing their capability to shift from weakly supervised to entirely unsupervised scenarios. Comprehensive benchmarking offers insights into the effectiveness of these techniques, examining how unsupervised methods can reliably capture complex relationships in biomedical texts. These results suggest an encouraging direction toward scalable, adaptable knowledge discovery systems, representing progress in creating data-efficient methodologies for extracting useful insights when annotated data is limited.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"225"},"PeriodicalIF":3.3,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12403602/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weighted overlapping group lasso for integrating prior network knowledge into gene set analysis.","authors":"Dan Huang, Geunsu Jo, Kipoong Kim, Hokeun Sun","doi":"10.1186/s12859-025-06170-9","DOIUrl":"10.1186/s12859-025-06170-9","url":null,"abstract":"<p><strong>Background: </strong>Gene set analysis aims to identify gene sets containing differentially expressed genes between two different experimental conditions. A representative example of gene sets is a gene regulatory network where multiple genes are linked with each other for regulation of gene expression. Most statistical methods for gene set analysis were designed to capture group-based association signals, ignoring the genetic network structure. Consequently, they often fail to identify gene sets that contain only a few differentially expressed genes and have sparse association signals.</p><p><strong>Results: </strong>We propose a new computational method to utilize prior network knowledge for gene set analysis. The proposed method essentially incorporates the coefficient estimates of network-based regularization into overlapping group lasso. Network-based regularization can boost association signals among linked genes while overlapping group lasso performs selection of gene sets including differentially expressed genes. In our extensive simulation study, we evaluated the performance of the proposed method and compared it with existing methods. We also applied it to gene expression data of The Cancer Genome Atlas Breast Invasive Carcinoma Collection (TCGA-BRCA). We were able to identify cancer-related pathways that were missed by the existing methods.</p><p><strong>Conclusion: </strong>Overlapping group lasso is a regularization method for group selection allowing overlapping variables. Network-based regularization is a variable selection method utilizing graph information among variables. 
The proposed weighted overlapping group lasso (wOGL) adopts the coefficient estimates of network-based regularization for the weight of overlapping group lasso. Consequently, it can identify gene sets containing differentially expressed genes, utilizing prior network knowledge.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"226"},"PeriodicalIF":3.3,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12403420/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effective primer design for genotype and subtype detection of highly divergent viruses in large scale genome datasets.","authors":"Burak Demiralay, Tolga Can","doi":"10.1186/s12859-025-06251-9","DOIUrl":"10.1186/s12859-025-06251-9","url":null,"abstract":"<p><p>Identification of microorganisms in a biological sample is a crucial step in diagnostics, pathogen screening, biomedical research, evolutionary studies, agriculture, and biological threat assessment. While progress has been made in studying larger organisms, there is a need for an efficient and scalable method that can handle thousands of whole genomes for organisms with high mutation rates and genetic diversity such as single-stranded viruses. In this study, we developed a novel method to identify subsequences for detection of a given species/subspecies in a (meta)genomic sample using the Polymerase Chain Reaction (PCR) method. Species detection in any analysis depends highly on the measurement method, and since thermodynamic interactions are critical in PCR, thermodynamics is the main driving force in the proposed methodology. Our method is parallelized in multiple steps and involves extracting all oligonucleotides from target genomes. We then locate the target sites for each oligonucleotide using the constructed suffix array and local alignment followed by thermodynamic interaction assessment. An important requirement for subspecies identification is to avoid amplifying a non-target set of genomes, and our method addresses this. We applied our method to three highly divergent viruses: (1) Hepatitis C virus (HCV), where the subtypes differ in 31-33% of nucleotide sites on average, (2) Human immunodeficiency virus (HIV), for which 25-35% between-subtype and 15-20% within-subtype variation is observed, and (3) the Dengue virus, whose respective genomes (only DENV 1-4) share 60% sequence identity with each other. 
Using our method, we were able to select oligonucleotides that can identify in silico 99.9% of 1657 HCV genomes, 99.7% of 11,838 HIV genomes, and 95.4% of 4016 Dengue genomes. We also show subspecies identification on genotypes 1-6 of HCV and genotypes 1-4 of the Dengue virus with more than 99.5% true positive and less than 0.05% false positive rate, on average. None of the state-of-the-art methods can produce oligonucleotides with this specificity and sensitivity on highly divergent viral genomes like the ones studied in this article.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"223"},"PeriodicalIF":3.3,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12400757/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepSEA: an alignment-free explainable approach to annotate antimicrobial resistance proteins.","authors":"Tiago Cabral Borelli, Alexandre Rossi Paschoal, Ricardo Roberto da Silva","doi":"10.1186/s12859-025-06256-4","DOIUrl":"10.1186/s12859-025-06256-4","url":null,"abstract":"<p><p>Antimicrobial resistance (AMR) is one of the most concerning modern threats as it places a greater burden on health systems than HIV and malaria combined. Current surveillance strategies for tracking antimicrobial resistance (AMR) rely on genomic comparisons and depend on sequence alignment with strict similarity cutoffs of greater than 95%. Therefore, these methods have high false-negative error rates due to a lack of reference sequences with a representative coverage of AMR protein diversity. Deep learning has been used as an alternative to sequence alignment, as artificial neural networks can extract abstract features from data, thereby limiting the need for sequence comparisons. Here, a convolutional neural network (CNN) was trained to differentiate between antimicrobial resistance proteins and non-resistance proteins, and to annotate them in nine resistance classes. Our model demonstrated higher recall values (> 0.9) than the alignment-based approach for all protein classes tested. Additionally, our CNN architecture allowed us to investigate internal states and explain the model classification regarding protein domain feature importance related to antimicrobial molecule inactivation. 
Finally, we built an open-source bioinformatic tool ( https://github.com/computational-chemical-biology/DeepSEA-project ) that can be used to annotate antimicrobial resistance proteins and provide information on protein domains without sequence alignment.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"224"},"PeriodicalIF":3.3,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12403478/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dotplotic: a lightweight visualization tool for BLAST + alignments and genomic annotations.","authors":"Hideyuki Miyazawa, Toshiyuki Oda","doi":"10.1186/s12859-025-06255-5","DOIUrl":"https://doi.org/10.1186/s12859-025-06255-5","url":null,"abstract":"<p><p>With the development of sequencing technologies, chromosome-level genome assemblies have become increasingly common across various organisms, including non-model species. BLAST + is one of the most widely used bioinformatics tools for computing sequence alignments, offering numerous optimizations for speed and scalability. Dot plots, which visualize the similarity between two sequences, are widely used in biological research. However, while many dot plot-generating programs exist, most rely on their own alignment algorithms, and it is uncommon to visualize external BLAST results directly. Here, we present Dotplotic, a lightweight Perl program that generates dot plot-like visualizations based on BLAST output in tabular format. Dotplotic visualizes each alignment as a line connecting the start and end points of the query and subject sequences, with a gradient color indicating sequence identity. It allows users to overlay annotation data from external files onto the plot. Although command-line-based, Dotplotic is implemented as a single script using only core Perl modules, making it easy to install and run across platforms. The program supports standard input for both BLAST results and annotation files, enabling flexible visualization under various conditions, such as filtering specific alignments or displaying selected genomic features like genes or repeats. 
Dotplotic is an efficient, portable, and easy-to-use visualization tool that enhances the exploration of sequence alignments and serves as a valuable resource for both bioinformatics and broader biological research.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"222"},"PeriodicalIF":3.3,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12392551/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"fuseMLR: an R package for integrative prediction modeling of multi-omics data.","authors":"Césaire J K Fouodo, Marina Bleskina, Silke Szymczak","doi":"10.1186/s12859-025-06248-4","DOIUrl":"https://doi.org/10.1186/s12859-025-06248-4","url":null,"abstract":"","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"221"},"PeriodicalIF":3.3,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12382258/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CalciumNetExploreR: an R package for network analysis of calcium imaging data.","authors":"Simone Lenci, Dirk Sieger","doi":"10.1186/s12859-025-06206-0","DOIUrl":"https://doi.org/10.1186/s12859-025-06206-0","url":null,"abstract":"<p><strong>Background: </strong>Analyzing calcium imaging data to understand complex functional networks can be challenging, often requiring multiple tools, custom scripts, and some coding expertise. To address these challenges, we present CalciumNetExploreR (CNER), an R package designed to streamline and standardize the analysis of time-series data from neuronal populations.</p><p><strong>Results: </strong>CNER integrates essential steps (normalization, binarization, population activity visualization, network construction, degree distribution analysis, principal component analysis, power spectral density evaluation, and event frequency calculations) into a single, cohesive pipeline. This comprehensive approach enables users to efficiently extract and compare network metrics, including clustering coefficients, global efficiency, community structures, and principal component variances. By offering a flexible and customizable framework, CNER simplifies the examination of functional connectivity and network topology, effectively providing the means to characterize a cellular functional network or analogous structures in other modalities.</p><p><strong>Conclusion: </strong>Designed as a user-friendly package, CNER allows both experimental and computational neuroscientists to incorporate robust statistical and graphical analyses into their workflows without extensive coding knowledge. 
By unifying key analytical components into one pipeline, CNER reduces barriers associated with large-scale data analyses, ultimately facilitating deeper insights into the functional organization and dynamic properties of neuronal networks across diverse recording techniques.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"220"},"PeriodicalIF":3.3,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12379452/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GeneSetCluster 2.0: a comprehensive toolset for summarizing and integrating gene-sets analysis.","authors":"Asier Ortega-Legarreta, Alberto Maillo, Daniel Mouzo, Ana Rosa López-Pérez, Lara Kular, Majid Pahlevan Kakhki, Maja Jagodic, Jesper Tegner, Vincenzo Lagani, Ewoud Ewing, David Gomez-Cabrero","doi":"10.1186/s12859-025-06249-3","DOIUrl":"https://doi.org/10.1186/s12859-025-06249-3","url":null,"abstract":"","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"219"},"PeriodicalIF":3.3,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12372222/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}