Bioinformatics advances | Pub Date: 2025-03-20 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf044
Mihai Pop, Teresa K Attwood, Judith A Blake, Philip E Bourne, Ana Conesa, Terry Gaasterland, Lawrence Hunter, Carl Kingsford, Oliver Kohlbacher, Thomas Lengauer, Scott Markel, Yves Moreau, William S Noble, Christine Orengo, B F Francis Ouellette, Laxmi Parida, Natasa Przulj, Teresa M Przytycka, Shoba Ranganathan, Russell Schwartz, Alfonso Valencia, Tandy Warnow
{"title":"Biological databases in the age of generative artificial intelligence.","authors":"Mihai Pop, Teresa K Attwood, Judith A Blake, Philip E Bourne, Ana Conesa, Terry Gaasterland, Lawrence Hunter, Carl Kingsford, Oliver Kohlbacher, Thomas Lengauer, Scott Markel, Yves Moreau, William S Noble, Christine Orengo, B F Francis Ouellette, Laxmi Parida, Natasa Przulj, Teresa M Przytycka, Shoba Ranganathan, Russell Schwartz, Alfonso Valencia, Tandy Warnow","doi":"10.1093/bioadv/vbaf044","DOIUrl":"10.1093/bioadv/vbaf044","url":null,"abstract":"<p><strong>Summary: </strong>Modern biological research critically depends on public databases. The introduction and propagation of errors within and across databases can lead to wasted resources as scientists are led astray by bad data or have to conduct expensive validation experiments. The emergence of generative artificial intelligence systems threatens to compound this problem owing to the ease with which massive volumes of synthetic data can be generated. We provide an overview of several key issues that occur within the biological data ecosystem and make several recommendations aimed at reducing data errors and their propagation. We specifically highlight the critical importance of improved educational programs aimed at biologists and life scientists that emphasize best practices in data engineering. We also argue for increased theoretical and empirical research on data provenance, error propagation, and on understanding the impact of errors on analytic pipelines. 
Furthermore, we recommend enhanced funding for the stewardship and maintenance of public biological databases.</p><p><strong>Availability and implementation: </strong>Not applicable.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf044"},"PeriodicalIF":2.4,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11964588/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143775073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"S2Map: a novel computational platform for identifying secretio-types through cell secretion-signal map.","authors":"Zongliang Yue, Lang Zhou, Peizhen Sun, Xuejia Kang, Fengyuan Huang, Pengyu Chen","doi":"10.1093/bioadv/vbaf059","DOIUrl":"10.1093/bioadv/vbaf059","url":null,"abstract":"<p><strong>Motivation: </strong>Cell communication is predominantly governed by secreted proteins, whose diverse secretion patterns often signify underlying physiological irregularities. Understanding these secreted signals at an individual cell level is crucial for gaining insights into regulatory mechanisms involving various molecular agents. To elucidate the array of cell secretion signals, which encompass different types of biomolecular secretion cues from individual immune cells, we introduce the secretion-signal map (S2Map).</p><p><strong>Results: </strong>S2Map is an online interactive analytical platform designed to explore and interpret distinct cell secretion-signal patterns visually. It incorporates two innovative qualitative metrics, the signal inequality index and the signal coverage index, which are exquisitely sensitive in measuring dissymmetry and diffusion of signals in temporal data. S2Map's innovation lies in its depiction of signals through time-series analysis with multi-layer visualization. We tested the SII and SCI performance in distinguishing the simulated signal diffusion models. S2Map hosts a repository for the single-cell's secretion-signal data for exploring cell secretio-types, a new cell phenotyping based on the cell secretion signal pattern. 
We anticipate that S2Map will be a powerful tool to delve into the complexities of physiological systems, providing insights into the regulation of protein production, such as cytokines at the remarkable resolution of single cells.</p><p><strong>Availability and implementation: </strong>The S2Map server is publicly accessible via https://au-s2map.streamlit.app/.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf059"},"PeriodicalIF":2.4,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11972122/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143797199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2025-03-17 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf055
Jackie Rao, Paul D W Kirk
{"title":"VICatMix: variational Bayesian clustering and variable selection for discrete biomedical data.","authors":"Jackie Rao, Paul D W Kirk","doi":"10.1093/bioadv/vbaf055","DOIUrl":"10.1093/bioadv/vbaf055","url":null,"abstract":"<p><strong>Summary: </strong>Effective clustering of biomedical data is crucial in precision medicine, enabling accurate stratification of patients or samples. However, the growth in availability of high-dimensional categorical data, including 'omics data, necessitates computationally efficient clustering algorithms. We present VICatMix, a variational Bayesian finite mixture model designed for the clustering of categorical data. The use of variational inference (VI) in its training allows the model to outperform competitors in terms of computational time and scalability, while maintaining high accuracy. VICatMix furthermore performs variable selection, enhancing its performance on high-dimensional, noisy data. The proposed model incorporates summarization and model averaging to mitigate poor local optima in VI, allowing for improved estimation of the true number of clusters simultaneously with feature saliency. We demonstrate the performance of VICatMix with both simulated and real-world data, including applications to datasets from The Cancer Genome Atlas, showing its use in cancer subtyping and driver gene discovery. 
We demonstrate VICatMix's potential utility in integrative cluster analysis with different 'omics datasets, enabling the discovery of novel disease subtypes.</p><p><strong>Availability and implementation: </strong>VICatMix is freely available as an R package via CRAN, incorporating C++ for faster computation, at https://CRAN.R-project.org/package=VICatMix.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf055"},"PeriodicalIF":2.4,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11981716/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144036621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2025-03-14 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf057
Bernardo Aguzzoli Heberle, Madeline L Page, Emil K Gustavsson, Mina Ryten, Mark T W Ebbert
{"title":"RNApysoforms: fast rendering interactive visualization of RNA isoform structure and expression in Python.","authors":"Bernardo Aguzzoli Heberle, Madeline L Page, Emil K Gustavsson, Mina Ryten, Mark T W Ebbert","doi":"10.1093/bioadv/vbaf057","DOIUrl":"10.1093/bioadv/vbaf057","url":null,"abstract":"<p><strong>Summary: </strong>Alternative splicing generates multiple RNA isoforms from a single gene, enriching genetic diversity and impacting gene function. Effective visualization of these isoforms and their expression patterns is crucial but challenging due to limitations in existing tools. Traditional genome browsers lack programmability, while other tools offer limited customization, produce static plots, or cannot simultaneously display structures and expression levels. RNApysoforms was developed to address these gaps by providing a Python-based package that enables concurrent visualization of RNA isoform structures and expression data. Leveraging plotly and polars libraries, it offers an interactive, customizable, and faster-rendering framework suitable for web applications, enhancing the analysis and dissemination of RNA isoform research.</p><p><strong>Availability and implementation: </strong>RNApysoforms is a Python package available at (https://github.com/UK-SBCoA-EbbertLab/RNApysoforms) and (https://zenodo.org/records/14941190) via an open-source MIT license. It can be easily installed using the pip package installer for Python. 
Thorough documentation and usage vignettes are available at: https://rna-pysoforms.readthedocs.io/en/latest/.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf057"},"PeriodicalIF":2.4,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11964586/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143775074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2025-03-14 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf056
Gilberto P Pereira, Corentin Gouzien, Paulo C T Souza, Juliette Martin
{"title":"Challenges in predicting PROTAC-mediated protein-protein interfaces with AlphaFold reveal a general limitation on small interfaces.","authors":"Gilberto P Pereira, Corentin Gouzien, Paulo C T Souza, Juliette Martin","doi":"10.1093/bioadv/vbaf056","DOIUrl":"10.1093/bioadv/vbaf056","url":null,"abstract":"<p><strong>Motivation: </strong>Proteolysis Targeting Chimeras (PROTACs) are heterobifunctional molecules composed of ligands binding to a target protein and an E3-ligase complex, connected by a linker, that induce proximity-based target protein degradation. PROTACs are promising alternatives to conventional drugs against cancer. Predicting PROTAC-mediated complexes is often the first step for <i>in silico</i> PROTAC design pipelines. We previously noted that AlphaFold2 (AF2) fails to predict PROTAC-mediated complexes.</p><p><strong>Results: </strong>Here, we investigate the potential causes of this limitation. We consider a set of 326 protein heterodimers orthogonal to the AF2 training set, and evaluate AF2 models focusing on the interface size and presence of interface ligand. Our results show that AF2-multimer predictions are sensitive to the size of the interface to be predicted, even in the absence of ligands, with the majority of models being incorrect for the smallest interfaces. We also benchmark both AF2 and AF3 on a set of 28 PROTAC-mediated dimers and show that AF3 does not significantly improve upon the accuracy of AF2. 
The low accuracy of AF2 on complexes with small interfaces has strong implications for computational pipelines for PROTAC design, as these stabilize typically small interfaces, and more generally on any prediction task that involves small interfaces.</p><p><strong>Availability and implementation: </strong>All the models analyzed in this article are available in the Zenodo archive https://zenodo.org/records/14810843.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf056"},"PeriodicalIF":2.4,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11938821/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143722845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2025-03-13 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf052
Myriam Brossard, Delnaz Roshandel, Kexin Luo, Fatemeh Yavartanoo, Andrew D Paterson, Yun J Yoo, Shelley B Bull
{"title":"RegionScan: a comprehensive R package for region-level genome-wide association testing with integration and visualization of multiple-variant and single-variant hypothesis testing.","authors":"Myriam Brossard, Delnaz Roshandel, Kexin Luo, Fatemeh Yavartanoo, Andrew D Paterson, Yun J Yoo, Shelley B Bull","doi":"10.1093/bioadv/vbaf052","DOIUrl":"10.1093/bioadv/vbaf052","url":null,"abstract":"<p><strong>Summary: </strong>RegionScan is designed for scalable genome-wide association testing of both multiple-variant and single-variant region-level statistics, with visualization of the results. For detection of association under various regional architectures, it implements three classes of state-of-the-art region-level tests, including multiple-variant linear/logistic regression (with and without dimension reduction), a variance-component score test, and region-level min<i>P</i> tests. RegionScan also supports the analysis of multi-allelic variants and unbalanced binary phenotypes and is compatible with widely used variant call format (VCF) files for both genotyped and imputed variants. Association testing leverages linkage disequilibrium (LD) structure in pre-defined regions, for example, LD-adaptive regions obtained by genomic partitioning, and accommodates parallel processing to improve computational and memory efficiency. Detailed outputs (with allele frequencies, variant-LD bin assignment, single/joint variant effect estimates and region-level results) and utility functions are provided to assist comparison, visualization, and interpretation of results. 
Thus, RegionScan analysis offers valuable insights into region-level genetic architecture, which supports a wide range of potential applications.</p><p><strong>Availability and implementation: </strong>RegionScan is freely available for download on GitHub (https://github.com/brossardMyriam/RegionScan).</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf052"},"PeriodicalIF":2.4,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951254/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143756193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2025-03-13 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf054
Jiqing Zhu, Rebecca Y Wang, Xiaoting Wang, Ricardo Azevedo, Alexander Moreno, Julia A Kuhn, Zia Khan
{"title":"Enhancing gene set overrepresentation analysis with large language models.","authors":"Jiqing Zhu, Rebecca Y Wang, Xiaoting Wang, Ricardo Azevedo, Alexander Moreno, Julia A Kuhn, Zia Khan","doi":"10.1093/bioadv/vbaf054","DOIUrl":"10.1093/bioadv/vbaf054","url":null,"abstract":"<p><strong>Motivation: </strong>Gene set overrepresentation analysis (ORA) is widely used to interpret high-throughput transcriptomics and proteomics data, but traditional methods rely on human-curated gene set databases that lack flexibility.</p><p><strong>Results: </strong>We introduce <i>llm2geneset</i>, a framework that leverages large language models (LLMs) to dynamically generate gene set databases tailored to input query genes, such as differentially expressed genes and a biological context specified in natural language. These databases integrate with methods, such as ORA, to assign biological functions to input genes. Benchmarking against human-curated gene sets demonstrates that LLMs generate gene sets comparable in quality to those curated by humans. <i>llm2geneset</i> also identifies biological processes represented by input gene sets, outperforming traditional ORA and direct LLM prompting. 
Applying the framework to RNA-seq data from iPSC-derived microglia treated with a <i>TREM2</i> agonist highlights its potential for flexible, context-aware gene set generation and improved interpretation of high-throughput biological data.</p><p><strong>Availability and implementation: </strong><i>llm2geneset</i> is available as open source at https://github.com/Alector-BIO/llm2geneset and via a web interface at https://llm2geneset.streamlit.app.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf054"},"PeriodicalIF":2.4,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12093311/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144121572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2025-03-13 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf053
Abdur Rafi, Ahmed Mahir Sultan Rumi, Sheikh Azizul Hakim, Sohaib, Md Toki Tahmid, Rabib Jahin Ibn Momin, Tanjeem Azwad Zaman, Rezwana Reaz, Md Shamsuzzoha Bayzid
{"title":"wQFM-TREE: highly accurate and scalable quartet-based species tree inference from gene trees.","authors":"Abdur Rafi, Ahmed Mahir Sultan Rumi, Sheikh Azizul Hakim, Sohaib, Md Toki Tahmid, Rabib Jahin Ibn Momin, Tanjeem Azwad Zaman, Rezwana Reaz, Md Shamsuzzoha Bayzid","doi":"10.1093/bioadv/vbaf053","DOIUrl":"10.1093/bioadv/vbaf053","url":null,"abstract":"<p><strong>Motivation: </strong>Summary methods are becoming increasingly popular for species tree estimation from multi-locus data in the presence of gene tree discordance. Accurate Species TRee Algorithm (ASTRAL), a leading method in this class, solves the Maximum Quartet Support Species Tree problem within a constrained solution space, while heuristics like Weighted Quartet Fiduccia-Mattheyses (wQFM) and Weighted Quartet MaxCut (wQMC) use weighted quartets and a divide-and-conquer strategy. Recent studies showed wQFM to be more accurate than ASTRAL and wQMC, though its scalability is hindered by the computational demands of explicitly generating and weighting <math><mrow><mi>Θ</mi> <mo>(</mo> <mrow> <msup><mrow><mi>n</mi></mrow> <mn>4</mn></msup> </mrow> <mo>)</mo></mrow> </math> quartets. Here, we introduce wQFM-TREE, a novel summary method that enhances wQFM by avoiding explicit quartet generation and weighting, enabling its application to large datasets.</p><p><strong>Results: </strong>Extensive simulations under diverse and challenging model conditions, with hundreds or thousands of taxa and genes, consistently demonstrate that wQFM-TREE matches or improves upon the accuracy of ASTRAL. It outperformed ASTRAL in 25 of 27 model conditions (statistically significant in 20) involving 200-1000 taxa. Moreover, applying wQFM-TREE to re-analyze the green plant dataset from the One Thousand Plant Transcriptomes Initiative produced a tree highly congruent with established evolutionary relationships of plants. wQFM-TREE's remarkable accuracy and scalability make it a strong competitor to leading methods. 
Its algorithmic and combinatorial innovations also enhance quartet-based computations, advancing phylogenetic estimation.</p><p><strong>Availability and implementation: </strong>wQFM-TREE is freely available in open source form at https://github.com/abdur-rafi/wQFM-TREE.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf053"},"PeriodicalIF":2.4,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11932941/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143712300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2025-03-11 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf050
Ankita Pal, Debasisa Mohanty
{"title":"Machine learning-based approach for identification of new resistance associated mutations from whole genome sequences of <i>Mycobacterium tuberculosis</i>.","authors":"Ankita Pal, Debasisa Mohanty","doi":"10.1093/bioadv/vbaf050","DOIUrl":"10.1093/bioadv/vbaf050","url":null,"abstract":"<p><strong>Motivation: </strong>Currently available methods for the prediction of genotypic drug resistance in <i>Mycobacterium tuberculosis</i> utilize information on known markers of drug resistance. Hence, machine learning approaches are needed that can discover new resistance markers.</p><p><strong>Results: </strong>Whole genome sequences with known phenotypic drug resistance profiles have been utilized to train XGBoost and ANN classifiers for 5 first-line and 8 second-line tuberculosis drugs. Benchmarking on a completely independent dataset from CRyPTIC database revealed that our method has high sensitivity (90%-95%) and specificity (94%-99%) for five first-line drugs and robust performance for six second-line drugs with a sensitivity of 77%-89% at over 95% specificity. An explainable AI method, SHapley Additive exPlanations, has successfully identified resistance mutations for each drug in a completely automated way. This approach could not only identify known resistance associated mutations in agreement with the WHO mutation catalogue, but also predicted >100 other potential resistance associated mutations for 13 antibiotics in new genes outside the known resistance loci. 
Identification of new resistance markers opens up the opportunity for the discovery of novel mechanisms of drug resistance.</p><p><strong>Availability and implementation: </strong>Our prediction method has been implemented as TB-AMRpred webserver and command line tool, available freely at http://www.nii.ac.in/TB-AMRpred.html and https://github.com/Ankitapal1995/TB-AMRprd.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf050"},"PeriodicalIF":2.4,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11930343/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143694157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bioinformatics advances | Pub Date: 2025-03-11 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf045
Asahi Adachi, Fan Zhang, Shigehiko Kanaya, Naoaki Ono
{"title":"Quantifying uncertainty in microbiome-based prediction using Gaussian processes with microbial community dissimilarities.","authors":"Asahi Adachi, Fan Zhang, Shigehiko Kanaya, Naoaki Ono","doi":"10.1093/bioadv/vbaf045","DOIUrl":"10.1093/bioadv/vbaf045","url":null,"abstract":"<p><strong>Summary: </strong>The human microbiome is closely associated with the health and disease of the human host. Machine learning models have recently utilized the human microbiome to predict health conditions and disease status. Quantifying predictive uncertainty is essential for the reliable application of these microbiome-based prediction models in clinical settings. However, uncertainty quantification in such prediction models remains unexplored. In this study, we have developed a probabilistic prediction model using a Gaussian process (GP) with a kernel function that incorporates microbial community dissimilarities. We evaluated the performance of probabilistic prediction across three regression tasks: chronological age, body mass index, and disease severity, using publicly available human gut microbiome datasets. The results demonstrated that our model outperformed existing methods in terms of probabilistic prediction accuracy. Furthermore, we found that the confidence levels closely matched the empirical coverage and that data points predicted with lower uncertainty corresponded to lower prediction errors. These findings suggest that GP regression models incorporating community dissimilarities effectively capture the characteristics of phylogenetic, high-dimensional, and sparse microbial abundance data. 
Our study provides a more reliable framework for microbiome-based prediction, potentially advancing the application of microbiome data in health monitoring and disease diagnosis in clinical settings.</p><p><strong>Availability and implementation: </strong>The code is available at https://github.com/asahiadachi/gp4microbiome.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf045"},"PeriodicalIF":2.4,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11919817/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143665536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}