{"title":"GenomeDecoder: inferring segmental duplications in highly repetitive genomic regions.","authors":"Zhenmiao Zhang, Ishaan Gupta, Pavel A Pevzner","doi":"10.1093/bioinformatics/btaf058","DOIUrl":"10.1093/bioinformatics/btaf058","url":null,"abstract":"<p><strong>Motivation: </strong>The emergence of 'telomere-to-telomere' genomics brought the challenge of identifying segmental duplications (SDs) in complete genomes. It further opened the possibility of identifying differences in SDs across individual human genomes and studying SD evolution. These newly emerged challenges require algorithms for reconstructing SDs in the most complex genomic regions that evaded all previous attempts to analyze their architecture, such as rapidly evolving immunoglobulin loci.</p><p><strong>Results: </strong>We describe the GenomeDecoder algorithm for inferring SDs and apply it to analyze the genomic architectures of various loci in primate genomes. Our analysis revealed that multiple duplications/deletions led to a rapid birth/death of immunoglobulin genes within the human population and large changes in the genomic architecture of immunoglobulin loci across primate genomes. Comparison of immunoglobulin loci across primate genomes suggests that they are subject to diversifying selection.</p><p><strong>Availability and implementation: </strong>GenomeDecoder is available at https://github.com/ZhangZhenmiao/GenomeDecoder. 
The software version and test data used in this paper are uploaded to https://doi.org/10.5281/zenodo.14753844.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11842051/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143257344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VirDetector: a bioinformatic pipeline for virus surveillance using nanopore sequencing.","authors":"Nick Laurenz Kaiser, Martin H Groschup, Balal Sadeghi","doi":"10.1093/bioinformatics/btaf029","DOIUrl":"10.1093/bioinformatics/btaf029","url":null,"abstract":"<p><strong>Summary: </strong>Virus surveillance programmes are designed to counter the growing threat of viral outbreaks to human health. Nanopore sequencing, in particular, has proven to be suitable for this purpose, as it is readily available and provides rapid results. However, as special bioinformatic programs are required to extract the relevant information from the sequencing data, applications are needed that allow users without extensive bioinformatics knowledge to carry out the relevant analysis steps. We present VirDetector, a bioinformatic pipeline for virus surveillance using nanopore sequencing. The pipeline automatically installs all required programs and databases and allows all its steps to be executed with a single console command. After preprocessing the samples, including the possibility for basecalling, the pipeline classifies each sample taxonomically and reconstructs the viral consensus genomes, which are then used in phylogenetic analyses. 
This streamlined workflow provides a user-friendly and efficient solution for monitoring viral pathogens.</p><p><strong>Availability and implementation: </strong>VirDetector is freely available at https://github.com/NLKaiser/VirDetector and https://zenodo.org/records/14637302 (10.5281/zenodo.14637302).</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11802467/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143017325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CRAmed: a conditional randomization test for high-dimensional mediation analysis in sparse microbiome data.","authors":"Tiantian Liu, Xiangnan Xu, Tao Wang, Peirong Xu","doi":"10.1093/bioinformatics/btaf038","DOIUrl":"10.1093/bioinformatics/btaf038","url":null,"abstract":"<p><strong>Motivation: </strong>Numerous microbiome studies have revealed significant associations between the microbiome and human health and disease. These findings have motivated researchers to explore the causal role of the microbiome in human complex traits and diseases. However, the complexities of microbiome data pose challenges for statistical analysis and interpretation of causal effects.</p><p><strong>Results: </strong>We introduced a novel statistical framework, CRAmed, for inferring the mediating role of the microbiome between treatment and outcome. CRAmed improved the interpretability of the mediation analysis by decomposing the natural indirect effect into two parts, corresponding to the presence-absence and abundance of a microbe, respectively. Comprehensive simulations demonstrated the superior performance of CRAmed in recall, precision, and F1 score, with a notable level of robustness, compared to existing mediation analysis methods. Furthermore, two real data applications illustrated the effectiveness and interpretability of CRAmed. 
Our research revealed that CRAmed holds promise for uncovering the mediating role of the microbiome and understanding the factors influencing host health.</p><p><strong>Availability and implementation: </strong>The R package CRAmed implementing the proposed methods is available online at https://github.com/liudoubletian/CRAmed.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11821267/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143070110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PhyloMix: enhancing microbiome-trait association prediction through phylogeny-mixing augmentation.","authors":"Yifan Jiang, Disen Liao, Qiyun Zhu, Yang Young Lu","doi":"10.1093/bioinformatics/btaf014","DOIUrl":"10.1093/bioinformatics/btaf014","url":null,"abstract":"<p><strong>Motivation: </strong>Understanding the associations between traits and microbial composition is a fundamental objective in microbiome research. Recently, researchers have turned to machine learning (ML) models to achieve this goal with promising results. However, the effectiveness of advanced ML models is often limited by the unique characteristics of microbiome data, which are typically high-dimensional, compositional, and imbalanced. These characteristics can hinder the models' ability to fully explore the relationships among taxa in predictive analyses. To address this challenge, data augmentation has become crucial. It involves generating synthetic samples with artificial labels based on existing data and incorporating these samples into the training set to improve ML model performance.</p><p><strong>Results: </strong>Here, we propose PhyloMix, a novel data augmentation method specifically designed for microbiome data to enhance predictive analyses. PhyloMix leverages the phylogenetic relationships among microbiome taxa as an informative prior to guide the generation of synthetic microbial samples. Leveraging phylogeny, PhyloMix creates new samples by removing a subtree from one sample and combining it with the corresponding subtree from another sample. Notably, PhyloMix is designed to address the compositional nature of microbiome data, effectively handling both raw counts and relative abundances. This approach introduces sufficient diversity into the augmented samples, leading to improved predictive performance. We empirically evaluated PhyloMix on six real microbiome datasets across five commonly used ML models. 
PhyloMix significantly outperforms distinct baseline methods including sample-mixing-based data augmentation techniques like vanilla mixup and compositional cutmix, as well as the phylogeny-based method TADA. We also demonstrated the wide applicability of PhyloMix in both supervised learning and contrastive representation learning.</p><p><strong>Availability and implementation: </strong>The Apache-licensed source code is available at (https://github.com/batmen-lab/phylomix).</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11849959/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142973830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MEGA-GO: functions prediction of diverse protein sequence length using Multi-scalE Graph Adaptive neural network.","authors":"Yujian Lee, Peng Gao, Yongqi Xu, Ziyang Wang, Shuaicheng Li, Jiaxing Chen","doi":"10.1093/bioinformatics/btaf032","DOIUrl":"10.1093/bioinformatics/btaf032","url":null,"abstract":"<p><strong>Motivation: </strong>The increasing accessibility of large-scale protein sequences through advanced sequencing technologies has necessitated the development of efficient and accurate methods for predicting protein function. Computational prediction models have emerged as a promising solution to expedite the annotation process. However, despite making significant progress in protein research, graph neural networks face challenges in capturing long-range structural correlations and identifying critical residues in protein graphs. Furthermore, existing models have limitations in effectively predicting the function of newly sequenced proteins that are not included in protein interaction networks. This highlights the need for novel approaches integrating protein structure and sequence data.</p><p><strong>Results: </strong>We introduce Multi-scalE Graph Adaptive neural network (MEGA-GO), highlighting the capability of capturing diverse protein sequence length features from multiple scales. The unique graph adaptive neural network architecture of MEGA-GO enables a more nuanced extraction of graph structure features, effectively capturing intricate relationships within biological data. Experimental results demonstrate that MEGA-GO outperforms mainstream protein function prediction models in the accuracy of Gene Ontology term classification, yielding 33.4%, 68.9%, and 44.6% of area under the precision-recall curve on biological process, molecular function, and cellular component domains, respectively. 
The rest of the experimental results reveal that our model consistently surpasses the state-of-the-art methods.</p><p><strong>Availability and implementation: </strong>The source code and data of MEGA-GO are available at https://github.com/Cheliosoops/MEGA-GO.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11810639/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143030375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multidimensional scaling improves distance-based clustering for microbiome data.","authors":"Guanhua Chen, Xinyue Wang, Qiang Sun, Zheng-Zheng Tang","doi":"10.1093/bioinformatics/btaf042","DOIUrl":"10.1093/bioinformatics/btaf042","url":null,"abstract":"<p><strong>Motivation: </strong>Clustering patients into subgroups based on their microbial compositions can greatly enhance our understanding of the role of microbes in human health and disease etiology. Distance-based clustering methods, such as partitioning around medoids (PAM), are popular due to their computational efficiency and absence of distributional assumptions. However, the performance of these methods can be suboptimal when true cluster memberships are driven by differences in the abundance of only a few microbes, a situation known as the sparse signal scenario.</p><p><strong>Results: </strong>We demonstrate that classical multidimensional scaling (MDS), a widely used dimensionality reduction technique, effectively denoises microbiome data and enhances the clustering performance of distance-based methods. We propose a two-step procedure that first applies MDS to project high-dimensional microbiome data into a low-dimensional space, followed by distance-based clustering using the low-dimensional data. Our extensive simulations demonstrate that our procedure offers superior performance compared to directly conducting distance-based clustering under the sparse signal scenario. 
The advantage of our procedure is further showcased in several real data applications.</p><p><strong>Availability and implementation: </strong>The R package MDSMClust is available at https://github.com/wxy929/MDS-project.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11814494/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143061508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trajectory Inference with Cell-Cell Interactions (TICCI): intercellular communication improves the accuracy of trajectory inference methods.","authors":"Yifeng Fu, Hong Qu, Dacheng Qu, Min Zhao","doi":"10.1093/bioinformatics/btaf027","DOIUrl":"10.1093/bioinformatics/btaf027","url":null,"abstract":"<p><strong>Motivation: </strong>Understanding cell differentiation and development dynamics is key for single-cell transcriptome analysis. Current cell differentiation trajectory inference algorithms face challenges such as high dimensionality, noise, and a need for users to possess certain biological information about the datasets to effectively utilize the algorithms. Here, we introduce Trajectory Inference with Cell-Cell Interaction (TICCI), a novel way to address these challenges by integrating intercellular communication information. In recognizing crucial intercellular communication during development, TICCI proposes Cell-Cell Interactions (CCI) at single-cell resolution. We posit that cells exhibiting higher gene expression similarity patterns are more likely to exchange information via biomolecular mediators.</p><p><strong>Results: </strong>TICCI is initiated by constructing a cell-neighborhood matrix using edge weights composed of intercellular similarity and CCI information. Louvain partitioning identifies trajectory branches, attenuating noise, while single-cell entropy (scEntropy) is used to assess differentiation status. The Chu-Liu algorithm constructs a directed least-square model to identify trajectory branches, and an improved diffusion fitted time algorithm computes cell-fitted time in nonconnected topologies. TICCI validation on single-cell RNA sequencing (scRNA-seq) datasets confirms the accuracy of cell trajectories, aligning with genealogical branching and gene markers. Verification using extrinsic information labels demonstrates CCI information utility in enhancing accurate trajectory inference. 
A comparative analysis establishes TICCI proficiency in accurate temporal ordering.</p><p><strong>Availability and implementation: </strong>Source code and binaries are freely available for download at https://github.com/mine41/TICCI, implemented in R (version 4.32) and Python (version 3.7.16) and supported on MS Windows. The authors ensure that the software is available for a full two years following publication.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11829803/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143082557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ESPClust: unsupervised identification of modifiers for the effect size profile in omics association studies.","authors":"Francisco J Pérez-Reche, Nathan J Cheetham, Ruth C E Bowyer, Ellen J Thompson, Francesca Tettamanzi, Cristina Menni, Claire J Steves","doi":"10.1093/bioinformatics/btaf065","DOIUrl":"10.1093/bioinformatics/btaf065","url":null,"abstract":"<p><strong>Motivation: </strong>High-throughput omics technologies have revolutionized the identification of associations between individual traits and underlying biological characteristics, but still use 'one effect-size fits all' approaches. While covariates are often used, their potential as effect modifiers often remains unexplored.</p><p><strong>Results: </strong>We propose ESPClust, a novel unsupervised method designed to identify covariates that modify the effect size of associations between sets of omics variables and outcomes. By extending the concept of moderators to encompass multiple exposures, ESPClust analyses the effect size profile (ESP) to identify regions in covariate space with different ESP, enabling the discovery of subpopulations with distinct associations. Applying ESPClust to synthetic data, insulin resistance and COVID-19 symptom manifestation, we demonstrate its versatility and ability to uncover nuanced effect size modifications that traditional analyses may overlook. By integrating information from multiple exposures, ESPClust identifies effect size modifiers in datasets that are too small for traditional univariate stratified analyses. This method provides a robust framework for understanding complex omics data and holds promise for personalised medicine.</p><p><strong>Availability and implementation: </strong>The source code ESPClust is available at https://github.com/fjpreche/ESPClust.git. 
It can be installed via Python package repositories as 'pip install ESPClust==1.1.0'.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11879214/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143367080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ipd: an R package for conducting inference on predicted data.","authors":"Stephen Salerno, Jiacheng Miao, Awan Afiaz, Kentaro Hoffman, Anna Neufeld, Qiongshi Lu, Tyler H McCormick, Jeffrey T Leek","doi":"10.1093/bioinformatics/btaf055","DOIUrl":"10.1093/bioinformatics/btaf055","url":null,"abstract":"<p><strong>Summary: </strong>ipd is an open-source R software package for the downstream modeling of an outcome and its associated features where a potentially sizable portion of the outcome data has been imputed by an artificial intelligence or machine learning prediction algorithm. The package implements several recently proposed methods for inference on predicted data with a single, user-friendly wrapper function, ipd. The package also provides custom print, summary, tidy, glance, and augment methods to facilitate easy model inspection. This document introduces the ipd software package and provides a demonstration of its basic usage.</p><p><strong>Availability: </strong>ipd is freely available on CRAN or as a developer version at our GitHub page: github.com/ipd-tools/ipd. Full documentation, including detailed instructions and a usage 'vignette', is available at github.com/ipd-tools/ipd.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11842045/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143082492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced O-glycosylation site prediction using explainable machine learning technique with spatial local environment.","authors":"Seokyoung Hong, Krishna Gopal Chattaraj, Jing Guo, Bernhardt L Trout, Richard D Braatz","doi":"10.1093/bioinformatics/btaf034","DOIUrl":"10.1093/bioinformatics/btaf034","url":null,"abstract":"<p><strong>Motivation: </strong>The accurate prediction of O-GlcNAcylation sites is crucial for understanding disease mechanisms and developing effective treatments. Previous machine learning (ML) models primarily relied on primary or secondary protein structural and related properties, which have limitations in capturing the spatial interactions of neighboring amino acids. This study introduces local environmental features as a novel approach that incorporates three-dimensional spatial information, significantly improving model performance by considering the spatial context around the target site. Additionally, we utilize sparse recurrent neural networks to effectively capture the sequential nature of proteins and to identify key factors influencing O-GlcNAcylation as an explainable ML model.</p><p><strong>Results: </strong>Our findings demonstrate the effectiveness of our proposed features with the model achieving an F1 score of 28.3%, as well as feature selection capability with the model using only the top 20% of features achieving the highest F1 score of 32.02%, a 1.4-fold improvement over existing PTM models. Statistical analysis of the top 20 features confirmed their consistency with the literature. 
This method not only boosts prediction accuracy but also paves the way for further research in understanding and targeting O-GlcNAcylation.</p><p><strong>Availability and implementation: </strong>The entire code, data, and features used in this study are available in the GitHub repository: https://github.com/pseokyoung/o-glcnac-prediction.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11814488/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143061569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}