BMC Bioinformatics: Latest Articles

Accurate human genome analysis with element avidity sequencing.

IF 2.9, Zone 3 (Biology)

BMC Bioinformatics Pub Date : 2025-07-25 DOI: 10.1186/s12859-025-06191-4

Andrew Carroll, Alexey Kolesnikov, Daniel E Cook, Lucas Brambrink, Kelly N Wiseman, Sophie M Billings, Semyon Kruglyak, Bryan R Lajoie, Junhua Zhao, Shawn E Levy, Cory Y McLean, Kishwar Shafin, Maria Nattestad, Pi-Chuan Chang

Abstract:
Background: New sequencing technologies provide options for the scientific community to design studies and build clinical workflows. These options expand user choice, and can enable more accurate, scalable, or affordable workflows depending on the fit between scientist needs and platform capability. However, it is essential to understand the performance of these new technologies for different tasks, especially for capabilities that were not possible or tractable in prior technologies. We investigate the new avidity sequencing technology from Element Biosciences to help the scientific community understand the performance of the options for generating sequencing data.
Results: We show that Element whole genome sequencing achieves higher mapping and variant calling accuracy compared to Illumina sequencing at the same coverage, with larger differences at lower coverages (20-30x). We quantify base error rates of Element reads, finding lower error rates, especially in homopolymer and tandem repeat regions. We use Element's ability to generate paired end sequencing with longer insert sizes than typical short-read sequencing. We show that longer insert sizes result in even higher accuracy, with long insert Element sequencing giving more accurate genome analyses at all coverages.
Conclusions: New options for sequencing technologies can analyze genomes comparably or better than prior standard methods.
Volume 26, issue 1, page 194. Publication type: Journal Article.

Citations: 0

Soft graph clustering for single-cell RNA sequencing data.

IF 2.9, Zone 3 (Biology)

BMC Bioinformatics Pub Date : 2025-07-25 DOI: 10.1186/s12859-025-06231-z

Ping Xu, Pengfei Wang, Zhiyuan Ning, Meng Xiao, Min Wu, Yuanchun Zhou

Abstract:
Background: Clustering analysis is fundamental in single-cell RNA sequencing (scRNA-seq) data analysis for elucidating cellular heterogeneity and diversity. Recent graph-based scRNA-seq clustering methods, particularly graph neural networks (GNNs), have significantly improved in tackling the challenges of high-dimension, high-sparsity, and frequent dropout events that lead to ambiguous cell population boundaries. However, one major challenge for GNN-based methods is their reliance on hard graph constructions derived from similarity matrices. These constructions introduce difficulties when applied to scRNA-seq data due to: (i) the simplification of intercellular relationships into binary edges (0 or 1) by applying thresholds, which restricts the capture of continuous similarity features among cells and leads to significant information loss; (ii) the presence of significant inter-cluster connections within hard graphs, which can confuse GNN methods that rely heavily on graph structures, potentially causing erroneous message propagation and biased clustering outcomes.
Results: To tackle these challenges, we introduce scSGC, a Soft Graph Clustering method for single-cell RNA sequencing data, which aims to more accurately characterize continuous similarities among cells through non-binary edge weights, thereby mitigating the limitations of rigid data structures. The scSGC framework comprises three core components: (i) a zero-inflated negative binomial (ZINB)-based feature autoencoder designed to effectively handle the sparsity and dropout issues in scRNA-seq data; (ii) a dual-channel cut-informed soft graph embedding module, constructed through deep graph-cut information, capturing continuous similarities between cells while preserving the intrinsic data structures of scRNA-seq; and (iii) an optimal transport-based clustering optimization module, achieving optimal delineation of cell populations while maintaining high biological relevance.
Conclusion: By integrating dual-channel cut-informed soft graph representation learning, a ZINB-based feature autoencoder, and optimal transport-driven clustering optimization, scSGC effectively overcomes the challenges associated with traditional hard graph constructions in GNN methods. Extensive experiments across ten datasets demonstrate that scSGC outperforms 13 state-of-the-art clustering models in clustering accuracy, cell type annotation, and computational efficiency. These results highlight its substantial potential to advance scRNA-seq data analysis and deepen our understanding of cellular heterogeneity.
Volume 26, issue 1, page 195. Publication type: Journal Article.
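The contrast between hard and soft graph construction described in this abstract can be illustrated with a minimal sketch. This is not the scSGC implementation: the toy expression vectors, the choice of cosine similarity, the 0.5 threshold, and all function names are assumptions made purely for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two expression vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(cells):
    """Pairwise cell-cell similarity matrix."""
    n = len(cells)
    return [[cosine_similarity(cells[i], cells[j]) for j in range(n)] for i in range(n)]

def hard_graph(sim, threshold):
    """Binary adjacency: an edge is 1 if similarity passes the threshold, else 0."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
            for i in range(n)]

def soft_graph(sim):
    """Weighted adjacency: keep the continuous similarity as the edge weight."""
    n = len(sim)
    return [[sim[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]

# Hypothetical toy data: three cells measured on three genes.
cells = [[5, 0, 1], [4, 0, 2], [0, 3, 0]]
sim = similarity_matrix(cells)
hard = hard_graph(sim, threshold=0.5)  # threshold value is an arbitrary assumption
soft = soft_graph(sim)
```

In the hard graph, the edge between cells 0 and 1 collapses to 1 and every sub-threshold pair collapses to 0; the soft graph retains how similar each pair actually is, which is the information the abstract says thresholding discards.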

Citations: 0

Combining whole genome sequencing and non-adaptive group testing for large-scale ethnicity screens.

IF 2.9, Zone 3 (Biology)

BMC Bioinformatics Pub Date : 2025-07-24 DOI: 10.1186/s12859-025-06192-3

Elior Avraham, Noam Shental

Citations: 0

Incorporating exon-exon junction reads enhances differential splicing detection.

IF 2.9, Zone 3 (Biology)

BMC Bioinformatics Pub Date : 2025-07-24 DOI: 10.1186/s12859-025-06210-4

Mai T Pham, Michael J G Milevskiy, Jane E Visvader, Yunshun Chen

Abstract:
Background: RNA sequencing (RNA-seq) is a gold standard technology for studying gene and transcript expression. Different transcripts from the same gene are usually determined by varying combinations of exons within the gene, formed by splicing events. One method of studying differential alternative splicing between groups in short-read RNA-seq experiments is through differential exon usage (DEU) analysis, which uses exon-level read counts along with downstream statistical testing strategies. However, the standard exon counting method does not consider exon-junction information, which may reduce the statistical power in detecting splicing alterations.
Results: We present a new workflow for differential splicing analysis, called differential exon-junction usage (DEJU). This DEJU analysis workflow adopts a new feature quantification approach that jointly summarises exon and exon-exon junction reads, which are then integrated into the established Rsubread-edgeR/limma frameworks. We performed comprehensive simulation studies to benchmark the performance of DEJU against existing methods. We also applied DEJU to a mouse mammary gland RNA-seq dataset, revealing biologically meaningful splicing events that could not be detected previously.
Conclusions: We demonstrate that incorporating exon-exon junction reads significantly improves the detection of differential splicing events. The proposed DEJU workflow offers increased statistical power and computational efficiency compared to widely used existing approaches, while effectively controlling the false discovery rate.
Volume 26, issue 1, page 193. Publication type: Journal Article.

Citations: 0

DiffCoRank: a comprehensive framework for discovering hub genes and differential gene co-expression in brain implant-associated tissue responses.

IF 2.9, Zone 3 (Biology)

BMC Bioinformatics Pub Date : 2025-07-23 DOI: 10.1186/s12859-025-06232-y

Anirban Chakraborty, Erin K Purcell, Michael G Moore

Abstract:
Background: Brain implants have significant potential for therapeutic applications and neuroscience research, but complex tissue responses often compromise their long-term stability. To address this challenge, differential coexpression analysis can be used to identify key molecular regulators involved in brain implant responses.
Results: We developed DiffCoRank, an integrated framework that improves differential coexpression analysis by integrating the techniques of RNA-Seq data preprocessing, gene filtering, correlation-based module identification, and network analysis to discover differentially coexpressed gene clusters. A key innovation of our approach is false discovery rate (FDR) based selection of strongly connected genes (SCGs), by which we improve detection of strong coexpression patterns that otherwise could be lost to spurious correlations. To enhance the identification of different modules, we employ a hybrid clustering technique that combines uniform manifold approximation and projection (UMAP) with density-based spatial clustering of applications with noise (DBSCAN). We propose a multi-criteria hub gene ranking system incorporating network centrality metrics such as degree, closeness, betweenness, and eigenvector centrality to prioritise biologically relevant genes. Additionally, we created a user-friendly application to visualize and explore the results of DiffCoRank interactively.
Conclusions: Our method successfully identified key gene modules involved in oxidative stress, calcium signaling, immunological regulation, autophagic recovery, and vascular remodeling in RNA-Seq data of implanted rat brain tissue. Furthermore, we compared our results to those of other existing coexpression analysis frameworks, showing that our method successfully identifies unique regulatory processes and consistent coexpression patterns. Our research offers novel insights into the molecular processes that explain implant-tissue interactions and possible approaches to improve the robustness and biocompatibility of brain interfaces.
Volume 26, issue 1, page 191. Publication type: Journal Article.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12288212/pdf/
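A multi-criteria hub gene ranking of the kind mentioned in this abstract could look roughly like the sketch below. It is an illustration under stated assumptions, not DiffCoRank's method: it uses only two of the four named centralities (degree and closeness), combines them by averaging rank positions (an assumed aggregation rule), and the gene names and network are hypothetical.

```python
from collections import deque

def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {gene: len(neighbours) / (n - 1) for gene, neighbours in adj.items()}

def closeness_centrality(adj):
    """Reachable nodes divided by the sum of shortest-path distances (via BFS)."""
    scores = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores

def rank_positions(scores):
    """Rank 1 = highest score; ties are broken alphabetically by gene name."""
    ordered = sorted(scores, key=lambda g: (-scores[g], g))
    return {gene: position + 1 for position, gene in enumerate(ordered)}

def hub_ranking(adj):
    """Order genes by their mean rank across all centrality criteria."""
    criteria = [degree_centrality(adj), closeness_centrality(adj)]
    ranks = [rank_positions(c) for c in criteria]
    mean_rank = {g: sum(r[g] for r in ranks) / len(ranks) for g in adj}
    return sorted(adj, key=lambda g: (mean_rank[g], g))

# Hypothetical coexpression network (undirected, as adjacency sets).
adj = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A", "D"}, "D": {"A", "C"}}
ranking = hub_ranking(adj)
```

Averaging rank positions rather than raw scores keeps criteria with different scales comparable; betweenness and eigenvector centrality would slot in as extra entries of `criteria`.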

Citations: 0

LDA-SCGB: inferring lncRNA-disease associations based on condensed gradient boosting.

IF 2.9, Zone 3 (Biology)

BMC Bioinformatics Pub Date : 2025-07-22 DOI: 10.1186/s12859-025-06169-2

Chengqiu Dai, Linna Wang, Yingwei Deng, Xuzhu Gao, Jingyu Zhang

Abstract:
Background: Long non-coding RNAs (lncRNAs) play essential roles in various physiological and pathological processes. Inferring new lncRNA-disease associations (LDAs) not only helps us better understand these complex biological processes, but also provides new options for the diagnosis and prevention of diseases.
Results: A novel computational model, LDA-SCGB, is proposed to predict new LDAs. LDA-SCGB first extracts features of each lncRNA-disease pair with singular value decomposition. Next, it classifies unknown lncRNA-disease pairs through the condensed gradient boosting model. The results demonstrated that LDA-SCGB greatly outperformed four other representative LDA inference methods (SDLDA, LDNFSGB, LDAenDL and LDASR) under 5-fold cross validations on lncRNAs, diseases, and lncRNA-disease pairs on three LDA datasets, which were from lncRNADisease v2.0, MNDR, and lncRNADisease v3.0, respectively. LDA-SCGB was further used to find potential lncRNAs for colorectal cancer, heart failure, and lung adenocarcinoma. The results demonstrated that CCDC26, MIAT, and CCDC26 had higher association probability with colorectal cancer, heart failure, and lung adenocarcinoma, respectively.
Conclusions: We foresee that LDA-SCGB is capable of predicting potential lncRNAs for complex diseases and further assisting in cancer diagnosis and therapy.
Volume 26, issue 1, page 190. Publication type: Journal Article.

Citations: 0

BPFun: a deep learning framework for bioactive peptide function prediction using multi-label strategy by transformer-driven and sequence rich intrinsic information.

IF 2.9, Zone 3 (Biology)

BMC Bioinformatics Pub Date : 2025-07-21 DOI: 10.1186/s12859-025-06190-5

Lun Zhu, Hao Sun, Sen Yang

Abstract:
Bioactive peptides are beneficial to, or have physiological effects on, the life activities of biological organisms. Their functions are diverse, and a single peptide often has more than one, so accurately detecting the multiple functions of multi-functional peptides is extremely important. Traditional experimental identification methods are time-consuming, laborious and costly. To overcome these problems, we adopt a computational biology approach and propose a new deep learning model, BPFun, which can predict seven functions including anticancer, antibacterial, antihypertensive and so on. In BPFun, we obtain the features of bioactive peptides from different aspects, including biological and physicochemical features. We also adopt data augmentation to address the problem of data imbalance. We combine convolutional networks of different scales and Bi-LSTM layers to obtain high-level feature vectors of different features. Finally, the prediction performance is improved by fusing these features and combining the self-attention mechanism with the Bi-LSTM layer. Our experiments show that BPFun, based on five types of sequence features, significantly improves the prediction performance for bioactive peptides. On the test dataset of seven functional classifications, BPFun achieves an accuracy of 0.6577 and an absolute truth value of 0.6573, and is superior to other methods. Code and data are available at https://github.com/291357657/BPFun.
Volume 26, issue 1, page 187. Publication type: Journal Article.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12278619/pdf/
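The two reported scores are multi-label metrics. The abstract does not define them, so the sketch below assumes the definitions commonly used in multi-label peptide prediction: accuracy as the mean per-sample overlap between true and predicted label sets (intersection over union), and "absolute truth" (often called "absolute true") as the fraction of samples whose label set is predicted exactly. The function names and the example labels are illustrative only.

```python
def multilabel_accuracy(y_true, y_pred):
    """Mean of |true AND pred| / |true OR pred| over all samples."""
    total = 0.0
    for true_labels, pred_labels in zip(y_true, y_pred):
        union = true_labels | pred_labels
        # Two empty label sets are treated as a perfect match.
        total += len(true_labels & pred_labels) / len(union) if union else 1.0
    return total / len(y_true)

def absolute_true(y_true, y_pred):
    """Fraction of samples whose predicted label set matches exactly."""
    exact = sum(t == p for t, p in zip(y_true, y_pred))
    return exact / len(y_true)

# Hypothetical predictions for three peptides.
y_true = [{"anticancer"}, {"antibacterial", "antihypertensive"}, {"antibacterial"}]
y_pred = [{"anticancer"}, {"antibacterial"}, {"antihypertensive"}]

accuracy = multilabel_accuracy(y_true, y_pred)
exact_match = absolute_true(y_true, y_pred)
```

Under these definitions, a partially correct prediction (the second peptide) earns partial credit in accuracy but none in the exact-match score, so the exact-match score can never exceed accuracy.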

Citations: 0

clonevdjseq: A workflow and bioinformatics management system for sequencing, archiving, and analysis of VDJ sequences from clonal libraries.

IF 2.9, Zone 3 (Biology)

BMC Bioinformatics Pub Date : 2025-07-21 DOI: 10.1186/s12859-025-06107-2

Keith Mitchell, Samuel Hunter, Lutz Froenicke, Karl Murray, Matthew Settles, James S Trimmer

Abstract:
Background: Advances in next-generation sequencing technologies have facilitated extensive analysis of B cell and T cell receptor (BCR/TCR, respectively) sequences from monoclonal hybridoma libraries, single B cells, and single T cells, generating vast amounts of important data pertaining to antigen recognition. However, existing workflows and bioinformatics tools often lack the flexibility and scalability needed to handle large clonal level datasets effectively. An initial system and hybridoma dependent version of this code was distributed as part of the NeuroMabSeq publication, but clonevdjseq aims to be a technical addendum for broader system compatibility and enhanced modeling.
Results: We present clonevdjseq, an integrated and accessible software solution leveraging Nextflow and Django. Developed primarily for large hybridoma libraries, the workflow and pipeline are amenable to BCR/TCR sequence analysis of homogeneous populations or clones of B and T cells, respectively. The clonevdjseq pipeline includes modules for read processing, amplicon denoising, and quality control of paired variable light/heavy chains of BCRs from B cells and hybridomas, or alpha (α)/beta (β) and delta (δ)/gamma (γ) chains of TCRs in the case of T cell applications. The pipeline is built upon a robust, high-throughput library prep protocol, upon which processed data has been verified across thousands of monoclonal antibodies. This effort has yielded sequences used to develop functional recombinant monoclonal antibodies and single chain variable fragments as part of the NeuroMabSeq initiative, in which thousands of hybridoma samples were processed (Mitchell et al. in Sci Rep 13(1):16200, 2023), and it provides additional modeling and extensibility to other modalities. The clonevdjseq software is accessible via Nextflow and also offers a database and web app as a final optional step in the processing for dissemination of results and data exploration.
Conclusions: clonevdjseq offers a comprehensive and scalable solution for the processing and analysis of large monoclonal and oligoclonal VDJ datasets. Its modular design, dynamic pipeline, and robust database integration facilitate efficient data management and analysis. The platform is publicly available and aims to support the research community by providing an accessible and flexible tool for archiving and dissemination of BCR sequences from hybridomas, with applicability for other applications such as TCR sequences from single-cell T cell populations.
Volume 26, issue 1, page 186. Publication type: Journal Article.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12278597/pdf/

Citations: 0

metaGEENOME: an integrated framework for differential abundance analysis of microbiome data in cross-sectional and longitudinal studies.

IF 2.9, Zone 3 (Biology)

BMC Bioinformatics Pub Date : 2025-07-21 DOI: 10.1186/s12859-025-06217-x

Ahmed Abdelkader, Nur A Ferdous, Mohamed El-Hadidi, Tomasz Burzykowski, Mohamed Mysara

Abstract:
Background: Detecting biomarkers is a key objective in microbiome research, often done through 16S rRNA amplicon sequencing or shotgun metagenomic analysis. A critical step in this process is differential abundance (DA) analysis, which aims to pinpoint taxa whose abundance significantly differs between groups. However, DA analysis remains challenging due to high dimensionality, compositionality, sparsity, inter-taxa correlations, uneven abundance distributions, and missing values, all of which hinder our ability to model the data accurately. Despite the availability of many DA tools, balancing high statistical power with effective false discovery rate (FDR) control remains a major limitation.
Results: Here, we introduce a novel approach for DA analysis that integrates counts adjusted with Trimmed Mean of M-values (CTF) normalization and Centered Log Ratio (CLR) transformation with a Generalized Estimating Equation (GEE) model. We benchmarked our approach against eight widely used tools employing both simulated and real datasets in cross-sectional and longitudinal settings. While several tools (e.g. MetagenomeSeq, edgeR, DESeq2 and Lefse) achieved high sensitivity, they often failed to adequately control the FDR. In contrast, our method demonstrated high sensitivity and specificity when compared to other approaches that successfully controlled the FDR, including ALDEx2, limma-voom, ANCOM, and ANCOM-BC2.
Conclusions: Our approach effectively addresses key challenges in microbiome data analysis across both cross-sectional and longitudinal designs. Integrated into the R package metaGEENOME (https://github.com/M-Mysara/metaGEENOME), our framework provides a flexible, scalable and statistically robust solution for DA analysis, offering improved FDR control and enhanced performance for biomarker discovery in microbiome studies.
Volume 26, issue 1, page 189. Publication type: Journal Article.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12281747/pdf/
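The Centered Log Ratio (CLR) transformation named in this abstract has a standard definition: each value is log-transformed and the mean of the logs (the log of the geometric mean) is subtracted. The sketch below shows that formula only; it is not taken from the metaGEENOME package, and the pseudocount used to handle zero counts is an assumption, since the abstract does not say how zeros are treated.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount (an assumed zero-handling choice) is added before
    taking logs so that zero counts do not produce log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]

# Hypothetical sample with three taxa, one of them unobserved.
sample = [0, 9, 99]
transformed = clr(sample)
```

CLR values within a sample always sum to zero, which is how the transform addresses compositionality: each taxon is expressed relative to the sample's geometric mean rather than as an absolute count.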

Citations: 0

Aryana-bs: context-aware alignment of bisulfite-sequencing reads.

IF 2.9, Zone 3 (Biology)

BMC Bioinformatics Pub Date : 2025-07-21 DOI: 10.1186/s12859-025-06182-5

Hassan Nikaein, Ali Sharifi-Zarchi, Afsoon Afzal, Saeedeh Ezzati, Farzane Rasti, Hamidreza Chitsaz, Govindarajan Kunde-Ramamoorthy

Abstract:
Background: DNA methylation is essential in various biological processes, including imprinting, development, inflammation, and numerous disorders, such as cancer. Bisulfite sequencing (BS) serves as the gold standard for measuring DNA methylation at single-base resolution by converting unmethylated cytosines to thymines while leaving methylated cytosines intact. However, this C-to-T conversion presents a well-known challenge in conventional short-read aligners, which treat these conversions as substitutions. Many aligners that require seed sequences fail when frequent C-to-T conversions occur over short distances, resulting in reduced alignment accuracy. To address this challenge, two alignment methods have been well established: three-letter alignment and wildcard alignment. Three-letter alignment faces the significant issue of data loss by converting all thymines to cytosines, which obscures meaningful information. On the other hand, wildcard alignment introduces a biased alignment, failing to treat reads from unmethylated and methylated regions equally, leading to artifacts in methylation level estimation and inaccuracies in quantifying DNA methylation. This work introduces ARYANA-BS, a novel BS aligner that diverges from conventional DNA aligners by directly integrating BS-specific base alterations within its alignment engine. Leveraging known DNA methylation patterns across different genomic contexts, ARYANA-BS constructs five indexes from the reference genome, aligns each read to all indexes, and selects the alignment with the minimum penalty. To further refine alignment accuracy, an optional Expectation-Maximization (EM) step is incorporated, which integrates methylation probability information into the decision-making process for choosing the optimal index for each read. This approach aims to enhance BS read alignment accuracy by accommodating the complexities of DNA methylation patterns across diverse genomic contexts.
Results: Experimental evaluations on both simulated and real data reveal that ARYANA-BS achieves state-of-the-art accuracy, maintaining competitive speed and memory efficiency.
Conclusions: ARYANA-BS significantly improves alignment accuracy for bisulfite sequencing data by effectively integrating DNA methylation-specific alterations and genomic context. It outperforms existing methods, such as BSMAP, bwa-meth, Bismark, BSBolt, and abismal, particularly in robustness against genomic biases and alignment of longer, higher-error reads, demonstrating suitability for cancer research and cell-free DNA studies. While the Expectation-Maximization (EM) algorithm provides only modest initial improvements, it establishes a valuable framework for future refinement and potential enhancements in sensitive applications.
Volume 26, issue 1, page 188. Publication type: Journal Article.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12281798/pdf/
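The alignment problem described in this abstract can be made concrete with a small simulation of bisulfite conversion: unmethylated cytosines read as thymines, methylated ones stay as cytosines, and a conventional aligner then counts every converted base as a mismatch. This sketch is illustrative only and is unrelated to the ARYANA-BS code; the sequence, the methylated positions, and the function names are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite treatment of the forward strand.

    Unmethylated C becomes T; C at a methylated position is left intact.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and index not in methylated else base
        for index, base in enumerate(sequence)
    )

def mismatches(read, reference):
    """Number of positions a naive aligner would score as substitutions."""
    return sum(a != b for a, b in zip(read, reference))

# Hypothetical reference fragment with one methylated cytosine at index 1.
reference = "ACGTCCG"
read = bisulfite_convert(reference, methylated_positions=[1])
apparent_substitutions = mismatches(read, reference)
```

Here the read comes from exactly the reference locus, yet two of its seven bases differ from the reference after conversion, which is why seed-based aligners can fail when conversions cluster over short distances.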

Citations: 0