{"title":"PyPropel: a Python-based tool for efficiently processing and characterising protein data.","authors":"Jianfeng Sun, Jinlong Ru, Adam P Cribbs, Dapeng Xiong","doi":"10.1186/s12859-025-06079-3","DOIUrl":"10.1186/s12859-025-06079-3","url":null,"abstract":"<p><strong>Background: </strong>The volume of protein sequence data has grown exponentially in recent years, driven by advancements in metagenomics. Despite this, a substantial proportion of these sequences remain poorly annotated, underscoring the need for robust bioinformatics tools to facilitate efficient characterisation and annotation for functional studies.</p><p><strong>Results: </strong>We present PyPropel, a Python-based computational tool developed to streamline the large-scale analysis of protein data, with a particular focus on applications in machine learning. PyPropel integrates sequence and structural data pre-processing, feature generation, and post-processing for model performance evaluation and visualisation, offering a comprehensive solution for handling complex protein datasets.</p><p><strong>Conclusion: </strong>PyPropel provides added value over existing tools by offering a unified workflow that encompasses the full spectrum of protein research, from raw data pre-processing to functional annotation and model performance analysis, thereby supporting efficient protein function studies.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"70"},"PeriodicalIF":2.9,"publicationDate":"2025-03-01","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11871610/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143536374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing microbiome data with taxonomic misclassification using a zero-inflated Dirichlet-multinomial model.","authors":"Matthew D Koslovsky","doi":"10.1186/s12859-025-06078-4","DOIUrl":"10.1186/s12859-025-06078-4","url":null,"abstract":"<p><p>The human microbiome is the collection of microorganisms living on and inside of our bodies. A major aim of microbiome research is understanding the role microbial communities play in human health with the goal of designing personalized interventions that modulate the microbiome to treat or prevent disease. Microbiome data are challenging to analyze due to their high-dimensionality, overdispersion, and zero-inflation. Analysis is further complicated by the steps taken to collect and process microbiome samples. For example, sequencing instruments have a fixed capacity for the total number of reads delivered. It is therefore essential to treat microbial samples as compositional. Another complicating factor of modeling microbiome data is that taxa counts are subject to measurement error introduced at various stages of the measurement protocol. Advances in sequencing technology and preprocessing pipelines coupled with our growing knowledge of the human microbiome have reduced, but not eliminated, measurement error. Ignoring measurement error during analysis, though common in practice, can then lead to biased inference and curb reproducibility. We propose a Dirichlet-multinomial modeling framework for microbiome data with excess zeros and potential taxonomic misclassification. 
We demonstrate how accommodating taxonomic misclassification improves estimation performance and investigate differences in gut microbial composition between healthy and obese children.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"69"},"PeriodicalIF":2.9,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11869466/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143522470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparative Assessment of Protein Large Language Models for Enzyme Commission Number Prediction.","authors":"João Capela, Maria Zimmermann-Kogadeeva, Aalt D J van Dijk, Dick de Ridder, Oscar Dias, Miguel Rocha","doi":"10.1186/s12859-025-06081-9","DOIUrl":"10.1186/s12859-025-06081-9","url":null,"abstract":"<p><strong>Background: </strong>Protein large language models (LLM) have been used to extract representations of enzyme sequences to predict their function, which is encoded by enzyme commission (EC) numbers. However, a comprehensive comparison of different LLMs for this task is still lacking, leaving questions about their relative performance. Moreover, protein sequence alignments (e.g. BLASTp or DIAMOND) are often combined with machine learning models to assign EC numbers from homologous enzymes, thus compensating for the shortcomings of these models' predictions. In this context, LLMs and sequence alignment methods have not been extensively compared as individual predictors, raising unaddressed questions about LLMs' performance and limitations relative to the alignment methods. In this study, we set out to assess the performance of ESM2, ESM1b, and ProtBERT language models in their ability to predict EC numbers, comparing them with BLASTp, against each other and against models that rely on one-hot encodings of amino acid sequences.</p><p><strong>Results: </strong>Our findings reveal that combining these LLMs with fully connected neural networks surpasses the performance of deep learning models that rely on one-hot encodings. Moreover, although BLASTp provided marginally better results overall, DL models provide results that complement BLASTp's, revealing that LLMs better predict certain EC numbers while BLASTp excels in predicting others. 
The ESM2 stood out as the best model among the LLMs tested, providing more accurate predictions on difficult annotation tasks and for enzymes without homologs.</p><p><strong>Conclusions: </strong>Crucially, this study demonstrates that LLMs still have to be improved to become the gold standard tool over BLASTp in mainstream enzyme annotation routines. On the other hand, LLMs can provide good predictions for more difficult-to-annotate enzymes, particularly when the identity between the query sequence and the reference database falls below 25%. Our results reinforce the claim that BLASTp and LLM models complement each other and can be more effective when used together.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"68"},"PeriodicalIF":2.9,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11866580/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143522475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining single-cell ATAC and RNA sequencing for supervised cell annotation.","authors":"Jaidip Gill, Abhijit Dasgupta, Brychan Manry, Natasha Markuzon","doi":"10.1186/s12859-025-06084-6","DOIUrl":"10.1186/s12859-025-06084-6","url":null,"abstract":"<p><strong>Motivation: </strong>Single-cell analysis offers insights into cellular heterogeneity and individual cell function. Cell type annotation is the first and critical step for performing such an analysis. Current methods mostly utilize single-cell RNA sequencing data. Several studies demonstrated improved unsupervised annotation when combining RNA with single-cell ATAC sequencing, but improvements in supervised methods have not been explored.</p><p><strong>Results: </strong>Single-cell 10x genomics multiome datasets containing paired ATAC and RNA from human peripheral blood mononuclear cells (PBMC) and neuronal cells with Alzheimer's Disease were used for supervised annotation. Using linear and nonlinear dimensionality reduction methods and random forest, support vector machine and logistic regression classification models, we demonstrate the improvement in supervised annotation and prediction confidence in PBMC data when using a combination of RNA seq and ATAC-seq data. No such improvement was observed when annotating neuronal cells. Specifically, F1 scores were improved when using scVI embeddings to annotate PBMC sub-types. 
CD4 T effector memory cells showed the largest improvement in F1 score.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"67"},"PeriodicalIF":2.9,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11863512/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143514600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing biomedical named entity recognition with parallel boundary detection and category classification.","authors":"Yu Wang, Hanghang Tong, Ziye Zhu, Fengzhen Hou, Yun Li","doi":"10.1186/s12859-025-06086-4","DOIUrl":"10.1186/s12859-025-06086-4","url":null,"abstract":"<p><strong>Background: </strong>Named entity recognition is a fundamental task in natural language processing. Recognizing entities in biomedical text, known as the BioNER, is particularly crucial for cutting-edge applications. However, BioNER poses greater challenges compared to traditional NER due to (1) nested structures and (2) category correlations inherent in biomedical entities. Recently, various BioNER models have been developed based on region classification or large language models. Despite being successful, these models still struggle to balance handling nested structures and capturing category knowledge.</p><p><strong>Results: </strong>We present a novel parallel BioNER model, BEAN, designed to address the unique properties of biomedical entities while achieving a reasonable balance between handling nested structures and incorporating category correlations. Extensive experiments on five public NER datasets, including four biomedical datasets, demonstrate that BEAN achieves state-of-the-art performance.</p><p><strong>Conclusions: </strong>The proposed BEAN is elaborately designed to achieve two key objectives of the BioNER task: clearly detecting entity boundaries and correctly classifying entity categories. It is the first BioNER model to handle nested structures and category correlations in parallel. We exploit head, tail, and contextualized features to efficiently detect entity boundaries via a triaffine model. 
To the best of our knowledge, we are the first to introduce a multi-label classification model for the BioNER task to extract entity category information without boundary guidance.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"63"},"PeriodicalIF":2.9,"publicationDate":"2025-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11863403/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143498801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"scFTAT: a novel cell annotation method integrating FFT and transformer.","authors":"Binhua Tang, Yiyao Chen","doi":"10.1186/s12859-025-06061-z","DOIUrl":"10.1186/s12859-025-06061-z","url":null,"abstract":"<p><strong>Background: </strong>Advancements in high-throughput sequencing and deep learning have boosted single-cell RNA studies. However, current methods for annotating single-cell data face challenges due to high data sparsity and tedious manual annotation on large-scale data.</p><p><strong>Results: </strong>Thus, we proposed a novel annotation model integrating FFT (Fast Fourier Transform) and an enhanced Transformer, named scFTAT. Initially, it reduces data sparsity using LDA (Linear Discriminant Analysis). Subsequently, automatic cell annotation is achieved through a proposed module integrating FFT and an enhanced Transformer. Moreover, the model is fine-tuned to improve training performance by effectively incorporating such techniques as kernel approximation, position encoding enhancement, and attention enhancement modules. Compared to existing popular annotation tools, scFTAT maintains high accuracy and robustness on six typical datasets. Specifically, the model achieves an accuracy of 0.93 on the human kidney data, with an F1 score of 0.84, precision of 0.96, recall rate of 0.80, and Matthews correlation coefficient of 0.89. The highest accuracy of the compared methods is 0.92, with an F1 score of 0.71, precision of 0.75, recall rate of 0.73, and Matthews correlation coefficient of 0.85. The compiled codes and supplements are available at: https://github.com/gladex/scFTAT .</p><p><strong>Conclusion: </strong>In summary, the proposed scFTAT effectively integrates FFT and enhanced Transformer for automatic feature learning, addressing the challenges of high sparsity and tedious manual annotation in single-cell profiling data. 
Experiments on six typical scRNA-seq datasets from human and mouse tissues evaluate the model using five metrics: accuracy, F1 score, precision, recall, and Matthews correlation coefficient. Performance comparisons with existing methods further demonstrate the efficiency and robustness of our proposed method.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"62"},"PeriodicalIF":2.9,"publicationDate":"2025-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11853718/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143490718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bacterial network for precise plant stress detection and enhanced crop resilience.","authors":"Shakeel Ahmed, Syed Muhammad Zaigham Abbas Naqvi, Muhammad Awais, Yongzhe Ren, Hao Zhang, Junfeng Wu, Linze Li, Vijaya Raghavan, Jiandong Hu","doi":"10.1186/s12859-025-06082-8","DOIUrl":"10.1186/s12859-025-06082-8","url":null,"abstract":"<p><p>Understanding plant hormonal responses to stress and their transport dynamics remains challenging, limiting advancements in enhancing plant resilience. Our study presents a novel approach that utilizes genetically engineered bacteria (GEB) as molecular transceivers within plants, aiming to develop revolutionary agricultural biosensors. We focus on abscisic acid (ABA), a key hormone for plant growth and stress response. We propose using Escherichia coli (E. coli) engineered with PYR1-derived receptors that exhibit high affinity for ABA, triggering a bioluminescent response. Simulations investigate the detection time for ABA, bacterial diffusion within plant roots, advection effects through shoots, and chemotaxis in response to attractant gradients in leaves. Results indicate that higher ABA concentrations correlate with shorter response times, with an average of 431.52 s based on bioluminescence. The average internalization time for bacteria through a plant root area of 2 µm<sup>2</sup> during the rhizophagy process is estimated at 1220.12 s. Simulations also assess bacterial movement through shoots, the impact of advection, and chemotactic responses. 
These findings highlight the complex interplay between plant signaling and microbial communities, validating the efficacy of our bacterial-based sensor approach and opening new avenues for agricultural biosensor technology.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"64"},"PeriodicalIF":2.9,"publicationDate":"2025-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11863917/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143498797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GRAMEP: an alignment-free method based on the maximum entropy principle for identifying SNPs.","authors":"Matheus Henrique Pimenta-Zanon, André Yoshiaki Kashiwabara, André Luís Laforga Vanzela, Fabricio Martins Lopes","doi":"10.1186/s12859-025-06037-z","DOIUrl":"10.1186/s12859-025-06037-z","url":null,"abstract":"<p><strong>Background: </strong>Advances in high throughput sequencing technologies provide a huge number of genomes to be analyzed. Thus, computational methods play a crucial role in analyzing and extracting knowledge from the data generated. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. It is common to adopt aligning sequences for analyzing genomic variations. However, this approach can be computationally expensive and restrictive in scenarios with large datasets.</p><p><strong>Results: </strong>We present a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. This study proposes GRAMEP, an alignment-free approach that adopts the principle of maximum entropy to discover the most informative k-mers specific to a genome or set of sequences under investigation. The informative k-mers enable the detection of variant-specific mutations in comparison to a reference genome or other set of sequences. In addition, our method offers the possibility of classifying novel sequences with no need for organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of viral genomes, including Dengue, HIV, and SARS-CoV-2. 
Our approach maintained accurate SARS-CoV-2 variant identification while demonstrating a lower computational cost compared to methods with the same purpose.</p><p><strong>Conclusions: </strong>GRAMEP is an open and user-friendly software based on maximum entropy that provides an efficient alignment-free approach to identifying and classifying unique genomic subsequences and SNPs with high accuracy, offering advantages over comparative methods. The instructions for use, applicability, and usability of GRAMEP are open access at https://github.com/omatheuspimenta/GRAMEP .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"66"},"PeriodicalIF":2.9,"publicationDate":"2025-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11863517/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143498806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"goloco: a web application to create genome scale information from surprisingly small experiments.","authors":"Sajid M Hossain, Yiyun Rao, Jahid O Hossain, Justin R Pritchard, Boyang Zhao","doi":"10.1186/s12859-025-06070-y","DOIUrl":"10.1186/s12859-025-06070-y","url":null,"abstract":"<p><strong>Background: </strong>Functional genomics aims to decipher gene function by observing cellular changes when specific genes are disrupted using CRISPR technology. However, these experiments are limited by scalability, as comprehensive CRISPR screens require extensive resources, involving millions of cells and thousands of sgRNAs, making large-scale studies challenging. We propose a novel approach with \"CRISPR lossy compression\" to reduce the complexity of CRISPR screens by focusing on key genetic nodes that can infer genome-wide phenotypes. These condensed sets, comprising 100 to 1,000 genes, make previously impractical genome-wide screens tractable.</p><p><strong>Results: </strong>To make this approach accessible to the wider scientific community, we developed goloco, an interactive web application that allows users to explore genome-scale loss-of-function phenotypes from as few as 100 pooled measurements. 
The tool is complemented by a wide array of analyses, including volcano plot visualizations, regression and network analyses.</p><p><strong>Conclusions: </strong>This tool goloco empowers researchers to conduct genome-scale functional studies with minimal experimental overhead, broadening the accessibility of large-scale functional genomics research.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"61"},"PeriodicalIF":2.9,"publicationDate":"2025-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11854281/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143490714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tisslet tissues-based learning estimation for transcriptomics.","authors":"Ahmed Miloudi, Aisha Al-Qahtani, Thamanna Hashir, Mohamed Chikri, Halima Bensmail","doi":"10.1186/s12859-024-06025-9","DOIUrl":"10.1186/s12859-024-06025-9","url":null,"abstract":"<p><p>In the context of multi-omics data analytics for various diseases, transcriptome-wide association studies leveraging genetically predicted gene expression hold promise for identifying novel regions linked to complex traits. However, existing methods for multi-tissue gene expression prediction often fail to account for tissue-tissue expression interactions, limiting their accuracy and effectiveness. This research addresses the challenge of predicting gene expression across multiple tissues by incorporating tissue-tissue expression correlations based on a nonlinear multivariate model. Our findings demonstrate that this model excels in estimating tissue-tissue interactions and accurately predicting missing data. These results have significant implications for multi-omics data analytics and transcriptome-wide association studies, suggesting a novel approach for identifying regions associated with complex traits.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"65"},"PeriodicalIF":2.9,"publicationDate":"2025-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11863492/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143498813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}