{"title":"Benchmarking large language models for genomic knowledge with GeneTuring.","authors":"Xinyi Shang, Xu Liao, Zhicheng Ji, Wenpin Hou","doi":"10.1093/bib/bbaf492","DOIUrl":"10.1093/bib/bbaf492","url":null,"abstract":"<p><p>Large language models (LLMs) show promise in biomedical research, but their effectiveness for genomic inquiry remains unclear. We developed GeneTuring, a benchmark consisting of 16 genomics tasks with 1600 curated questions, and manually evaluated 48 000 answers from 10 LLM configurations, including GPT-4o (via API, ChatGPT with web access, and a custom Generative Pretrained Transformer (GPT) setup), GPT-3.5, Claude 3.5, Gemini Advanced, GeneGPT (both slim and full), BioGPT, and BioMedLM. A custom GPT-4o configuration integrated with National Center for Biotechnology Information (NCBI) Application Programming Interfaces (APIs), developed in this study as SeqSnap, achieved the best overall performance. GPT-4o with web access and GeneGPT demonstrated complementary strengths. Our findings highlight both the promise and current limitations of LLMs in genomics, and emphasize the value of combining LLMs with domain-specific tools for robust genomic intelligence. GeneTuring offers a key resource for benchmarking and improving LLMs in biomedical research.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 5","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12454257/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145124077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DA-HGL: a domain-augmented heterogeneous graph learning framework for protein function prediction.","authors":"Sai Hu, Wei Zhang, Bihai Zhao","doi":"10.1093/bib/bbaf511","DOIUrl":"10.1093/bib/bbaf511","url":null,"abstract":"<p><p>Accurate protein function prediction is critical for deciphering disease mechanisms and advancing precision medicine, yet remains challenging for proteins with sparse annotations. Traditional methods struggle with annotation sparsity and fail to integrate multimodal data holistically. We propose DA-HGL, a heterogeneous graph learning framework that integrates protein sequences, domain architectures, and Gene Ontology (GO) hierarchies through a multilayered graph and non-negative matrix factorization with dual biological constraints. DA-HGL uniquely models domain-function coherence, GO semantic consistency, and topological congruence. Evaluated on yeast and human proteomes, DA-HGL achieves Fmax gains of 9.0% (yeast CC) and 17.2% (human BP) over state-of-the-art methods. By dynamically learning domain-context associations and resolving annotation sparsity, DA-HGL excels in cold-start scenarios and disease-specific predictions (e.g. Parkinson's \"ubiquitin-dependent catabolism\"). This framework offers a robust tool for accelerating functional genomics and precision medicine. 
Code/data: https://github.com/husaiccsu/DA-HGL.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 5","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12476837/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145184520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TELLBASE: a novel tool of TELL-seq barcode-assisted scaffold assembler for bacterial genomes.","authors":"Yutong Li, Tianlong Kuang, Tao Xu, Hanxiao Du, Yi Zhang, Yu Qian, Yiwen Chen, Zhenxian Xiao, Chen Chen, Jing Wu, Wen-Hong Zhang, Chenqi Lu, Ning Jiang","doi":"10.1093/bib/bbaf504","DOIUrl":"10.1093/bib/bbaf504","url":null,"abstract":"<p><p>Transposase enzyme linked long-read sequencing (TELL-seq) technology generates barcode-linked reads, facilitating whole-genome sequencing (WGS), and complete assembly with improved accuracy and reduced costs. Unlike mate-pair sequencing technology, TELL-seq employs a near-full-sequence tagging strategy that allows more efficient capture of comprehensive genomic information. However, assembly algorithms and software capable of fully leveraging the characteristics of TELL-seq technology to effectively assemble genomic sequences at the megabase-scale are lacking, particularly for bacteria and their plasmids. In this study, we present TELL-seq barcode-assisted scaffold assembler (TELLBASE), a de novo genome assembler designed specifically for assembling bacterial genomes using TELL-seq-derived linked reads. In assembly tests involving bacteria such as Acinetobacter baumannii, Klebsiella pneumoniae, Mycobacterium tuberculosis, and Staphylococcus aureus, TELLBASE exhibited exceptional efficacy in producing chromosome-level bacterial genomic sequences and successful identification of plasmids present in the sequenced strains. Comparative analysis revealed that TELLBASE significantly outperforms existing assemblers tailored for TELL-seq-derived linked reads, such as TuringAssembler and Ariadne, in terms of the completeness and accuracy of the assembled genomes. Therefore, TELLBASE shows promising potential for refining draft bacterial genomes and further applications in related fields. 
The package for TELLBASE is freely available on GitHub (https://github.com/sosie1/TELLBASE).</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 5","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12476840/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145184570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MCAMEF-BERT: an efficient deep learning method for RNA N7-methylguanosine site prediction via multi-branch feature integration.","authors":"Junlei Yu, Wenjia Gao, Siqi Chen, Ronglin Lu, Jianbo Qiao, Junru Jin, Leyi Wei, Hua Shi, Zilong Zhang, Feifei Cui, Xinbo Jiang, Zhongmin Yan","doi":"10.1093/bib/bbaf447","DOIUrl":"10.1093/bib/bbaf447","url":null,"abstract":"<p><p>Accurate identification of N7-methylguanosine (m7G) modification sites plays a critical role in uncovering the regulatory mechanisms of various biological processes, including human development, tumor initiation, and progression. However, existing prediction methods still suffer from limited representational power, redundant feature fusion, insufficient utilization of biological prior knowledge, and poor interpretability. In this study, we propose a novel deep learning model named MCAMEF-BERT. This model adopts a parallel architecture that integrates both a DNABERT-2-based pretrained model branch and multiple traditional feature encoding branches, enabling comprehensive multi-perspective sequence feature extraction. To address the redundancy issue in feature fusion, we introduce a multi-channel attention module. Our model demonstrates superior accuracy and effectiveness on datasets from m7GHub, outperforming other state-of-the-art classifiers. Furthermore, we validate the interpretability of MCAMEF-BERT through in silico saturation mutagenesis experiments, and confirm its robustness in motif recognition. 
Moreover, its generalization capability is validated across diverse RNA modification site prediction tasks.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 5","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12400811/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144943426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GCNMF-SDA: predicting snoRNA-disease associations based on graph convolution and non-negative matrix factorization.","authors":"Yaowu Zhang, Xiu Jin, Xiaodan Zhang","doi":"10.1093/bib/bbaf453","DOIUrl":"10.1093/bib/bbaf453","url":null,"abstract":"<p><p>Small nucleolar RNAs (snoRNAs) play crucial roles in a wide range of biological processes, and studying their association with diseases can enhance our understanding of disease pathogenesis. Nevertheless, current knowledge of these associations is limited, and traditional biological experiments are both costly and time-consuming. Consequently, developing efficient computational methods is essential for predicting potential snoRNA-disease associations. We propose a novel prediction method based on non-negative matrix factorization and graph convolution for predicting snoRNA-disease associations (GCNMF-SDA). First, five different types of similarity information from snoRNA and disease entities are introduced to fully mine and refine the feature information. Then the snoRNA and disease similarity networks are integrated using the nonlinear Similarity Network Fusion (SNF) approach, while the weighted K nearest known neighbors (WKNKN) algorithm is applied to optimize the snoRNA-disease association matrix. Following this, the graph convolution module and the non-negative matrix factorization module extract disease features and snoRNA features, respectively. After extracting these features, they are combined into a composite feature vector for each snoRNA-disease pair. Finally, the composite feature vectors, along with their corresponding labels, are input into a multilayer perceptron for training. Our experiments, conducted using a rigorous five-fold cross-validation approach, reveal that the GCNMF-SDA model achieves an impressive area under the receiver operating characteristic curve (AUC-ROC) of 0.9659 and an area under the precision-recall curve (AUC-PR) of 0.9522. 
Furthermore, most of the novel associations identified by GCNMF-SDA were validated through case studies, underscoring the method's reliability in predicting potential relationships between snoRNAs and diseases.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 5","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12409419/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144991122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"stImage: a versatile framework for optimizing spatial transcriptomic analysis through customizable deep histology and location informed integration.","authors":"Yu Wang, Haichun Yang, Ruining Deng, Yuankai Huo, Qi Liu, Yu Shyr, Shilin Zhao","doi":"10.1093/bib/bbaf429","DOIUrl":"10.1093/bib/bbaf429","url":null,"abstract":"<p><p>Spatial transcriptomics (ST) integrates gene expression data with the spatial organization of cells and their associated histology, offering unprecedented insights into tissue biology. While existing methods incorporate either location-based or histology-informed information, none fully synergize gene expression, histological features, and precise spatial coordinates within a unified framework. Moreover, these methods often exhibit inconsistent performance across diverse datasets and conditions. Here, we introduce stImage, an open-source R package that provides a comprehensive and flexible solution for ST analysis. By generating deep learning-derived histology features and offering 54 integrative strategies, stImage seamlessly combines transcriptional profiles, histology images, and spatial information. We demonstrate stImage's effectiveness across multiple datasets, underscoring its ability to guide users toward the most suitable integration strategy using diagnostic graph. Our results highlight how stImage can optimize ST, consistently improving biological insights and advancing our understanding of tissue architecture. 
stImage is freely available at https://github.com/YuWang-VUMC/stImage.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 5","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12409783/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144991322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph-based deep learning for integrating single-cell and bulk transcriptomic data to identify clinical cancer subtypes.","authors":"Yixin Liu, Dandan Zhang, Tianyu Liu, Ao Wang, Guohua Wang, Yuming Zhao","doi":"10.1093/bib/bbaf467","DOIUrl":"10.1093/bib/bbaf467","url":null,"abstract":"<p><p>The integration of single-cell RNA sequencing (scRNA-seq) and bulk transcriptomic data has become essential for deciphering the complex heterogeneity of cancer and identifying clinical cancer subtypes. However, the inherent challenges posed by the high dimensionality, sparsity, and noise characteristics of scRNA-seq data have significantly hindered its widespread clinical translation. To address these limitations, we introduce single-cell and bulk transcriptomic graph deep learning, a graph-based deep learning method that synergistically integrates scRNA-seq and bulk transcriptomic data to precisely identify cancer subtypes and predict clinical outcomes. scBGDL constructs sample-specific gene graphs modeling complex gene-gene interactions and cellular relationships. The architecture employs Graph Attention Networks for feature aggregation, MinCutPool layers for dimensionality reduction, and Transformer modules to capture high-order biological dependencies. Independently validated in each of 16 distinct The Cancer Genome Atlas cancer types, scBGDL significantly outperformed existing methods in prognostic accuracy (mean C-index: 0.7060 versus 0.6709 max competitor), demonstrating robustness and generalizability to diverse transcriptional architectures. To demonstrate clinical versatility, we further evaluated scBGDL in three therapeutic contexts using multicenter cohorts: lung adenocarcinoma survival prediction (n = 1099), epithelial ovarian cancer platinum-based chemotherapy response (n = 762), skin cutaneous melanoma immunotherapy outcome (n = 305). 
scBGDL consistently delivered robust risk stratification (log-rank P < 0.05 across cohorts), identified key driver edges, and uncovered clinically relevant biological interpretations. By enabling multimodal data integration and interpretable biological insights, scBGDL advances precision oncology for prognosis prediction, therapy optimization, and biomarker discovery. The source code for scBGDL model is available online (https://github.com/NEFLab/scBGDL).</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 5","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12423395/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145085109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comprehensive comparison on clustering methods for multi-slice spatially resolved transcriptomics data analysis.","authors":"Caiwei Xiong, Shuai Huang, Muqing Zhou, Yiyan Zhang, Wenrong Wu, Xihao Li, Huaxiu Yao, Jiawen Chen, Yun Li","doi":"10.1093/bib/bbaf471","DOIUrl":"10.1093/bib/bbaf471","url":null,"abstract":"<p><p>Spatial transcriptomics (ST) data, by providing spatial information, enable simultaneous analysis of gene expression distributions and their spatial patterns within tissue. Clustering or spatial domain detection represents an essential methodology for ST data, facilitating the exploration of spatial organizations with shared gene expression or histological characteristics. Traditionally, clustering algorithms for ST have focused on individual tissue sections. However, the emergence of numerous contiguous tissue sections derived from the same or similar tissue specimens within or across individuals has led to the development of multi-slice clustering methods. In this study, we assess seven single-slice and four multi-slice clustering methods on two simulated datasets and four real datasets. Additionally, we investigate the effectiveness of preprocessing techniques, including spatial coordinate alignment (e.g. PASTE) and gene expression batch effect removal (e.g. Harmony), on clustering performance. 
Our study provides a comprehensive comparison of clustering methods for multi-slice ST data, serving as a practical guide for method selection in various scenarios.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 5","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12449087/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145085139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Artificial intelligence for comprehensive DNA methylation analysis: overview, challenges, and future directions.","authors":"Aymane Aghziel, Mohamed Adnane Mahraz, Hamid Tairi, Noura Aherrahrou","doi":"10.1093/bib/bbaf468","DOIUrl":"10.1093/bib/bbaf468","url":null,"abstract":"<p><p>This paper offers a comprehensive review of the synergy between artificial intelligence and DNA methylation analysis, encompassing machine learning, deep learning, natural language processing, and explainable artificial intelligence. In this study, we also highlight the underexplored potential of signal processing and large language model-based approaches in DNA methylation research. Additionally, we discuss the challenges and limitations faced when managing and analyzing large and complex DNA methylation datasets. Furthermore, this article aims to shed light on the continuing evolution of this field and on the possible directions for future research.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 5","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448452/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145085096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IMATAC imputes single-cell ATAC-seq data by deep hierarchical network with denoising autoencoder.","authors":"Yao Li, Hongqiang Lyu, Kexin Li, Yuan Liu, Xinman Zhang, Ze Liu, Pengcheng Jing, Peng Han","doi":"10.1093/bib/bbaf515","DOIUrl":"10.1093/bib/bbaf515","url":null,"abstract":"<p><p>Single-cell ATAC-seq (scATAC-seq) technology allows the interrogation of chromatin accessibility of individual cells. Dropout events occur while the sequencing data signals at some bona fide chromatin sites of individuals are not captured, and the curse of these dropouts in scATAC-seq data inevitably hinders downstream analysis. It remains a challenge to impute scATAC-seq data due to its high dimensionality, sparsity, and near-binarization properties. Herein, we propose IMATAC, a deep hierarchical network with denoising autoencoder for imputing scATAC-seq data in the form of peak by cell. The network embeds scATAC-seq data into a latent space by a deep hierarchical architecture at two different levels, including bottom level for local details and top level for global information, that helps to characterize the high-dimensional sparse scATAC-seq data. Besides, it is encouraged to learn to reconstruct the original scATAC-seq data from an artificially corrupted version through a denoising autoencoder, so as to acquire an ability to recover the missing values primarily relying on the cells under the same population with the help of a parallel multi-classifier. Using simulated and experimental data, the performance of IMATAC is demonstrated by a comparative analysis with the other competing methods. The results suggest that our method can achieve lower imputation errors, and benefit the downstream analysis, including heterogeneous clustering, differential analysis, and regulatory element discovery. 
Besides, the contributions of several important network modules in IMATAC are investigated, and how well it separates dropout zeros from biological zeros is discussed.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 5","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12478030/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145184459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}