{"title":"Deep learning in single-cell and spatial transcriptomics data analysis: advances and challenges from a data science perspective.","authors":"Shuang Ge, Shuqing Sun, Huan Xu, Qiang Cheng, Zhixiang Ren","doi":"10.1093/bib/bbaf136","DOIUrl":"10.1093/bib/bbaf136","url":null,"abstract":"<p><p>The development of single-cell and spatial transcriptomics has revolutionized our capacity to investigate cellular properties, functions, and interactions in both cellular and spatial contexts. Despite this progress, the analysis of single-cell and spatial omics data remains challenging. First, single-cell sequencing data are high-dimensional and sparse, and are often contaminated by noise and uncertainty, obscuring the underlying biological signal. Second, these data often encompass multiple modalities, including gene expression, epigenetic modifications, metabolite levels, and spatial locations. Integrating these diverse data modalities is crucial for enhancing prediction accuracy and biological interpretability. Third, while the scale of single-cell sequencing has expanded to millions of cells, high-quality annotated datasets are still limited. Fourth, the complex correlations of biological tissues make it difficult to accurately reconstruct cellular states and spatial contexts. Traditional feature engineering approaches struggle with the complexity of biological networks, while deep learning, with its ability to handle high-dimensional data and automatically identify meaningful patterns, has shown great promise in overcoming these challenges. Besides systematically reviewing the strengths and weaknesses of advanced deep learning methods, we have curated 21 datasets from nine benchmarks to evaluate the performance of 58 computational methods. 
Our analysis reveals that model performance can vary significantly across different benchmark datasets and evaluation metrics, providing a useful perspective for selecting the most appropriate approach based on a specific application scenario. We highlight three key areas for future development, offering valuable insights into how deep learning can be effectively applied to transcriptomic data analysis in biological, medical, and clinical settings.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11970898/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143787706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"scMUG: deep clustering analysis of single-cell RNA-seq data on multiple gene functional modules.","authors":"De-Min Liang, Pu-Feng Du","doi":"10.1093/bib/bbaf138","DOIUrl":"10.1093/bib/bbaf138","url":null,"abstract":"<p><p>Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity by providing gene expression data at the single-cell level. Unlike bulk RNA-seq, scRNA-seq allows identification of different cell types within a given tissue, leading to a more nuanced comprehension of cell functions. However, the analysis of scRNA-seq data presents challenges due to its sparsity and high dimensionality. Since bioinformatics plays an important role in the analysis of big data and its utility for the welfare of living beings, it has been widely applied in analyzing scRNA-seq data. To address these challenges, we introduce the scMUG computational pipeline, which incorporates gene functional module information to enhance scRNA-seq clustering analysis. The pipeline includes data preprocessing, cell representation generation, cell-cell similarity matrix construction, and clustering analysis. The scMUG pipeline also introduces a novel similarity measure that combines local density and global distribution in the latent cell representation space. As far as we can tell, this is the first attempt to integrate gene functional associations into scRNA-seq clustering analysis. We curated nine human scRNA-seq datasets to evaluate our scMUG pipeline. With the help of gene functional information and the novel similarity measure, the clustering results from scMUG pipeline present deep insights into functional relationships between gene expression patterns and cellular heterogeneity. In addition, our scMUG pipeline also presents comparable or better clustering performances than other state-of-the-art methods. 
All source codes of scMUG have been deposited in a GitHub repository with instructions for reproducing all results (https://github.com/degiminnal/scMUG).</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11972635/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143794592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HSSPPI: hierarchical and spatial-sequential modeling for PPIs prediction.","authors":"Yuguang Li, Zhen Tian, Xiaofei Nan, Shoutao Zhang, Qinglei Zhou, Shuai Lu","doi":"10.1093/bib/bbaf079","DOIUrl":"10.1093/bib/bbaf079","url":null,"abstract":"<p><strong>Motivation: </strong>Protein-protein interactions play a fundamental role in biological systems. Accurate detection of protein-protein interaction sites (PPIs) remains a challenge. Moreover, methods for PPIs prediction based on biological experiments are expensive. Recently, many computational methods have been developed and have made great progress. However, current computational methods focus on only one form of the protein, using either its spatial conformation or its primary sequence, and they ignore the protein's natural hierarchical structure.</p><p><strong>Results: </strong>In this study, we propose a novel network architecture, HSSPPI, through hierarchical and spatial-sequential modeling of protein for PPIs prediction. In this network, we represent a protein as a hierarchical graph, in which a node in the protein is a residue (residue-level graph) and a node in the residue is an atom (atom-level graph). Moreover, we design a spatial-sequential block for capturing complex interaction relationships from the spatial and sequential forms of the protein. We evaluate HSSPPI on public benchmark datasets, and its predictions outperform those of the compared models. 
This indicates the effectiveness of hierarchical protein modeling and also illustrates that HSSPPI has a strong feature extraction ability by considering spatial and sequential information simultaneously.</p><p><strong>Availability and implementation: </strong>The code of HSSPPI is available at https://github.com/biolushuai/Hierarchical-Spatial-Sequential-Modeling-of-Protein.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11879409/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143555835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning directed acyclic graphs for ligands and receptors based on spatially resolved transcriptomic data of ovarian cancer.","authors":"Shrabanti Chowdhury, Sammy Ferri-Borgogno, Peng Yang, Wenyi Wang, Jie Peng, Samuel C Mok, Pei Wang","doi":"10.1093/bib/bbaf085","DOIUrl":"10.1093/bib/bbaf085","url":null,"abstract":"<p><p>To unravel the mechanism of immune activation and suppression within tumors, a critical step is to identify transcriptional signals governing cell-cell communication between tumor and immune/stromal cells in the tumor microenvironment. Central to this communication are interactions between secreted ligands and cell-surface receptors, creating a highly connected signaling network among cells. Recent advancements in in situ-omics profiling, particularly spatial transcriptomic (ST) technology, provide unique opportunities to directly characterize ligand-receptor signaling networks that power cell-cell communication. In this paper, we propose a novel statistical method, LRnetST, to characterize the ligand-receptor interaction networks between adjacent tumor and immune/stroma cells based on ST data. LRnetST utilizes a directed acyclic graph model with a novel approach to handle the zero-inflated distributions of ST data. It also leverages existing ligand-receptor regulation databases as prior information, and employs a bootstrap aggregation strategy to achieve robust network estimation. Application of LRnetST to ST data of high-grade serous ovarian tumor samples revealed both common and distinct ligand-receptor regulations across different tumors. Some of these interactions were validated through both a MERFISH dataset and a CosMx SMI dataset of independent ovarian tumor samples. These results cast light on biological processes relating to the communication between tumor and immune/stromal cells in ovarian tumors. 
An open-source R package of LRnetST is available on GitHub at https://github.com/jie108/LRnetST.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11891659/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143584174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TopoQA: a topological deep learning-based approach for protein complex structure interface quality assessment.","authors":"Bingqing Han, Yipeng Zhang, Longlong Li, Xinqi Gong, Kelin Xia","doi":"10.1093/bib/bbaf083","DOIUrl":"10.1093/bib/bbaf083","url":null,"abstract":"<p><p>Even with the significant advances of AlphaFold-Multimer (AF-Multimer) and AlphaFold3 (AF3) in protein complex structure prediction, their accuracy is still not comparable with that of monomer structure prediction. Efficient and effective models for quality assessment (QA), also called estimation of model accuracy, that can evaluate the quality of predicted protein complexes without knowing their native structures are of key importance for protein structure generation and model selection. In this paper, we leverage persistent homology (PH) to capture the atomic-level topological information around residues and design a topological deep learning-based QA method, TopoQA, to assess the accuracy of protein complex interfaces. We integrate PH from topological data analysis into graph neural networks (GNNs) to characterize complex higher-order structures that GNNs might overlook, enhancing the learning of the relationship between the topological structure of complex interfaces and quality scores. Our TopoQA model is extensively validated on the two most widely used benchmark datasets, Docking Benchmark5.5 AF2 (DBM55-AF2) and Heterodimer-AF2 (HAF2), along with our newly constructed ABAG-AF3 dataset to facilitate comparisons with AF3. For all three datasets, TopoQA outperforms AF-Multimer-based AF2Rank and shows an advantage over AF3 in nearly half of the targets. In particular, on the DBM55-AF2 dataset, TopoQA obtains a ranking loss 73.6% lower than that of AF-Multimer-based AF2Rank. 
Further, beyond AF-Multimer and AF3, we have also compared TopoQA extensively against nearly all state-of-the-art models known to us, and found that TopoQA achieves the highest Top 10 Hit-rate on the DBM55-AF2 dataset and the lowest ranking loss on the HAF2 dataset. Ablation experiments show that our topological features significantly improve the model's performance. At the same time, our method also provides a new paradigm for protein structure representation learning.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11891663/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143584536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DockEM: an enhanced method for atomic-scale protein-ligand docking refinement leveraging low-to-medium resolution cryo-EM density maps.","authors":"Jing Zou, Wenyi Zhang, Jun Hu, Xiaogen Zhou, Biao Zhang","doi":"10.1093/bib/bbaf091","DOIUrl":"10.1093/bib/bbaf091","url":null,"abstract":"<p><p>Protein-ligand docking plays a pivotal role in virtual drug screening, and recent advancements in cryo-electron microscopy (cryo-EM) technology have significantly accelerated the progress of structure-based drug discovery. However, the majority of cryo-EM density maps are of medium to low resolution (3-10 Å), which presents challenges in effectively integrating cryo-EM data into molecular docking workflows. In this study, we present an updated protein-ligand docking method, DockEM, which leverages local cryo-EM density maps and physical energy refinement to precisely dock ligands into specific protein binding sites. On a dataset of 121 protein-ligand compounds, our results demonstrate that DockEM outperforms other advanced docking methods. The strength of DockEM lies in its ability to incorporate cryo-EM density map information, effectively leveraging the structural information of ligands embedded within these maps. 
This advancement enhances the use of cryo-EM density maps in virtual drug screening, offering a more reliable framework for drug discovery.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11891657/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143584800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"REDInet: a temporal convolutional network-based classifier for A-to-I RNA editing detection harnessing million known events.","authors":"Adriano Fonzino, Pietro Luca Mazzacuva, Adam Handen, Domenico Alessandro Silvestris, Annette Arnold, Riccardo Pecori, Graziano Pesole, Ernesto Picardi","doi":"10.1093/bib/bbaf107","DOIUrl":"10.1093/bib/bbaf107","url":null,"abstract":"<p><p>A-to-I ribonucleic acid (RNA) editing detection is still a challenging task. Current bioinformatics tools rely on empirical filters and whole genome sequencing or whole exome sequencing data to remove background noise, sequencing errors, and artifacts. Sometimes they make use of cumbersome and time-consuming computational procedures. Here, we present REDInet, a temporal convolutional network-based deep learning algorithm, to profile RNA editing in human RNA sequencing (RNAseq) data. It has been trained on REDIportal RNA editing sites, the largest collection of human A-to-I changes from >8000 RNAseq data of the genotype-tissue expression project. REDInet can classify editing events with high accuracy harnessing RNAseq nucleotide frequencies of 101-base windows without the need for coupled genomic data.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11924403/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143668919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EMcnv: enhancing CNV detection performance through ensemble strategies with heterogeneous meta-graph neural networks.","authors":"Xuwen Wang, Zhili Chang, Yuqian Liu, Shenjie Wang, Xiaoyan Zhu, Yang Shao, Jiayin Wang","doi":"10.1093/bib/bbaf135","DOIUrl":"10.1093/bib/bbaf135","url":null,"abstract":"<p><p>Copy number variation (CNV) is a crucial biomarker for many complex traits and diseases. Although numerous CNV detection tools are available, no single method consistently achieves optimal performance across diverse sequencing samples, as each tool has distinct advantages and limitations. Therefore, integrating the strengths of these tools to improve CNV detection accuracy is both a promising strategy and a significant challenge. To address this, we propose EMcnv, a novel deep ensemble framework based on meta-learning. EMcnv combines multiple CNV detection strategies through a three-step approach: (i) leveraging meta-learning and meta-path heterogeneous graphs, employing Relational Graph Convolutional Networks as a specific model within the Heterogeneous Graph Neural Networks framework to develop a probabilistic weight meta-model that ensembles various CNV detection strategies; (ii) assigning probabilistic weights to calls from different CNV detection tools and aggregating them into weighted CNV regions (CNVRs); (iii) refining copy number variations based on weighted CNVRs. We conducted comprehensive experiments on both simulated and real sequencing data using benchmark datasets. The results demonstrate that EMcnv significantly outperforms popular existing methods, underscoring its superiority and importance in CNV detection. 
To support further research, the source code is available for academic use at https://github.com/Sherwin-xjtu/EMcnv.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11957260/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143751265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DS-MVP: identifying disease-specific pathogenicity of missense variants by pre-training representation.","authors":"Qiufeng Chen, Lijun Quan, Lexin Cao, Bei Zhang, Zhijun Zhang, Liangchen Peng, Junkai Wang, Yelu Jiang, Liangpeng Nie, Geng Li, Tingfang Wu, Qiang Lyu","doi":"10.1093/bib/bbaf119","DOIUrl":"10.1093/bib/bbaf119","url":null,"abstract":"<p><p>Accurately predicting the pathogenicity of missense variants is crucial for improving disease diagnosis and advancing clinical research. However, existing computational methods primarily focus on general pathogenicity predictions, overlooking assessments of disease-specific conditions. In this study, we propose DS-MVP, a method capable of predicting disease-specific pathogenicity of missense variants in human genomes. DS-MVP first leverages a deep learning model pre-trained on a large general pathogenicity dataset to learn rich representation of missense variants. It then fine-tunes these representations with an XGBoost model on smaller datasets for specific diseases. We evaluated the learned representation by testing it on multiple binary pathogenicity datasets and gene-level statistics, demonstrating that DS-MVP outperforms existing state-of-the-art methods, such as MetaRNN and AlphaMissense. Additionally, DS-MVP excels in multi-label and multi-class classification, effectively classifying disease-specific pathogenic missense variants based on disease conditions. It further enhances predictions by fine-tuning the pre-trained model on disease-specific datasets. Finally, we analyzed the contributions of the pre-trained model and various feature types, with gene description corpus features from large language model and genetic feature fusion contributing the most. 
These results underscore that DS-MVP represents a broader perspective on pathogenicity prediction and holds potential as an effective tool for disease diagnosis.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11932084/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143699493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparison of random forest variable selection methods for regression modeling of continuous outcomes.","authors":"Nathaniel S O'Connell, Byron C Jaeger, Garrett S Bullock, Jaime Lynn Speiser","doi":"10.1093/bib/bbaf096","DOIUrl":"10.1093/bib/bbaf096","url":null,"abstract":"<p><p>Random forest (RF) regression is a popular machine learning method to develop prediction models for continuous outcomes. Variable selection, also known as feature selection or reduction, involves selecting a subset of predictor variables for modeling. Potential benefits of variable selection are methodologic (i.e. improving prediction accuracy and computational efficiency) and practical (i.e. reducing the burden of data collection and improving efficiency). Several variable selection methods leveraging RFs have been proposed, but there is limited evidence to guide decisions on which methods may be preferable for different types of datasets with continuous outcomes. Using 59 publicly available datasets in a benchmarking study, we evaluated the implementation of 13 RF variable selection methods. Performance of variable selection was measured via out-of-sample R<sup>2</sup> of an RF that used the variables selected for each method. Simplicity of variable selection was measured via the percent reduction in the number of variables selected out of the number of variables available. Efficiency was measured via computational time required to complete the variable selection. Based on our benchmarking study, variable selection methods implemented in the Boruta and aorsf R packages selected the best subset of variables for axis-based RF models, whereas methods implemented in the aorsf R package selected the best subset of variables for oblique RF models. 
A significant contribution of this study is the ability to assess different variable selection methods in the setting of RF regression for continuous outcomes to identify preferable methods using an open science approach.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11891652/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143584797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}