{"title":"STCGAN: a novel cycle-consistent generative adversarial network for spatial transcriptomics cellular deconvolution.","authors":"Bo Wang, Yahui Long, Yuting Bai, Jiawei Luo, Chee Keong Kwoh","doi":"10.1093/bib/bbae670","DOIUrl":"10.1093/bib/bbae670","url":null,"abstract":"<p><strong>Motivation: </strong>Spatial transcriptomics (ST) technologies have revolutionized our ability to map gene expression patterns within native tissue context, providing unprecedented insights into tissue architecture and cellular heterogeneity. However, accurately deconvolving cell-type compositions from ST spots, which is essential for accurately depicting tissue architecture, remains challenging due to the sparse and averaged nature of ST data. While numerous computational methods have been developed for cell-type deconvolution and spatial distribution reconstruction, most fail to capture tissue complexity at the single-cell level, thereby limiting their applicability in practical scenarios.</p><p><strong>Results: </strong>To this end, we propose a novel cycle-consistent generative adversarial network named STCGAN for cellular deconvolution in spatial transcriptomics. STCGAN first employs a cycle-consistent generative adversarial network (CGAN) to pre-train on ST data, ensuring that both the mapping from ST data to latent space and its reverse mapping are consistent, capturing complex spatial gene expression patterns and learning robust latent representations. Based on the learned representation, STCGAN then optimizes a trainable cell-to-spot mapping matrix to integrate scRNA-seq data with ST data, accurately estimating cellular composition within each capture spot and effectively reconstructing the spatial distribution of cells across the tissue. To further enhance deconvolution accuracy, we incorporate spatial-aware regularization that ensures accurate cellular distribution reconstruction within the spatial context. 
Benchmarking against seven state-of-the-art methods on five simulated and real datasets from various tissues, STCGAN consistently delivers superior cell-type deconvolution performance.</p><p><strong>Availability: </strong>The code of STCGAN can be downloaded from https://github.com/cs-wangbo/STCGAN and all the mentioned datasets are available on Zenodo at https://zenodo.org/doi/10.5281/zenodo.10799113.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11666287/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142880673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Steering veridical large language model analyses by correcting and enriching generated database queries: first steps toward ChatGPT bioinformatics.","authors":"Olivier Cinquin","doi":"10.1093/bib/bbaf045","DOIUrl":"10.1093/bib/bbaf045","url":null,"abstract":"<p><p>Large language models (LLMs) leverage factual knowledge from pretraining. Yet this knowledge remains incomplete and sometimes challenging to retrieve, especially in scientific domains not extensively covered in pretraining datasets and where information is still evolving. Here, we focus on genomics and bioinformatics. We confirm and expand upon issues with plain ChatGPT functioning as a bioinformatics assistant. Poor data retrieval and hallucination lead ChatGPT to err, as do incorrect sequence manipulations. To address this, we propose a system basing LLM outputs on up-to-date, authoritative facts and facilitating LLM-guided data analysis. Specifically, we introduce NagGPT, a middleware tool to insert between LLMs and databases, designed to bridge gaps in LLM knowledge and usage of database application programming interfaces. NagGPT proxies LLM-generated database queries, with special handling of incorrect queries. It acts as a gatekeeper between query responses and the LLM prompt, redirecting large responses to files but providing a synthesized snippet and injecting comments to steer the LLM. A companion OpenAI custom GPT, Genomics Fetcher-Analyzer, connects ChatGPT with NagGPT. It steers ChatGPT to generate and run Python code, performing bioinformatics tasks on data dynamically retrieved from a dozen common genomics databases (e.g. NCBI, Ensembl, UniProt, WormBase, and FlyBase). We implement partial mitigations for encountered challenges: detrimental interactions between code generation style and data analysis, confusion between database identifiers, and hallucination of both data and actions taken. 
Our results identify avenues to augment ChatGPT as a bioinformatics assistant and, more broadly, to improve factual accuracy and instruction following of unmodified LLMs.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11798674/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143254680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inferring tumor purity using multi-omics data based on a uniform machine learning framework MoTP.","authors":"Qiqi Lu, Zhixian Liu, Xiaosheng Wang","doi":"10.1093/bib/bbaf056","DOIUrl":"10.1093/bib/bbaf056","url":null,"abstract":"<p><p>Existing algorithms for assessing tumor purity are limited to a single omics data type, such as gene expression, somatic copy number variations, somatic mutations, or DNA methylation. Here we propose the machine learning Multi-omics Tumor Purity prediction (MoTP) algorithm to estimate tumor purity based on multiple types of omics data. MoTP utilizes Bayesian Regularized Neural Networks as the prediction algorithm, and Consensus Tumor Purity Estimates as labels. We trained MoTP using multi-omics data (mRNA, microRNA, long non-coding RNA, and DNA methylation) across 21 TCGA solid cancer types. By testing MoTP in TCGA validation sets, TCGA test sets, and eight datasets outside the TCGA cancer cohorts, we showed that although MoTP could achieve excellent performance in predicting tumor purity based on a single omics data type, the integration of multiple single omics data-based predictions can enhance the prediction performance. Moreover, we demonstrated the robustness of MoTP by testing it in datasets with Gaussian noise and missing features. Benchmark analysis showed that MoTP outperformed most established tumor purity prediction algorithms, and that it required less running time and fewer computational resources to fulfill the predictive task. 
Thus, MoTP would be an attractive option for computational tumor purity inference.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11826339/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143413486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A versatile pipeline to identify convergently lost ancestral conserved fragments associated with convergent evolution of vocal learning.","authors":"Xiaoyi Li, Kangli Zhu, Ying Zhen","doi":"10.1093/bib/bbae614","DOIUrl":"10.1093/bib/bbae614","url":null,"abstract":"<p><p>Molecular convergence in convergently evolved lineages provides valuable insights into the shared genetic basis of converged phenotypes. However, most methods are limited to coding regions, overlooking the potential contribution of regulatory regions. We focused on the independently evolved vocal learning ability in multiple avian lineages, and developed a whole-genome-alignment-free approach to identify genome-wide Convergently Lost Ancestral Conserved fragments (CLACs) in these lineages, encompassing noncoding regions. We discovered 2711 CLACs that are overrepresented in noncoding regions. Proximal genes of these CLACs exhibit significant enrichment in neurological pathways, including glutamate receptor signaling pathway and axon guidance pathway. Moreover, their expression is highly enriched in brain tissues associated with speech formation. Notably, several have known functions in speech and language learning, including ROBO family, SLIT2, GRIN1, and GRIN2B. Additionally, we found significantly enriched motifs in noncoding CLACs, which match binding motifs of transcriptional factors involved in neurogenesis and gene expression regulation in brain. 
Furthermore, we discovered 19 candidate genes that harbor CLACs in both human and multiple avian vocal learning lineages, suggesting their potential contribution to the independent evolution of vocal learning in both birds and humans.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11586126/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142709168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cell-specific priors rescue differential gene expression in spatial spot-based technologies.","authors":"Ornit Nahman, Timothy J Few-Cooper, Shai S Shen-Orr","doi":"10.1093/bib/bbae621","DOIUrl":"10.1093/bib/bbae621","url":null,"abstract":"<p><p>Spatial transcriptomics (ST), a breakthrough technology, captures the complex structure and state of tissues through the spatial profiling of gene expression. A variety of ST technologies have now emerged, most prominently spot-based platforms such as Visium. Despite the widespread use of ST and its distinct data characteristics, the vast majority of studies continue to analyze ST data using algorithms originally designed for older technologies such as single-cell (SC) and bulk RNA-seq, particularly when identifying differentially expressed genes (DEGs). However, it remains unclear whether these algorithms are still valid or appropriate for ST data. Therefore, here, we sought to characterize the performance of these methods by constructing an in silico simulator of ST data with a controllable and known DEG ground truth. Surprisingly, our findings reveal little variation in the performance of classic DEG algorithms, all of which fail to accurately recapture known DEGs to significant levels. We further demonstrate that cellular heterogeneity within spots is a primary cause of this poor performance and propose a simple gene-selection scheme, based on prior knowledge of cell-type specificity, to overcome this. Notably, our approach outperforms existing data-driven methods designed specifically for ST data and offers improved DEG recovery and reliability rates. 
In summary, our work details a conceptual framework that can be used upstream, agnostically, of any DEG algorithm to improve the accuracy of ST analysis and any downstream findings.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11647270/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142827377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepHapNet: a haplotype assembly method based on RetNet and deep spectral clustering.","authors":"Junwei Luo, Jiaojiao Wang, Jingjing Wei, Chaokun Yan, Huimin Luo","doi":"10.1093/bib/bbae656","DOIUrl":"10.1093/bib/bbae656","url":null,"abstract":"<p><p>Gene polymorphism originates from single-nucleotide polymorphisms (SNPs), and the analysis and study of SNPs are of great significance in the field of biogenetics. The haplotype, which consists of the sequence of SNP loci, carries more genetic information than a single SNP. Haplotype assembly plays a significant role in understanding gene function, diagnosing complex diseases, and pinpointing species genes. We propose a novel method, DeepHapNet, for haplotype assembly through the clustering of reads and learning correlations between read pairs. We employ a sequence model called Retentive Network (RetNet), which utilizes a multiscale retention mechanism to extract read features and learn the global relationships among them. Based on the feature representation of reads learned from the RetNet model, the clustering process of reads is implemented using the SpectralNet model, and, finally, haplotypes are constructed based on the read clusters. Experiments with simulated and real datasets show that the method performs well in the haplotype assembly problem of diploid and polyploid based on either long or short reads. 
The code implementation of DeepHapNet and the processing scripts for experimental data are publicly available at https://github.com/wjj6666/DeepHapNet.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11652615/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142845785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diffusion model assisted designing self-assembling collagen mimetic peptides as biocompatible materials.","authors":"Xinglong Wang, Kangjie Xu, Lingling Ma, Ruoxi Sun, Kun Wang, Ruiyan Wang, Junli Zhang, Wenwen Tao, Kai Linghu, Shuyao Yu, Jingwen Zhou","doi":"10.1093/bib/bbae622","DOIUrl":"10.1093/bib/bbae622","url":null,"abstract":"<p><p>Collagen self-assembly supports its mechanical function, but controlling collagen mimetic peptides (CMPs) to self-assemble into higher-order oligomers with numerous functions remains challenging due to the vast potential amino acid sequence space. Herein, we developed a diffusion model to learn features from different types of human collagens and generate CMPs; 66% of the synthetic CMPs could self-assemble into triple helices. Triple-helical and untwisting states were probed by melting temperature (Tm); hence, we developed a model to predict collagen Tm, achieving a state-of-the-art Pearson's correlation (PC) of 0.95 by cross-validation and a PC of 0.8 for predicting Tm values of synthetic CMPs. Our chemically synthesized short CMPs and recombinantly expressed long CMPs could self-assemble, with the lowest requirement for hydrogel formation at a concentration of 0.08% (w/v). Five CMPs could promote osteoblast differentiation. 
Our results demonstrated the potential for using computer-aided methods to design functional self-assembling CMPs.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11650526/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142833838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comprehensive bioinformatics and machine learning analyses for breast cancer staging using TCGA dataset.","authors":"Saurav Chandra Das, Wahia Tasnim, Humayan Kabir Rana, Uzzal Kumar Acharjee, Md Manowarul Islam, Rabea Khatun","doi":"10.1093/bib/bbae628","DOIUrl":"10.1093/bib/bbae628","url":null,"abstract":"<p><p>Breast cancer is an alarming global health concern, including a vast and varied set of illnesses with different molecular characteristics. The fusion of sophisticated computational methodologies with extensive biological datasets has emerged as an effective strategy for unravelling complex patterns in oncology. This research delves into breast cancer staging, classification, and diagnosis by leveraging the comprehensive dataset provided by The Cancer Genome Atlas (TCGA). By integrating advanced machine learning algorithms with bioinformatics analysis, it introduces a cutting-edge methodology for identifying complex molecular signatures associated with different subtypes and stages of breast cancer. This study utilizes TCGA gene expression data to detect and categorize breast cancer through the application of machine learning and systems biology techniques. Researchers identified differentially expressed genes in breast cancer and analyzed them using signaling pathways, protein-protein interactions, and regulatory networks to uncover potential therapeutic targets. The study also highlights the roles of specific proteins (MYH2, MYL1, MYL2, MYH7) and microRNAs (such as hsa-let-7d-5p) that are potential biomarkers in cancer progression, based on several analyses. In terms of diagnostic accuracy for cancer staging, the random forest method achieved 97.19%, while the XGBoost algorithm attained 95.23%. Bioinformatics and machine learning meet in this study to find potential biomarkers that influence the progression of breast cancer. 
The combination of sophisticated analytical methods and extensive genomic datasets presents a promising path for expanding our understanding and enhancing clinical outcomes in identifying and categorizing this intricate illness.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11630003/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142827380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VirDetect-AI: a residual and convolutional neural network-based metagenomic tool for eukaryotic viral protein identification.","authors":"Alida Zárate, Lorena Díaz-González, Blanca Taboada","doi":"10.1093/bib/bbaf001","DOIUrl":"10.1093/bib/bbaf001","url":null,"abstract":"<p><p>This study addresses the challenging task of identifying viruses within metagenomic data, which encompasses a broad array of biological samples, including animal reservoirs, environmental sources, and the human body. Traditional methods for virus identification often face limitations due to the diversity and rapid evolution of viral genomes. In response, recent efforts have focused on leveraging artificial intelligence (AI) techniques to enhance accuracy and efficiency in virus detection. However, existing AI-based approaches are primarily binary classifiers, lacking specificity in identifying viral types and reliant on nucleotide sequences. To address these limitations, VirDetect-AI, a novel tool specifically designed for the identification of eukaryotic viruses within metagenomic datasets, is introduced. The VirDetect-AI model employs a combination of convolutional neural networks and residual neural networks to effectively extract hierarchical features and detailed patterns from complex amino acid genomic data. The model performed strongly on all metrics, with a sensitivity of 0.97, a precision of 0.98, and an F1-score of 0.98. VirDetect-AI improves our comprehension of viral ecology and can accurately classify metagenomic sequences into 980 viral protein classes, hence enabling the identification of new viruses. 
These classes encompass an extensive array of viral genera and families, as well as protein functions and hosts.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11729733/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142977613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Explainable deep neural networks for predicting sample phenotypes from single-cell transcriptomics.","authors":"Jordi Martorell-Marugán, Raúl López-Domínguez, Juan Antonio Villatoro-García, Daniel Toro-Domínguez, Marco Chierici, Giuseppe Jurman, Pedro Carmona-Sáez","doi":"10.1093/bib/bbae673","DOIUrl":"10.1093/bib/bbae673","url":null,"abstract":"<p><p>Recent advances in single-cell RNA-Sequencing (scRNA-Seq) technologies have revolutionized our ability to gather molecular insights into different phenotypes at the level of individual cells. The analysis of the resulting data poses significant challenges, and proper statistical methods are required to analyze and extract information from scRNA-Seq datasets. Sample classification based on gene expression data has proven effective and valuable for precision medicine applications. However, standard classification schemas are often not suitable for scRNA-Seq data due to their unique characteristics, and new algorithms are required to effectively analyze and classify samples at the single-cell level. Furthermore, existing methods for this purpose have limitations in their usability. These reasons motivated us to develop singleDeep, an end-to-end pipeline that streamlines the analysis of scRNA-Seq data by training deep neural networks, enabling robust prediction and characterization of sample phenotypes. We used singleDeep to make predictions on scRNA-Seq datasets from different conditions, including systemic lupus erythematosus, Alzheimer's disease and coronavirus disease 2019. Our results demonstrate strong diagnostic performance, validated both internally and externally. Moreover, singleDeep outperformed traditional machine learning methods and alternative single-cell approaches. In addition to prediction accuracy, singleDeep provides valuable insights into cell types and gene importance estimation for phenotypic characterization. This functionality provided additional and valuable information in our use cases. 
For instance, we corroborated that some interferon signature genes are consistently relevant for autoimmunity across all immune cell types in lupus. On the other hand, we discovered that genes linked to dementia have relevant roles in specific brain cell populations, such as APOE in astrocytes.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11735047/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143000431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}