{"title":"Clustering scRNA-seq data with the cross-view collaborative information fusion strategy.","authors":"Zhengzheng Lou, Xiaojiao Wei, Yuanhao Hu, Shizhe Hu, Yucong Wu, Zhen Tian","doi":"10.1093/bib/bbae511","DOIUrl":"https://doi.org/10.1093/bib/bbae511","url":null,"abstract":"<p><p>Single-cell RNA sequencing (scRNA-seq) technology has revolutionized biological research by enabling high-throughput, cellular-resolution gene expression profiling. A critical step in scRNA-seq data analysis is cell clustering, which supports downstream analyses. However, the high-dimensional and sparse nature of scRNA-seq data poses significant challenges to existing clustering methods. Furthermore, integrating gene expression information with potential cell structure data remains largely unexplored. Here, we present scCFIB, a novel information bottleneck (IB)-based clustering algorithm that leverages the power of IB for efficient processing of high-dimensional sparse data and incorporates a cross-view fusion strategy to achieve robust cell clustering. scCFIB constructs a multi-feature space by establishing two distinct views from the original features. We then formulate the cell clustering problem as a target loss function within the IB framework, employing a collaborative information fusion strategy. To further optimize scCFIB's performance, we introduce a novel sequential optimization approach through an iterative process. Benchmarking against established methods on diverse scRNA-seq datasets demonstrates that scCFIB achieves superior performance in scRNA-seq data clustering tasks. 
Availability: the source code is publicly available on GitHub: https://github.com/weixiaojiao/scCFIB.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11473192/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GexMolGen: cross-modal generation of hit-like molecules via large language model encoding of gene expression signatures.","authors":"Jiabei Cheng, Xiaoyong Pan, Yi Fang, Kaiyuan Yang, Yiming Xue, Qingran Yan, Ye Yuan","doi":"10.1093/bib/bbae525","DOIUrl":"10.1093/bib/bbae525","url":null,"abstract":"<p><p>Designing de novo molecules with specific biological activity is an essential task since it holds the potential to bypass the exploration of target genes, which is an initial step in the modern drug discovery paradigm. However, traditional methods mainly screen molecules by comparing the desired molecular effects within the documented experimental results. The data set limits this process, and it is hard to conduct direct cross-modal comparisons. Therefore, we propose a solution based on cross-modal generation called GexMolGen (Gene Expression-based Molecule Generator), which generates hit-like molecules using gene expression signatures alone. These signatures are calculated by inputting control and desired gene expression states. Our model GexMolGen adopts a \"first-align-then-generate\" strategy, aligning the gene expression signatures and molecules within a mapping space, ensuring a smooth cross-modal transition. The transformed molecular embeddings are then decoded into molecular graphs. In addition, we employ an advanced single-cell large language model for input flexibility and pre-train a scaffold-based molecular model to ensure that all generated molecules are 100% valid. Empirical results show that our model can produce molecules highly similar to known references, whether feeding in- or out-of-domain transcriptome data. 
Furthermore, it can also serve as a reliable tool for cross-modal screening.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11514063/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142520981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cell cycle expression heterogeneity predicts degree of differentiation.","authors":"Kathleen Noller, Patrick Cahan","doi":"10.1093/bib/bbae536","DOIUrl":"10.1093/bib/bbae536","url":null,"abstract":"<p><p>Methods that predict fate potential or degree of differentiation from transcriptomic data have identified rare progenitor populations and uncovered developmental regulatory mechanisms. However, some state-of-the-art methods are too computationally burdensome for emerging large-scale data and all methods make inaccurate predictions in certain biological systems. We developed a method in R (stemFinder) that predicts single cell differentiation time based on heterogeneity in cell cycle gene expression. Our method is computationally tractable and is as good as or superior to competitors. As part of our benchmarking, we implemented four different performance metrics to assist potential users in selecting the tool that is most apt for their application. Finally, we explore the relationship between differentiation time and cell fate potential by analyzing a lineage tracing dataset with clonally labelled hematopoietic cells, revealing that metrics of differentiation time are correlated with the number of downstream lineages.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11500603/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142495336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepCheck: multitask learning aids in assessing microbial genome quality.","authors":"Guo Wei, Nannan Wu, Kunyang Zhao, Sihai Yang, Long Wang, Yan Liu","doi":"10.1093/bib/bbae539","DOIUrl":"https://doi.org/10.1093/bib/bbae539","url":null,"abstract":"<p><p>Metagenomic analyses facilitate the exploration of the microbial world, advancing our understanding of microbial roles in ecological and biological processes. A pivotal aspect of metagenomic analysis involves assessing the quality of metagenome-assembled genomes (MAGs), crucial for accurate biological insights. Current machine learning-based methods often treat completeness and contamination prediction as separate tasks, overlooking their inherent relationship and limiting models' generalization. In this study, we present DeepCheck, a multitasking deep learning framework for simultaneous prediction of MAG completeness and contamination. DeepCheck consistently outperforms existing tools in accuracy across various experimental settings and demonstrates comparable speed while maintaining high predictive accuracy even for new lineages. Additionally, we employ interpretable machine learning techniques to identify specific genes and pathways that drive the model's predictions, enabling independent investigation and assessment of these biological elements for deeper insights.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11495869/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142495338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust variable selection methods with Cox model-a selective practical benchmark study.","authors":"Yunwei Zhang, Samuel Muller","doi":"10.1093/bib/bbae508","DOIUrl":"10.1093/bib/bbae508","url":null,"abstract":"<p><p>With the advancement of biological and medical techniques, we can now obtain large amounts of high-dimensional omics data with censored survival information. This presents challenges in method development across various domains, particularly in variable selection. Given the inherently skewed distribution of the survival time outcome variable, robust variable selection methods offer potential solutions. Recently, there has been a focus on extending robust variable selection methods from linear regression models to survival models. However, despite these developments, robust methods are currently rarely used in practical applications, possibly due to a limited appreciation of their overall good performance. To address this gap, we conduct a selective review comparing the variable selection performance of twelve robust and non-robust penalised Cox models. Our study reveals the intricate relationship among covariates, survival outcomes, and modeling approaches, demonstrating how subtle variations can significantly impact the performance of methods considered. Based on our empirical research, we recommend the use of robust Cox models for variable selection in practice based on their superior performance in presence of outliers while maintaining good efficiency and accuracy when there are no outliers. 
This study provides valuable insights for method development and application, contributing to a better understanding of the relationship between correlated covariates and censored outcomes.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11472364/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An unbiased comparison of immunoglobulin sequence aligners.","authors":"Thomas Konstantinovsky, Ayelet Peres, Pazit Polak, Gur Yaari","doi":"10.1093/bib/bbae556","DOIUrl":"https://doi.org/10.1093/bib/bbae556","url":null,"abstract":"<p><p>Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) is critical for our understanding of the adaptive immune system's dynamics in health and disease. Reliable analysis of AIRR-seq data depends on accurate rearranged immunoglobulin (Ig) sequence alignment. Various Ig sequence aligners exist, but there is no unified benchmarking standard representing the complexities of AIRR-seq data, obscuring objective comparisons of aligners across tasks. Here, we introduce GenAIRR, a modular simulation framework for generating Ig sequences alongside their ground truths. GenAIRR realistically simulates the intricacies of V(D)J recombination, somatic hypermutation, and an array of sequence corruptions. We comprehensively assessed prominent Ig sequence aligners across various metrics, unveiling unique performance characteristics for each aligner. The GenAIRR-produced datasets, combined with the proposed rigorous evaluation criteria, establish a solid basis for unbiased benchmarking of immunogenetics computational tools. 
It sets up the ground for further improving the crucial task of Ig sequence alignment, ultimately enhancing our understanding of adaptive immunity.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142567260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VISTA: an integrated framework for structural variant discovery","authors":"Varuni Sarwal, Seungmo Lee, Jianzhi Yang, Sriram Sankararaman, Mark Chaisson, Eleazar Eskin, Serghei Mangul","doi":"10.1093/bib/bbae462","DOIUrl":"https://doi.org/10.1093/bib/bbae462","url":null,"abstract":"Structural variation (SV) refers to insertions, deletions, inversions, and duplications in human genomes. SVs are present in approximately 1.5% of the human genome. Still, this small subset of genetic variation has been implicated in the pathogenesis of psoriasis, Crohn’s disease and other autoimmune disorders, autism spectrum and other neurodevelopmental disorders, and schizophrenia. Since identifying structural variants is an important problem in genetics, several specialized computational techniques have been developed to detect structural variants directly from sequencing data. With advances in whole-genome sequencing (WGS) technologies, a plethora of SV detection methods have been developed. However, dissecting SVs from WGS data remains a challenge, with the majority of SV detection methods prone to a high false-positive rate, and no existing method able to precisely detect a full range of SVs present in a sample. Previous studies have shown that none of the existing SV callers can maintain high accuracy across various SV lengths and genomic coverages. Here, we report an integrated structural variant calling framework, Variant Identification and Structural Variant Analysis (VISTA), that leverages the results of individual callers using a novel and robust filtering and merging algorithm. In contrast to existing consensus-based tools which ignore the length and coverage, VISTA overcomes this limitation by executing various combinations of top-performing callers based on variant length and genomic coverage to generate SV events with high accuracy. We evaluated the performance of VISTA on comprehensive gold-standard datasets across varying organisms and coverage. 
We benchmarked VISTA using the Genome-in-a-Bottle gold standard SV set, haplotype-resolved de novo assemblies from the Human Pangenome Reference Consortium, along with an in-house polymerase chain reaction (PCR)-validated mouse gold standard set. VISTA maintained the highest F1 score among top consensus-based tools measured using a comprehensive gold standard across both mouse and human genomes. VISTA also has an optimized mode, where the calls can be optimized for precision or recall. VISTA-optimized can attain 100% precision and the highest sensitivity among other variant callers. In conclusion, VISTA represents a significant advancement in structural variant calling, offering a robust and accurate framework that outperforms existing consensus-based tools and sets a new standard for SV detection in genomic research.","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":9.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel systems biology experimental pipeline reveals matairesinol’s antimetastatic potential in prostate cancer: an integrated approach of network pharmacology, bioinformatics, and experimental validation","authors":"Rama Rajadnya, Nidhi Sharma, Akanksha Mahajan, Amrita Ulhe, Rajesh Patil, Mahabaleshwar Hegde, Aniket Mali","doi":"10.1093/bib/bbae466","DOIUrl":"https://doi.org/10.1093/bib/bbae466","url":null,"abstract":"Matairesinol (MAT), a plant lignan renowned for its anticancer properties in hormone-sensitive cancers like breast and prostate cancers, presents a promising yet underexplored avenue in the treatment of metastatic prostate cancer (mPC). To elucidate its specific therapeutic targets and mechanisms, our study adopted an integrative approach, amalgamating network pharmacology (NP), bioinformatics, GeneMANIA-based functional association (GMFA), and experimental validation. By mining online databases, we identified 27 common targets of mPC and MAT, constructing a MAT-mPC protein–protein interaction network via STRING and pinpointing 11 hub targets (EGFR, AKT1, ERBB2, MET, IGF1, CASP3, HSP90AA1, HIF1A, MMP2, HGF, and MMP9) with cytoHubba. Utilizing DAVID, Gene Ontology (GO) analysis highlighted metastasis-related processes such as epithelial–mesenchymal transition, positive regulation of cell migration, and key Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, including cancer, prostate cancer, PI3K-Akt, and MAPK signaling, while web resources such as UALCAN and GEPIA2 affirmed the clinical significance of the 11 hub targets in mPC patient survival analysis and gene expression patterns. Our innovative GMFA enrichment method further enriched network pharmacology findings. Molecular docking analyses demonstrated substantial interactions between MAT and the 11 hub targets. Simulation studies confirmed the stable interactions of MAT with selected targets. 
Experimental validation in PC3 cells, employing quantitative real-time reverse-transcription PCR and various cell-based assays, corroborated MAT’s antimetastatic effects on mPC. Thus, this exhaustive NP analysis, complemented by GMFA, molecular docking, molecular dynamics simulations, and experimental validations, underscores MAT’s multifaceted role in targeting mPC through diverse therapeutic avenues. Nevertheless, comprehensive in vitro validation is imperative to solidify these findings.","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":9.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PGBind: pocket-guided explicit attention learning for protein–ligand docking","authors":"Ao Shen, Mingzhi Yuan, Yingfan Ma, Jie Du, Manning Wang","doi":"10.1093/bib/bbae455","DOIUrl":"https://doi.org/10.1093/bib/bbae455","url":null,"abstract":"As more and more protein structures are discovered, blind protein–ligand docking will play an important role in drug discovery because it can predict protein–ligand complex conformation without pocket information on the target proteins. Recently, deep learning-based methods have made significant advancements in blind protein–ligand docking, but their protein features are suboptimal because they do not fully consider the difference between potential pocket regions and non-pocket regions in protein feature extraction. In this work, we propose a pocket-guided strategy for guiding the ligand to dock to potential docking regions on a protein. To this end, we design a plug-and-play module to enhance the protein features, which can be directly incorporated into existing deep learning-based blind docking methods. The proposed module first estimates potential pocket regions on the target protein and then leverages a pocket-guided attention mechanism to enhance the protein features. 
Experiments integrating our method with EquiBind and FABind show that the blind-docking performance of both is significantly improved, and integration with FABind achieves new state-of-the-art performance.","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":9.5,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GeneRaMeN enables integration, comparison, and meta-analysis of multiple ranked gene lists to identify consensus, unique, and correlated genes","authors":"Meisam Yousefi, Wayne Ren See, Kam Leng Aw-Yong, Wai Suet Lee, Cythia Lingli Yong, Felic Fanusi, Gavin J D Smith, Eng Eong Ooi, Shang Li, Sujoy Ghosh, Yaw Shin Ooi","doi":"10.1093/bib/bbae452","DOIUrl":"https://doi.org/10.1093/bib/bbae452","url":null,"abstract":"High-throughput experiments often produce ranked gene outputs, with forward genetic screening being a notable example. While there are various tools for analyzing individual datasets, those that perform comparative and meta-analytical examination of such ranked gene lists remain scarce. Here, we introduce Gene Rank Meta Analyzer (GeneRaMeN), an R Shiny tool utilizing rank statistics to facilitate the identification of consensus, unique, and correlated genes across multiple hit lists. We focused on two key topics to showcase GeneRaMeN: virus host factors and cancer dependencies. Using GeneRaMeN ‘Rank Aggregation’, we integrated 24 published and new flavivirus genetic screening datasets, including dengue, Japanese encephalitis, and Zika viruses. This meta-analysis yielded a consensus list of flavivirus host factors, elucidating the significant influence of cell line selection on screening outcomes. Similar analysis on 13 SARS-CoV-2 CRISPR screening datasets highlighted the pivotal role of meta-analysis in revealing redundant biological pathways exploited by the virus to enter human cells. Such redundancy was further underscored using GeneRaMeN’s ‘Rank Correlation’, where a strong negative correlation was observed for host factors implicated in one entry pathway versus the alternate route. Utilizing GeneRaMeN’s ‘Rank Uniqueness’, we analyzed human coronaviruses 229E, OC43, and SARS-CoV-2 datasets, identifying host factors uniquely associated with a defined subset of the screening datasets. 
Similar analyses were performed on over 1000 Cancer Dependency Map (DepMap) datasets spanning 19 human cancer types to reveal unique cancer vulnerabilities for each organ/tissue. GeneRaMeN, an efficient tool to integrate and maximize the usability of genetic screening datasets, is freely accessible via https://ysolab.shinyapps.io/GeneRaMeN.","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":9.5,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}