nsDCC: dual-level contrastive clustering with nonuniform sampling for scRNA-seq data analysis.

Impact factor: 7.7, Zone 2 (Biology), JCR Q1 (Biochemical Research Methods)

Briefings in Bioinformatics, Pub Date: 2024-09-23, DOI: 10.1093/bib/bbae477

Linjie Wang, Wei Li, Fanghui Zhou, Kun Yu, Chaolu Feng, Dazhe Zhao

Volume 25, Issue 6. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11427072/pdf/

Citations: 0

Abstract

Dimensionality reduction and clustering are crucial tasks in single-cell RNA sequencing (scRNA-seq) data analysis, but current pipelines treat them independently, which prevents each task from benefiting the other. Recent methods jointly optimize the two tasks through deep clustering. However, common deep clustering methods require pre-defined cluster centers; contrastive learning, with its powerful representation capability, can bridge this gap. Therefore, a dual-level contrastive clustering method with nonuniform sampling (nsDCC) is proposed for scRNA-seq data analysis. Dual-level contrastive clustering, which combines instance-level and cluster-level contrast, jointly optimizes dimensionality reduction and clustering. Multi-positive contrastive learning is introduced in the instance-level contrast, and a unit matrix constraint in the cluster-level contrast. Furthermore, an attention mechanism is introduced to capture inter-cellular information, which benefits clustering. Using the proposed nearest boundary sparsest density weight assignment algorithm, nsDCC focuses on important samples at category boundaries and in minority categories, which enables it to capture comprehensive characteristics of imbalanced datasets. Experimental results show that nsDCC outperforms six other state-of-the-art methods on both real and simulated scRNA-seq data, validating its performance on dimensionality reduction and clustering of scRNA-seq data, especially for imbalanced data. Simulation experiments demonstrate that nsDCC is insensitive to "dropout events" in scRNA-seq. Finally, differentially expressed gene analysis of the clusters confirms that the results from nsDCC are meaningful. In summary, nsDCC offers a new way of analyzing and understanding scRNA-seq data.
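To make the idea of combining instance-level and cluster-level contrast more concrete, the following is a minimal NumPy sketch of a generic dual-level contrastive objective. It is not the authors' implementation: the function names `nt_xent` and `dual_level_loss`, the temperature values, and the use of a plain NT-Xent (InfoNCE) loss are assumptions made for illustration. The sketch omits the multi-positive contrast, the unit matrix constraint, the attention mechanism, and the nonuniform sampling weights mentioned in the abstract, because the abstract does not specify their form.

```python
import numpy as np


def nt_xent(a, b, temperature=0.5):
    """Generic NT-Xent (InfoNCE) loss between paired rows of `a` and `b`.

    Row i of `a` and row i of `b` are treated as a positive pair; every
    other row in the concatenated batch acts as a negative.
    """
    n = a.shape[0]
    z = np.concatenate([a, b], axis=0)
    # Normalize rows so that dot products are cosine similarities.
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)  # a row is never its own negative
    # Index of the positive partner for each of the 2n rows.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(
        np.exp(sim - row_max).sum(axis=1, keepdims=True)
    )
    return -log_prob[np.arange(2 * n), pos].mean()


def dual_level_loss(h1, h2, p1, p2, tau_instance=0.5, tau_cluster=1.0):
    """Sum of an instance-level and a cluster-level contrastive loss.

    h1, h2: (cells x features) embeddings of two augmented views.
    p1, p2: (cells x clusters) soft cluster assignments of the same views.
    """
    # Instance level: contrast rows, i.e. individual cells.
    instance = nt_xent(h1, h2, tau_instance)
    # Cluster level: contrast columns, i.e. each cluster's assignment
    # profile across the batch, obtained by transposing.
    cluster = nt_xent(p1.T, p2.T, tau_cluster)
    return instance + cluster
```

In this sketch the same loss function serves both levels: rows of the embedding matrices are contrasted for the instance level, and columns of the soft assignment matrices are contrasted for the cluster level. Because the cluster-level term works directly on assignment probabilities, this kind of objective does not need pre-defined cluster centers.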




Source journal

Briefings in Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 13.20

Self-citation rate: 13.70%

Articles published: 549

Review time: 6 months

About the journal: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.