scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM
Shang-Jung Wen, Jia-Ming Chang, Fang Yu
arXiv - QuanBio - Genomics, 2024-07-24
https://doi.org/arxiv-2407.16984
Abstract
High-dimensional single-cell data poses significant challenges in identifying
underlying biological patterns due to the complexity and heterogeneity of
cellular states. We propose a comprehensive approach to visualizing gene-cell
dependencies through unsupervised clustering with the Growing Hierarchical
Self-Organizing Map (GHSOM), designed for high-dimensional single-cell data such
as single-cell sequencing and CRISPR screens. GHSOM clusters samples into a
hierarchy whose self-growing structure satisfies the required variation between
and within clusters. We also propose a novel
Significant Attributes Identification Algorithm to identify features that
distinguish clusters. This algorithm pinpoints attributes with minimal
variation within a cluster but substantial variation between clusters. These
key attributes can then be used for targeted data retrieval and downstream
analysis. Furthermore, we present two innovative visualization tools: Cluster
Feature Map and Cluster Distribution Map. The Cluster Feature Map highlights
the distribution of specific features across the hierarchical structure of
GHSOM clusters. This allows for rapid visual assessment of cluster uniqueness
based on chosen features. The Cluster Distribution Map depicts leaf clusters as
circles on the GHSOM grid, with circle size reflecting cluster data size and
color customizable to visualize features such as cell type or other attributes. We
apply our analysis to three single-cell datasets and one CRISPR dataset
(a cell-gene database) and evaluate clustering methods with an internal score,
the Calinski-Harabasz index (CH), and an external score, the Adjusted Rand Index
(ARI). GHSOM is the best performer in the internal evaluation (CH=4.2) and
ranks third among all methods in the external evaluation.
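
The abstract characterizes the Significant Attributes Identification Algorithm only by its goal: attributes with minimal variation within a cluster and substantial variation between clusters. As a rough illustration of that criterion, and not the authors' algorithm, the Python sketch below ranks attributes by the ratio of between-cluster to within-cluster variance. The function name `rank_attributes`, the ratio score, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pvariance


def rank_attributes(rows, labels, names):
    """Rank attributes by between-cluster over within-cluster variance.

    Hypothetical scoring rule for illustration only; the paper's actual
    algorithm is not specified in the abstract.
    """
    # Group the data rows by their cluster label.
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)

    n = len(rows)
    eps = 1e-12  # avoids division by zero when every cluster is uniform
    scores = {}
    for j, name in enumerate(names):
        overall = mean(r[j] for r in rows)
        within = 0.0
        between = 0.0
        for members in groups.values():
            values = [r[j] for r in members]
            # Size-weighted variance of the attribute inside this cluster.
            within += len(values) * pvariance(values)
            # Size-weighted squared distance of the cluster mean
            # from the overall mean.
            between += len(values) * (mean(values) - overall) ** 2
        scores[name] = (between / n) / (within / n + eps)

    # Highest ratio first: tight within clusters, spread between them.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (made up): "gene_a" separates the two clusters, "gene_b" does not.
rows = [
    [0.10, 5.0],
    [0.20, 1.0],
    [0.90, 4.0],
    [1.00, 2.0],
]
labels = [0, 0, 1, 1]
ranking = rank_attributes(rows, labels, ["gene_a", "gene_b"])
print([name for name, _ in ranking])
```

In this toy input, `gene_a` is tightly grouped inside each cluster and well separated between them, so it ranks above `gene_b`, whose cluster means coincide. A real pipeline would take the cluster labels from the GHSOM leaf assignments rather than from a hand-written list.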