scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM
Authors: Shang-Jung Wen, Jia-Ming Chang, Fang Yu
Source: arXiv - QuanBio - Genomics, 2024-07-24
DOI: arxiv-2407.16984 (https://doi.org/arxiv-2407.16984)
Citations: 0
Abstract
High-dimensional single-cell data poses significant challenges in identifying
underlying biological patterns due to the complexity and heterogeneity of
cellular states. We propose a comprehensive gene-cell dependency visualization
built on an unsupervised clustering method, the Growing Hierarchical
Self-Organizing Map (GHSOM), for analyzing high-dimensional single-cell data
such as single-cell sequencing and CRISPR screens. GHSOM clusters samples into
a hierarchy whose self-growing structure satisfies the required variation
between and within clusters. We also introduce a novel
Significant Attributes Identification Algorithm to identify features that
distinguish clusters. This algorithm pinpoints attributes with minimal
variation within a cluster but substantial variation between clusters. These
key attributes can then be used for targeted data retrieval and downstream
analysis. Furthermore, we present two innovative visualization tools: Cluster
Feature Map and Cluster Distribution Map. The Cluster Feature Map highlights
the distribution of specific features across the hierarchical structure of
GHSOM clusters. This allows for rapid visual assessment of cluster uniqueness
based on chosen features. The Cluster Distribution Map depicts leaf clusters as
circles on the GHSOM grid, with circle size reflecting cluster data size and
color customizable to visualize features like cell type or other attributes. We
apply our analysis to three single-cell datasets and one CRISPR dataset
(cell-gene database) and evaluate clustering methods with an internal score
(CH) and an external score (ARI). GHSOM performs best in the internal
evaluation (CH=4.2) and ranks third among all methods in the external
evaluation.
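
The idea behind the Significant Attributes Identification Algorithm (attributes with minimal variation within a cluster but substantial variation between clusters) can be illustrated with a minimal sketch. The abstract does not give the actual scoring rule, so the between/within variance ratio used here is an assumption, and the names `attribute_scores` and `significant_attributes` are hypothetical.

```python
from statistics import mean, pvariance


def attribute_scores(rows, labels):
    """Score each attribute by between-cluster / within-cluster variance.

    rows   : list of equal-length numeric feature vectors, one per sample
    labels : cluster label for each row
    Higher scores mean the attribute is stable inside clusters but differs
    between them. This ratio is an illustrative stand-in (an assumption),
    not the scoring rule used by scGHSOM.
    """
    # Group samples by their cluster label.
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)

    scores = []
    for j in range(len(rows[0])):
        # Values of attribute j, one list per cluster.
        cols = [[r[j] for r in members] for members in clusters.values()]
        within = mean(pvariance(c) for c in cols)      # average spread inside clusters
        between = pvariance([mean(c) for c in cols])   # spread of the cluster means
        scores.append(between / (within + 1e-12))      # epsilon avoids division by zero
    return scores


def significant_attributes(rows, labels, top_k=1):
    """Return the indices of the top_k highest-scoring attributes."""
    scores = attribute_scores(rows, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k]


# Toy data: attribute 0 separates the two clusters, attribute 1 is noise.
rows = [
    [1.0, 5.0], [1.1, 9.0], [0.9, 1.0],   # cluster "a"
    [4.0, 5.2], [4.1, 8.8], [3.9, 1.1],   # cluster "b"
]
labels = ["a", "a", "a", "b", "b", "b"]
scores = attribute_scores(rows, labels)
top = significant_attributes(rows, labels, top_k=1)
print(top)  # [0]
```

On the toy data, attribute 0 is tight within each cluster and far apart between them, so it ranks first. Attribute 1 varies widely inside both clusters and scores near zero.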
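
The two evaluation scores can also be sketched in plain Python. This assumes that CH and ARI carry their usual meanings, the Calinski-Harabasz index (internal, needs only the data and the predicted clusters) and the adjusted Rand index (external, compares predicted clusters with reference labels such as cell type). The function names and toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """External score: chance-corrected agreement between two partitions.

    Assumes at least two samples.
    """
    n = len(labels_true)
    # Pairs of samples grouped together in both partitions / in each partition.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def calinski_harabasz(rows, labels):
    """Internal score: between-cluster over within-cluster dispersion.

    Assumes 2 <= number of clusters < number of samples.
    """
    n, dim = len(rows), len(rows[0])
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    k = len(clusters)

    def centroid(points):
        return [sum(p[j] for p in points) / len(points) for j in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(rows)
    between = within = 0.0
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    return (between / (k - 1)) / (within / (n - k))


# Toy example: two well-separated clusters.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
predicted = [0, 0, 1, 1]
reference = ["T", "T", "B", "B"]  # hypothetical cell-type labels

ari = adjusted_rand_index(reference, predicted)
ch = calinski_harabasz(points, predicted)
print(ari, ch)  # 1.0 400.0
```

Higher values are better for both scores. ARI does not depend on how the clusters are named, so relabelled but identical partitions still score 1.0.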