Statistical significance of clustering for count data.

Impact factor 1.7; Zone 4 (Mathematics); JCR Q3 (Biology)

Biometrics Pub Date : 2025-07-03 DOI:10.1093/biomtc/ujaf120

Yifan Dai, Di Wu, Yufeng Liu

Biometrics, volume 81, issue 3. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448855/pdf/


Abstract

Clustering is widely used in biomedical research for meaningful subgroup identification. However, most existing clustering algorithms do not account for the statistical uncertainty of the resulting clusters and consequently may generate spurious clusters due to natural sampling variation. To address this problem, the Statistical Significance of Clustering (SigClust) method was developed to evaluate the significance of clusters in high-dimensional data. While SigClust has been successful in assessing clustering significance for continuous data, it is not specifically designed for discrete data, such as count data in genomics. Moreover, SigClust and its variations can suffer from reduced statistical power when applied to non-Gaussian high-dimensional data. To overcome these limitations, we propose SigClust-DEV, a method designed to evaluate the significance of clusters in count data. Through extensive simulations, we compare SigClust-DEV against other existing SigClust approaches across various count distributions and demonstrate its superior performance. Furthermore, we apply our proposed SigClust-DEV to Hydra single-cell RNA sequencing (scRNA) data and electronic health records (EHRs) of cancer patients to identify meaningful latent cell types and patient subgroups, respectively.
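The idea of testing whether a two-cluster split is stronger than sampling variation alone would produce can be sketched as a Monte Carlo test. The sketch below is an illustration under stated assumptions, not the SigClust-DEV procedure, whose details the abstract does not give. It uses the 2-means cluster index (within-cluster sum of squares divided by total sum of squares) as the test statistic, and it assumes a null model in which every feature is an independent Poisson variable with a rate estimated from the data. All function names (`poisson`, `two_means_within_ss`, `cluster_index`, `poisson_null_pvalue`) are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's method; adequate for small rates)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(rows):
    """Per-feature mean of a list of rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def two_means_within_ss(data, rng, n_iter=10):
    """Run Lloyd's 2-means from one random start; return within-cluster SS or None."""
    centers = rng.sample(data, 2)
    for _ in range(n_iter):
        groups = [[], []]
        for row in data:
            nearer_second = sq_dist(row, centers[1]) < sq_dist(row, centers[0])
            groups[int(nearer_second)].append(row)
        if not groups[0] or not groups[1]:
            return None  # degenerate split, discard this start
        centers = [centroid(g) for g in groups]
    return sum(sq_dist(row, c) for g, c in zip(groups, centers) for row in g)


def cluster_index(data, rng, n_starts=3):
    """Smallest within-SS / total-SS ratio over several random starts (1.0 = no structure)."""
    center = centroid(data)
    total_ss = sum(sq_dist(row, center) for row in data)
    if total_ss == 0:
        return 1.0
    fits = [two_means_within_ss(data, rng) for _ in range(n_starts)]
    return min([1.0] + [w / total_ss for w in fits if w is not None])


def poisson_null_pvalue(data, n_sim=99, seed=0):
    """Monte Carlo p-value for the cluster index.

    ASSUMPTION: the null is one independent Poisson per feature, with rates
    set to the observed feature means. This is an illustrative choice only.
    """
    rng = random.Random(seed)
    observed = cluster_index(data, rng)
    rates = centroid(data)
    hits = 0
    for _ in range(n_sim):
        null_data = [[poisson(rng, lam) for lam in rates] for _ in range(len(data))]
        if cluster_index(null_data, rng) <= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_sim + 1)
```

Calling `poisson_null_pvalue(counts)` on a list of equal-length count rows returns the observed cluster index and an estimated p-value; a small p-value suggests the split is unlikely under the assumed single-population Poisson null. Real count data are often overdispersed, so this simple null would likely be too narrow in practice.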




Source journal: Biometrics (Biology)

CiteScore: 2.70

Self-citation rate: 5.30%

Articles published: 178

Review time: 4-8 weeks

About the journal: The International Biometric Society is an international society promoting the development and application of statistical and mathematical theory and methods in the biosciences, including agriculture, biomedical science and public health, ecology, environmental sciences, forestry, and allied disciplines. The Society welcomes as members statisticians, mathematicians, biological scientists, and others devoted to interdisciplinary efforts in advancing the collection and interpretation of information in the biosciences. The Society sponsors the biennial International Biometric Conference, held at sites throughout the world; through its National Groups and Regions, it also sponsors regional and local meetings.