Powerful significance testing for unbalanced clusters.

IF 1.8 | CAS Zone 2 (Mathematics) | Q2, Statistics & Probability

Journal of Computational and Graphical Statistics | Pub Date: 2025-04-16 | DOI: 10.1080/10618600.2025.2469756

Thomas H Keefe, J S Marron

Citations: 0

Abstract

Clustering methods are popular for revealing structure in data, particularly in the high-dimensional setting common to contemporary data science. A central statistical question is "are the clusters really there?" One pioneering method in statistical cluster validation is SigClust, but it is severely underpowered in the important setting where the candidate clusters have unbalanced sizes, such as in rare subtypes of disease. We show why this is the case and propose a remedy that is powerful in both the unbalanced and balanced settings, using a novel generalization of $k$-means clustering. We illustrate the value of our method using a high-dimensional dataset of gene expression in kidney cancer patients. A Python implementation is available at https://github.com/thomaskeefe/sigclust.
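
For readers unfamiliar with the kind of test the abstract refers to, the following is a minimal sketch of the classical SigClust idea as it is usually described: compute the 2-means cluster index (within-cluster sum of squares divided by total sum of squares) and compare it with values simulated under a single-Gaussian null. It is not the method proposed in the paper and it does not use the API of the linked package. The function names (`cluster_index`, `two_means`, `sigclust_pvalue`) are hypothetical, and the null here assumes independent coordinates with the sample standard deviations, which is a simplification of the covariance-eigenvalue estimation a real implementation would need.

```python
import random


def _mean(rows):
    """Coordinate-wise mean of a list of equal-length points."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]


def _sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_index(points, labels):
    """Within-cluster sum of squares over total sum of squares (smaller = tighter clusters).

    Assumes the points are not all identical (otherwise the total is zero).
    """
    overall = _mean(points)
    total = sum(_sq_dist(p, overall) for p in points)
    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = _mean(members)
        within += sum(_sq_dist(p, center) for p in members)
    return within / total


def two_means(points, rng, n_restarts=5, n_iter=50):
    """Lloyd's algorithm with k=2; returns (labels, cluster index) of the best restart."""
    best_labels, best_ci = None, float("inf")
    for _ in range(n_restarts):
        centers = rng.sample(points, 2)  # two data points as initial centers
        labels = None
        for _ in range(n_iter):
            new_labels = [
                0 if _sq_dist(p, centers[0]) <= _sq_dist(p, centers[1]) else 1
                for p in points
            ]
            if new_labels == labels:
                break
            labels = new_labels
            for k in (0, 1):
                members = [p for p, lab in zip(points, labels) if lab == k]
                if members:  # keep the old center if a cluster empties
                    centers[k] = _mean(members)
        if len(set(labels)) == 2:
            ci = cluster_index(points, labels)
            if ci < best_ci:
                best_labels, best_ci = labels, ci
    return best_labels, best_ci


def sigclust_pvalue(points, rng, n_sim=100):
    """Monte Carlo p-value: share of null cluster indices at or below the observed one."""
    n, dim = len(points), len(points[0])
    center = _mean(points)
    # Simplifying assumption: independent Gaussian coordinates with sample std devs.
    sds = [
        (sum((p[j] - center[j]) ** 2 for p in points) / (n - 1)) ** 0.5
        for j in range(dim)
    ]
    _, observed = two_means(points, rng)
    hits = 0
    for _ in range(n_sim):
        null = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        _, ci = two_means(null, rng)
        hits += ci <= observed
    return (1 + hits) / (1 + n_sim)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic 2-D data: a balanced 20/20 split and an unbalanced 38/2 split.
    balanced = [[rng.gauss(-5, 1), rng.gauss(0, 1)] for _ in range(20)] + \
               [[rng.gauss(5, 1), rng.gauss(0, 1)] for _ in range(20)]
    unbalanced = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(38)] + \
                 [[rng.gauss(10, 1), rng.gauss(0, 1)] for _ in range(2)]
    print("balanced   p =", sigclust_pvalue(balanced, rng))
    print("unbalanced p =", sigclust_pvalue(unbalanced, rng))
```

Running the script prints one Monte Carlo p-value for a synthetic balanced split and one for a synthetic unbalanced split. The data are made up for illustration only and say nothing about the results reported in the paper.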

Source journal

Journal of Computational and Graphical Statistics (Mathematics: Statistics & Probability)

CiteScore

3.50

Self-citation rate

8.30%

Articles published

153

Review time

>12 weeks

About the journal: The Journal of Computational and Graphical Statistics (JCGS) presents the very latest techniques on improving and extending the use of computational and graphical methods in statistics and data analysis. Established in 1992, this journal contains cutting-edge research, data, surveys, and more on numerical graphical displays and methods, and perception. Articles are written for readers who have a strong background in statistics but are not necessarily experts in computing. Published in March, June, September, and December.