Selective inference for k-means clustering.

IF 5.2 | Zone 3, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research Pub Date : 2023-05-01

Yiqun T Chen, Daniela M Witten

Citations: 0

Abstract

We consider the problem of testing for a difference in means between clusters of observations identified via $k$-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of $k$-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the $k$-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using $k$-means clustering in finite samples, and can be efficiently computed. We apply our proposal to hand-written digits data and to single-cell RNA-sequencing data.
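The inflation of the Type I error described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' method or code: it only shows the problem their p-value addresses. It assumes one-dimensional data drawn from a single standard normal distribution (so the null hypothesis of no difference in means holds), a hand-written two-cluster Lloyd's algorithm, and a naive two-sample z-test with known unit variance. The function names `two_means_1d`, `naive_p_value`, and `naive_type_one_error` are hypothetical and chosen for illustration only.

```python
import math
import random


def two_means_1d(x, n_iter=50):
    """Lloyd's algorithm with k=2 on 1-D data, initialized at the min and max."""
    c0, c1 = min(x), max(x)
    labels = None
    for _ in range(n_iter):
        new_labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in x]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        g0 = [v for v, lab in zip(x, labels) if lab == 0]
        g1 = [v for v, lab in zip(x, labels) if lab == 1]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return labels


def naive_p_value(x, labels, sigma=1.0):
    """Two-sided z-test that ignores the fact that the labels came from x."""
    g0 = [v for v, lab in zip(x, labels) if lab == 0]
    g1 = [v for v, lab in zip(x, labels) if lab == 1]
    diff = sum(g1) / len(g1) - sum(g0) / len(g0)
    z = diff / (sigma * math.sqrt(1.0 / len(g0) + 1.0 / len(g1)))
    return math.erfc(abs(z) / math.sqrt(2.0))


def naive_type_one_error(n_reps=500, n=30, alpha=0.05, seed=0):
    """Fraction of null data sets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one true cluster only
        labels = two_means_1d(x)
        if naive_p_value(x, labels) < alpha:
            rejections += 1
    return rejections / n_reps


if __name__ == "__main__":
    print("nominal level: 0.05")
    print("empirical rejection rate:", naive_type_one_error())
```

Because the clusters are chosen to separate the data, the naive test rejects far more often than the nominal 5% level even though all observations share the same mean. A selective p-value such as the one proposed in the paper is meant to correct for this by conditioning on the clustering.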




Source journal

Journal of Machine Learning Research (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 18.80

Self-citation rate: 0.00%

Articles published:

Review time: 3 months

About the journal: The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online. JMLR has a commitment to rigorous yet rapid reviewing. JMLR seeks previously unpublished papers on machine learning that contain: new principled algorithms with sound empirical validation, and with justification of theoretical, psychological, or biological nature; experimental and/or theoretical studies yielding new insight into the design and behavior of learning in intelligent systems; accounts of applications of existing techniques that shed light on the strengths and weaknesses of the methods; formalization of new learning tasks (e.g., in the context of new applications) and of methods for assessing performance on those tasks; development of new analytical frameworks that advance theoretical studies of practical learning methods; computational models of data from natural learning systems at the behavioral or neural level; or extremely well-written surveys of existing work.