Powerful significance testing for unbalanced clusters.

Impact factor 1.8 · Zone 2, Mathematics · JCR Q2, Statistics & Probability
Thomas H Keefe, J S Marron
Journal: Journal of Computational and Graphical Statistics
Published: 2025-04-16
Publication type: Journal Article
DOI: https://doi.org/10.1080/10618600.2025.2469756
PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12338451/pdf/

Abstract

Clustering methods are popular for revealing structure in data, particularly in the high-dimensional setting common to contemporary data science. A central statistical question is "are the clusters really there?" One pioneering method in statistical cluster validation is SigClust, but it is severely underpowered in the important setting where the candidate clusters have unbalanced sizes, such as in rare subtypes of disease. We show why this is the case and propose a remedy that is powerful in both the unbalanced and balanced settings, using a novel generalization of k-means clustering. We illustrate the value of our method using a high-dimensional dataset of gene expression in kidney cancer patients. A Python implementation is available at https://github.com/thomaskeefe/sigclust.
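
The question "are the clusters really there?" can be made concrete with a Monte Carlo test in the spirit of SigClust. The sketch below is an illustration only: it is not the authors' implementation, not the API of the linked repository, and not the new method proposed in the paper. It assumes the null hypothesis is a single Gaussian, uses the 2-means cluster index (within-cluster sum of squares divided by total sum of squares) as the test statistic, and, as a simplification, takes the null eigenvalues directly from the sample covariance. The function names are hypothetical, and the data are assumed to have at least two columns.

```python
import numpy as np


def two_means_cluster_index(X, rng, n_init=5, n_iter=50):
    """Smallest 2-means cluster index found over several random starts.

    Cluster index = within-cluster sum of squares / total sum of squares.
    Small values indicate a strong two-cluster split.
    """
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    best = 1.0  # value for the trivial "one cluster" split
    for _ in range(n_init):
        # Start from two distinct data points chosen at random.
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(n_iter):
            # Squared distance from every point to both centers.
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            if labels.all() or not labels.any():
                break  # one cluster emptied; give up on this start
            centers = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dists.min(axis=1).sum() / total_ss)
    return best


def sigclust_style_pvalue(X, n_sim=100, seed=0):
    """Monte Carlo p-value for the 2-means cluster index under a Gaussian null.

    Assumption (simplification): the null covariance eigenvalues are the raw
    sample covariance eigenvalues, with no further adjustment.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    observed = two_means_cluster_index(X, rng)
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0.0, None)
    null = np.empty(n_sim)
    for i in range(n_sim):
        # The cluster index is rotation invariant, so a diagonal covariance
        # with the same eigenvalues is enough for the null simulation.
        Z = rng.standard_normal((n, d)) * np.sqrt(eigvals)
        null[i] = two_means_cluster_index(Z, rng)
    # Proportion of null cluster indices at least as small as the observed one.
    p_value = (1 + (null <= observed).sum()) / (1 + n_sim)
    return observed, p_value
```

Calling `sigclust_style_pvalue(X)` on an `(n, d)` array returns the observed cluster index and a p-value; a small p-value suggests the two-cluster split is stronger than a single Gaussian would typically produce. The abstract's point is that a test of this balanced 2-means form loses power when one candidate cluster is much smaller than the other.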

Source journal
CiteScore: 3.50
Self-citation rate: 8.30%
Articles published: 153
Review time: >12 weeks
About the journal: The Journal of Computational and Graphical Statistics (JCGS) presents the very latest techniques on improving and extending the use of computational and graphical methods in statistics and data analysis. Established in 1992, this journal contains cutting-edge research, data, surveys, and more on numerical graphical displays and methods, and perception. Articles are written for readers who have a strong background in statistics but are not necessarily experts in computing. Published in March, June, September, and December.