Clustering Mixtures with Almost Optimal Separation in Polynomial Time

IF 1.2 · Zone 3 (Computer Science) · Q3 COMPUTER SCIENCE, THEORY & METHODS
Jerry Li, Allen Liu
DOI: 10.1137/22m1538788 (https://doi.org/10.1137/22m1538788)
Publication date: 2024-02-22
Citations: 0

Abstract

SIAM Journal on Computing, Ahead of Print.
We consider the problem of clustering mixtures of mean-separated Gaussians in high dimensions. We are given samples from a mixture of [math] identity covariance Gaussians, so that the minimum distance between any pair of means is at least [math], for some parameter [math], and the goal is to recover the ground truth clustering of these samples. It is folklore that separation [math] is both necessary and sufficient to recover a good clustering (say, with constant or [math] error), at least information-theoretically. However, the estimators which achieve this guarantee are inefficient. We give the first algorithm which runs in polynomial time in both [math] and the dimension [math], and which almost matches this guarantee. More precisely, we give an algorithm which takes polynomially many samples and time, and which can successfully recover a good clustering, so long as the separation is [math], for any [math]. Previously, polynomial-time algorithms were only known for this problem when the separation was polynomial in [math], and all algorithms which could tolerate [math] separation required quasipolynomial time. We also extend our result to mixtures of translations of a distribution which satisfies the Poincaré inequality, under additional mild assumptions. Our main technical tool, which we believe is of independent interest, is a novel way to implicitly represent and estimate high-degree moments of a distribution, which allows us to extract important information about high-degree moments without ever writing down the full moment tensors explicitly.
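To make the problem setup concrete, the following is a minimal sketch of the data model only; it is not the paper's algorithm. The names `k` (number of components), `d` (dimension), and `delta` (pairwise mean separation) are assumptions standing in for the elided [math] symbols, and the uniform mixing weights and the placement of means on scaled basis vectors are illustrative choices. The sketch assigns each sample to the nearest true mean, which a real algorithm cannot do because the means are unknown; it only shows how the separation controls the error that even this oracle rule achieves.

```python
import math
import random


def sample_mixture(means, n, rng):
    """Draw n points from a uniform mixture of identity-covariance Gaussians."""
    points, labels = [], []
    for _ in range(n):
        j = rng.randrange(len(means))  # pick a component uniformly at random
        x = [m + rng.gauss(0.0, 1.0) for m in means[j]]  # sample from N(mean_j, I)
        points.append(x)
        labels.append(j)
    return points, labels


def nearest_mean(x, means):
    """Index of the mean closest to x in Euclidean distance."""
    return min(range(len(means)), key=lambda j: math.dist(x, means[j]))


def oracle_error(k, d, delta, n, seed=0):
    """Fraction of samples misassigned by the nearest-true-mean rule.

    Hypothetical helper for illustration; requires d >= k.
    """
    if d < k:
        raise ValueError("this sketch places means on basis vectors, so d >= k")
    rng = random.Random(seed)
    # Scaled standard basis vectors: every pair of means is exactly delta apart.
    scale = delta / math.sqrt(2.0)
    means = [[scale if i == j else 0.0 for i in range(d)] for j in range(k)]
    points, labels = sample_mixture(means, n, rng)
    wrong = sum(nearest_mean(x, means) != y for x, y in zip(points, labels))
    return wrong / n


if __name__ == "__main__":
    for delta in (0.5, 2.0, 10.0):
        print(delta, oracle_error(k=4, d=8, delta=delta, n=500))
```

With well-separated means the oracle rule makes almost no mistakes, while with a small separation the components overlap and no rule can label the samples reliably. The difficulty addressed in the abstract is achieving a good clustering efficiently without access to the means.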
Source journal
SIAM Journal on Computing (Engineering & Technology, Computer Science: Theory & Methods)
CiteScore: 4.60
Self-citation rate: 0.00%
Articles published: 68
Review time: 6-12 weeks
Journal description: The SIAM Journal on Computing aims to provide coverage of the most significant work going on in the mathematical and formal aspects of computer science and nonnumerical computing. Submissions must be clearly written and make a significant technical contribution. Topics include but are not limited to analysis and design of algorithms, algorithmic game theory, data structures, computational complexity, computational algebra, computational aspects of combinatorics and graph theory, computational biology, computational geometry, computational robotics, the mathematical aspects of programming languages, artificial intelligence, computational learning, databases, information retrieval, cryptography, networks, distributed computing, parallel algorithms, and computer architecture.