Clustering Mixtures with Almost Optimal Separation in Polynomial Time

IF 1.6, Zone 3 Computer Science, Q3 COMPUTER SCIENCE, THEORY & METHODS

SIAM Journal on Computing, Pub Date: 2024-02-22, DOI: 10.1137/22m1538788

Jerry Li, Allen Liu


Citations: 0


Clustering Mixtures with Almost Optimal Separation in Polynomial Time

SIAM Journal on Computing, Ahead of Print.
Abstract. We consider the problem of clustering mixtures of mean-separated Gaussians in high dimensions. We are given samples from a mixture of [math] identity covariance Gaussians, so that the minimum pairwise distance between any two pairs of means is at least [math], for some parameter [math], and the goal is to recover the ground truth clustering of these samples. It is folklore that separation [math] is both necessary and sufficient to recover a good clustering (say, with constant or [math] error), at least information-theoretically. However, the estimators which achieve this guarantee are inefficient. We give the first algorithm which runs in polynomial time in both [math] and the dimension [math], and which almost matches this guarantee. More precisely, we give an algorithm which takes polynomially many samples and time, and which can successfully recover a good clustering, so long as the separation is [math], for any [math]. Previously, polynomial time algorithms were only known for this problem when the separation was polynomial in [math], and all algorithms which could tolerate [math] separation required quasipolynomial time. We also extend our result to mixtures of translations of a distribution which satisfies the Poincaré inequality, under additional mild assumptions. Our main technical tool, which we believe is of independent interest, is a novel way to implicitly represent and estimate high degree moments of a distribution, which allows us to extract important information about high degree moments without ever writing down the full moment tensors explicitly.
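The last sentence of the abstract can be made concrete with a minimal sketch of the most elementary version of the idea: contracting a high degree moment tensor against a direction without ever materializing the tensor. This is not the algorithm from the paper, whose implicit representation is considerably more involved. It only illustrates, on a toy instance, why information about a degree-t moment can be read off in time linear in the dimension. The function names, the two means, the sample count, the degree, and the direction `v` are all hypothetical choices made for illustration.

```python
import itertools
import math
import random


def sample_mixture(means, n, rng):
    """Draw n points from a uniform mixture of identity-covariance Gaussians."""
    points, labels = [], []
    for _ in range(n):
        j = rng.randrange(len(means))  # pick a component uniformly at random
        points.append([m + rng.gauss(0.0, 1.0) for m in means[j]])
        labels.append(j)
    return points, labels


def implicit_moment(points, v, t):
    """Contract the empirical degree-t moment tensor with v in all t modes.

    Uses the identity <M_t, v^{(x)t}> = mean of <x, v>^t, so the cost is
    O(n * d) and no d^t-entry tensor is ever stored.
    """
    total = 0.0
    for x in points:
        total += sum(xi * vi for xi, vi in zip(x, v)) ** t
    return total / len(points)


def explicit_moment(points, v, t):
    """Same contraction via every entry of the tensor; feasible only for tiny d, t."""
    d = len(v)
    total = 0.0
    for idx in itertools.product(range(d), repeat=t):
        # One entry of the empirical moment tensor: mean of x[i1] * ... * x[it].
        entry = sum(math.prod(x[i] for i in idx) for x in points) / len(points)
        total += entry * math.prod(v[i] for i in idx)
    return total


rng = random.Random(0)
means = [[0.0, 0.0, 0.0], [6.0, 0.0, 0.0]]  # hypothetical means in d = 3
points, labels = sample_mixture(means, 200, rng)
v = [0.6, 0.8, 0.0]  # hypothetical unit direction
fast = implicit_moment(points, v, 4)
slow = explicit_moment(points, v, 4)
```

The two values `fast` and `slow` agree up to floating-point rounding. The explicit route visits all d^t tensor entries, while the implicit route needs only one inner product per sample.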


Source journal

SIAM Journal on Computing (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore

4.60

Self-citation rate

0.00%

Articles published

Review time

6-12 weeks

Journal description: The SIAM Journal on Computing aims to provide coverage of the most significant work going on in the mathematical and formal aspects of computer science and nonnumerical computing. Submissions must be clearly written and make a significant technical contribution. Topics include but are not limited to analysis and design of algorithms, algorithmic game theory, data structures, computational complexity, computational algebra, computational aspects of combinatorics and graph theory, computational biology, computational geometry, computational robotics, the mathematical aspects of programming languages, artificial intelligence, computational learning, databases, information retrieval, cryptography, networks, distributed computing, parallel algorithms, and computer architecture.