Clustering with Spectral Norm and the k-Means Algorithm

Amit Kumar, R. Kannan
{"title":"Clustering with Spectral Norm and the k-Means Algorithm","authors":"Amit Kumar, R. Kannan","doi":"10.1109/FOCS.2010.35","DOIUrl":null,"url":null,"abstract":"There has been much progress on efficient algorithms for clustering data points generated by a mixture of k probability distributions under the assumption that the means of the distributions are well-separated, i.e., the distance between the means of any two distributions is at least Omega(k) standard deviations. These results generally make heavy use of the generative model and particular properties of the distributions. In this paper, we show that a simple clustering algorithm works without assuming any generative (probabilistic) model. Our only assumption is what we call a \"proximity condition'': the projection of any data point onto the line joining its cluster center to any other cluster center is Omega(k) standard deviations closer to its own center than the other center. Here the notion of standard deviations is based on the spectral norm of the matrix whose rows represent the difference between a point and the mean of the cluster to which it belongs. We show that in the generative models studied, our proximity condition is satisfied and so we are able to derive most known results for generative models as corollaries of our main result. We also prove some new results for generative models - e.g., we can cluster all but a small fraction of points only assuming a bound on the variance. Our algorithm relies on the well known k-means algorithm, and along the way, we prove a result of independent interest – that the k-means algorithm converges to the \"true centers'' even in the presence of spurious points provided the initial (estimated) centers are close enough to the corresponding actual centers and all but a small fraction of the points satisfy the proximity condition. Finally, we present a new technique for boosting the ratio of inter-center separation to standard deviation. This allows us to prove results for learning certain mixture of distributions under weaker separation conditions.","PeriodicalId":228365,"journal":{"name":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","volume":"55 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"188","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 IEEE 51st Annual Symposium on Foundations of Computer Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FOCS.2010.35","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 188

Abstract

There has been much progress on efficient algorithms for clustering data points generated by a mixture of k probability distributions under the assumption that the means of the distributions are well-separated, i.e., the distance between the means of any two distributions is at least Ω(k) standard deviations. These results generally make heavy use of the generative model and particular properties of the distributions. In this paper, we show that a simple clustering algorithm works without assuming any generative (probabilistic) model. Our only assumption is what we call a "proximity condition": the projection of any data point onto the line joining its cluster center to any other cluster center is Ω(k) standard deviations closer to its own center than to the other center. Here the notion of standard deviations is based on the spectral norm of the matrix whose rows represent the difference between a point and the mean of the cluster to which it belongs. We show that our proximity condition is satisfied in the generative models studied, and so we are able to derive most known results for generative models as corollaries of our main result. We also prove some new results for generative models; e.g., we can cluster all but a small fraction of points assuming only a bound on the variance. Our algorithm relies on the well-known k-means algorithm, and along the way we prove a result of independent interest: the k-means algorithm converges to the "true centers" even in the presence of spurious points, provided the initial (estimated) centers are close enough to the corresponding actual centers and all but a small fraction of the points satisfy the proximity condition. Finally, we present a new technique for boosting the ratio of inter-center separation to standard deviation. This allows us to prove results for learning certain mixtures of distributions under weaker separation conditions.
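To make the proximity condition concrete, the following is one way to state it formally, assembled from the abstract's description; the constant c and the exact per-cluster terms are illustrative assumptions rather than the paper's verbatim statement. Let A be the n × d matrix of data points, let C be the matrix whose i-th row is the center of the cluster containing the i-th point, and let n_r denote the size of cluster T_r. For each pair of clusters r ≠ s, define

\[
  \Delta_{rs} \;=\; c\,k\left(\frac{1}{\sqrt{n_r}} + \frac{1}{\sqrt{n_s}}\right)\lVert A - C\rVert .
\]

A point A_i in cluster T_r satisfies the proximity condition if, for every s ≠ r, its projection onto the line joining the centers \(\mu_r\) and \(\mu_s\) is at least \(\Delta_{rs}\) closer to \(\mu_r\) than to \(\mu_s\). The spectral norm \(\lVert A - C\rVert\) plays the role of the "standard deviation" in the informal statement above.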
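The abstract's remark that the algorithm "relies on the well-known k-means algorithm" suggests a two-stage recipe: project the data onto its top-k singular subspace to obtain good initial centers, then run Lloyd's k-means iterations on the original points. The sketch below is a minimal numpy illustration of that recipe, not the paper's exact procedure; in particular, the random seeding in the projected space stands in for the constant-factor k-means approximation one would actually use there, and the function name spectral_kmeans is ours.

import numpy as np

def spectral_kmeans(A, k, n_iters=50, seed=0):
    """Sketch of the two-stage recipe: project onto the top-k singular
    subspace to seed centers, then run Lloyd's iterations in full space.
    Illustrative only; the seeding below is a stand-in for the paper's
    constant-factor k-means approximation step."""
    rng = np.random.default_rng(seed)
    mean = A.mean(axis=0)
    # Best rank-k approximation of the centered data, via SVD.
    U, S, Vt = np.linalg.svd(A - mean, full_matrices=False)
    A_hat = (U[:, :k] * S[:k]) @ Vt[:k] + mean
    # Seed centers from the projected points (random distinct rows).
    centers = A_hat[rng.choice(len(A_hat), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every original point to its nearest current center.
        d2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each center as the mean of its assigned points,
        # keeping the old center if a cluster ends up empty.
        new_centers = np.array([
            A[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

On data satisfying the proximity condition (e.g., a well-separated Gaussian mixture), the paper's convergence result says the Lloyd's iterations in the second stage move to the true centers once the initial estimates are close enough and all but a small fraction of the points satisfy the condition.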