Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

Michael B. Cohen, Sam Elder, Cameron Musco, C. Musco, Madalina Persu
{"title":"k均值聚类和低秩逼近的降维方法","authors":"Michael B. Cohen, Sam Elder, Cameron Musco, C. Musco, Madalina Persu","doi":"10.1145/2746539.2746569","DOIUrl":null,"url":null,"abstract":"We show how to approximate a data matrix A with a much smaller sketch ~A that can be used to solve a general class of constrained k-rank approximation problems to within (1+ε) error. Importantly, this class includes k-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just O(k) dimensions, we generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For k-means dimensionality reduction, we provide (1+ε) relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only 'cover' a good subspace for A}, but can be used directly to compute this subspace. Finally, for k-means clustering, we show how to achieve a (9+ε) approximation by Johnson-Lindenstrauss projecting data to just O(log k/ε2) dimensions. This is the first result that leverages the specific structure of k-means to achieve dimension independent of input size and sublinear in k.","PeriodicalId":20566,"journal":{"name":"Proceedings of the forty-seventh annual ACM symposium on Theory of Computing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2014-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"330","resultStr":"{\"title\":\"Dimensionality Reduction for k-Means Clustering and Low Rank Approximation\",\"authors\":\"Michael B. Cohen, Sam Elder, Cameron Musco, C. Musco, Madalina Persu\",\"doi\":\"10.1145/2746539.2746569\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We show how to approximate a data matrix A with a much smaller sketch ~A that can be used to solve a general class of constrained k-rank approximation problems to within (1+ε) error. Importantly, this class includes k-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just O(k) dimensions, we generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For k-means dimensionality reduction, we provide (1+ε) relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only 'cover' a good subspace for A}, but can be used directly to compute this subspace. Finally, for k-means clustering, we show how to achieve a (9+ε) approximation by Johnson-Lindenstrauss projecting data to just O(log k/ε2) dimensions. 
This is the first result that leverages the specific structure of k-means to achieve dimension independent of input size and sublinear in k.\",\"PeriodicalId\":20566,\"journal\":{\"name\":\"Proceedings of the forty-seventh annual ACM symposium on Theory of Computing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-10-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"330\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the forty-seventh annual ACM symposium on Theory of Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2746539.2746569\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the forty-seventh annual ACM symposium on Theory of Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2746539.2746569","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 330

Abstract

We show how to approximate a data matrix A with a much smaller sketch Ã that can be used to solve a general class of constrained k-rank approximation problems to within (1+ε) error. Importantly, this class includes k-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just O(k) dimensions, we generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For k-means dimensionality reduction, we provide (1+ε) relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only 'cover' a good subspace for A, but can be used directly to compute this subspace. Finally, for k-means clustering, we show how to achieve a (9+ε) approximation by Johnson-Lindenstrauss projecting data to just O(log k/ε²) dimensions. This is the first result that leverages the specific structure of k-means to achieve dimension independent of input size and sublinear in k.
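
The abstract names several concrete sketching routes to the same end: compress the n×d data matrix first, then run any k-means algorithm on the compressed points. Below is a minimal sketch of two of them in Python with numpy and scikit-learn, on synthetic data. It is illustrative only, not the paper's algorithms: the constants, the synthetic data, and the use of an exact rank-k SVD in place of the paper's approximate SVD are all assumptions made for the example.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact algorithms)
# of two dimensionality-reduction routes for k-means discussed in the abstract:
# (1) Johnson-Lindenstrauss random projection to O(log k / eps^2) dimensions,
# (2) an SVD-based reduction to O(k) dimensions (exact SVD stands in for the
#     paper's approximate SVD).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
n, d, k, eps = 2000, 500, 10, 0.5  # illustrative sizes and accuracy parameter

# Synthetic data: k well-separated Gaussian clusters in d dimensions.
centers = rng.normal(scale=10.0, size=(k, d))
A = centers[rng.integers(k, size=n)] + rng.normal(size=(n, d))

def kmeans_cost(X, labels):
    """Sum of squared distances to assigned cluster means, evaluated in the
    ORIGINAL space so costs are comparable across sketches."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

# Baseline: k-means on the full data.
base = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A)
print("full data cost:", kmeans_cost(A, base.labels_))

# (1) JL projection to ~log(k)/eps^2 dimensions -> (9+eps)-type guarantee.
m_jl = max(1, int(np.ceil(np.log(k) / eps**2)))
A_jl = GaussianRandomProjection(n_components=m_jl,
                                random_state=0).fit_transform(A)
jl = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A_jl)
print("JL sketch cost:", kmeans_cost(A, jl.labels_))

# (2) SVD-based reduction: project onto the top-k right singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_svd = A @ Vt[:k].T  # n x k sketch
sv = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A_svd)
print("SVD sketch cost:", kmeans_cost(A, sv.labels_))
```

Because all three clusterings are costed in the original space, the printed values are directly comparable. The paper's (1+ε) and (9+ε) guarantees bound how far the sketched solutions' costs can drift from the optimum; the O(k) and O(log k/ε²) target dimensions hide ε-dependent constants, so the exact n_components choices above are rough stand-ins.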