More Constraints, Smaller Coresets: Constrained Matrix Approximation of Sparse Big Data
Dan Feldman, Tamir Tassa
DOI: 10.1145/2783258.2783312
Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Published: 2015-08-10
Citations: 22
Abstract
We suggest a generic data reduction technique with provable guarantees for computing the low-rank approximation of a matrix under some $\ell_z$ error, as well as constrained factorizations such as Non-negative Matrix Factorization (NMF). Our main algorithm reduces a given n x d matrix into a small, ε-dependent, weighted subset C of its rows (known as a coreset), whose size is independent of both n and d. We then prove that applying existing algorithms to the resulting coreset yields (1+ε)-approximations for the original (large) input matrix. In particular, we provide the first linear time approximation scheme (LTAS) for the rank-one NMF. The coreset C can be computed in parallel, using only one pass over a possibly unbounded stream of row vectors. In this sense we improve the result in [4] (Best paper of STOC 2013). Moreover, since C is a subset of the input rows, its construction time, its sparsity (number of non-zero entries), and the sparsity of the resulting low-rank approximation all depend on the maximum sparsity of an input row, and not on the actual dimension d. In this sense, we improve the result of Liberty [21] (Best paper of KDD 2013) and answer affirmatively, in a more general setting, his open question of computing such a coreset. Source code is provided for reproducing the experiments and for integration with existing and future algorithms.
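To illustrate the general idea of a weighted row-subset coreset described above, here is a minimal sketch of generic norm-proportional row sampling in NumPy. This is an illustration only, not the paper's construction: the paper's coreset is ε-dependent with provable (1+ε) guarantees, while the simple i.i.d. sampler below (function name `norm_sampling_coreset` is our own) only preserves the matrix's second-moment structure in expectation.

```python
import numpy as np

def norm_sampling_coreset(A, m, rng=None):
    """Return m reweighted rows of A, sampled with probability
    proportional to squared row norms (a standard sketching heuristic;
    NOT the paper's coreset construction)."""
    rng = np.random.default_rng(rng)
    norms = np.einsum("ij,ij->i", A, A)        # squared row norms
    p = norms / norms.sum()                    # sampling distribution
    idx = rng.choice(A.shape[0], size=m, p=p)  # i.i.d. row draws
    w = 1.0 / np.sqrt(m * p[idx])              # weights so E[C^T C] = A^T A
    return w[:, None] * A[idx]                 # weighted coreset rows

# Usage: the coreset's top singular value approximates that of the full matrix.
A = np.random.default_rng(0).standard_normal((5000, 50))
C = norm_sampling_coreset(A, m=500, rng=1)
s_full = np.linalg.svd(A, compute_uv=False)[0]
s_core = np.linalg.svd(C, compute_uv=False)[0]
```

Because each picked row is a (reweighted) row of A itself, the sketch inherits the row sparsity of the input, which is the property the abstract highlights for the actual coreset C.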