Learning mixtures of arbitrary distributions over large discrete domains

Proceedings of the 5th Conference on Innovations in Theoretical Computer Science | Pub Date: 2012-12-06 | DOI: 10.1145/2554797.2554818

Y. Rabani, L. Schulman, Chaitanya Swamy


Citations: 34

Abstract

We give an algorithm for learning a mixture of unstructured distributions. This problem arises in various unsupervised learning scenarios, for example in learning topic models from a corpus of documents spanning several topics. We show how to learn the constituents of a mixture of k arbitrary distributions over a large discrete domain [n] = {1, 2, ..., n}, together with the mixture weights, using O(n polylog n) samples. (In the topic-model learning setting, the mixture constituents correspond to the topic distributions.) This task is information-theoretically impossible for k > 1 under the usual sampling process from a mixture distribution. However, there are situations (such as the above-mentioned topic-model case) in which each sample point consists of several observations from the same mixture constituent. This number of observations, which we call the "sampling aperture", is a crucial parameter of the problem. We obtain the first bounds for this mixture-learning problem without imposing any assumptions on the mixture constituents. We show that efficient learning is possible exactly at the information-theoretically least-possible aperture of 2k-1. Thus, we achieve near-optimal dependence on n and optimal aperture. While the sample size required by our algorithm depends exponentially on k, we prove that such a dependence is unavoidable when one considers general mixtures. A sequence of tools contributes to the algorithm, such as concentration results for random matrices, dimension reduction, moment estimation, and sensitivity analysis.
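The sampling process described in the abstract can be illustrated with a small sketch. This is not the paper's learning algorithm; it only simulates the sampling model and shows, on a hypothetical toy example with k = 2 and n = 2, why single observations (aperture 1) cannot identify a mixture. The function names (`draw_sample`, `marginal`, `agree_prob`) and the two toy mixtures are assumptions made for illustration.

```python
import random

def draw_sample(weights, constituents, aperture, rng):
    """Pick one constituent by mixture weight, then draw `aperture`
    i.i.d. observations from that same constituent."""
    j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    p = constituents[j]
    return tuple(rng.choices(range(len(p)), weights=p, k=aperture))

def marginal(weights, constituents):
    """Distribution of a single observation (aperture 1)."""
    n = len(constituents[0])
    return [sum(w * p[i] for w, p in zip(weights, constituents))
            for i in range(n)]

def agree_prob(weights, constituents):
    """Probability that two observations in one sample coincide (aperture 2)."""
    return sum(w * sum(q * q for q in p)
               for w, p in zip(weights, constituents))

# Two hypothetical mixtures with k = 2 constituents over a domain of size 2.
mix_a = ([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
mix_b = ([0.5, 0.5], [[0.75, 0.25], [0.25, 0.75]])

# Aperture 1: both mixtures induce the same distribution on single observations.
print(marginal(*mix_a), marginal(*mix_b))

# Aperture 2: the second moments already differ for this particular pair.
print(agree_prob(*mix_a), agree_prob(*mix_b))

# One sample at the aperture 2k-1 = 3 named in the abstract.
rng = random.Random(0)
print(draw_sample(*mix_a, aperture=3, rng=rng))
```

This toy pair happens to be separated by second moments alone; the abstract's claim is that, for general mixtures of k constituents, aperture 2k-1 is the least at which learning is information-theoretically possible.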



