{"title":"Sub-Gibbs Sampling: A New Strategy for Inferring LDA","authors":"Chuan Hu, H. Cao, Qixu Gong","doi":"10.1109/ICDM.2017.113","DOIUrl":null,"url":null,"abstract":"Latent Dirichlet Allocation (LDA) has been widely used in text mining to discover topics from documents. One major approach to learn LDA is Gibbs sampling. The basic Collapsed Gibbs Sampling (CGS) algorithm requires O(NZ) computations to learn an LDA model with Z topics from a corpus containing N tokens. Existing approaches that improve the complexity of CGS focus on reducing the factor Z. In this work, we propose a novel and general Sub-Gibbs Sampling (SGS) strategy to improve the Gibbs-Sampling computation by reducing the sample space. This new strategy targets at reducing the factor N by sampling only a subset of the whole corpus. The design of the SGS strategy is based on two properties that we observe: (i) topic distributions of tokens are skewed and (ii) a subset of documents can approximately represent the semantics of the whole corpus. We prove that the SGS strategy can achieve comparable effectiveness (with bounded errors) and significantly reduce the complexity of existing Gibbs sampling algorithms. Extensive experiments on large real-world data sets show that the proposed SGS strategy is much faster than several state-of-the-art fast Gibbs sampling algorithms and the proposed SGS strategy can learn comparable LDA models as other Gibbs sampling algorithms.","PeriodicalId":254086,"journal":{"name":"2017 IEEE International Conference on Data Mining (ICDM)","volume":"298 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Conference on Data Mining (ICDM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM.2017.113","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
Latent Dirichlet Allocation (LDA) has been widely used in text mining to discover topics in documents. One major approach to learning LDA is Gibbs sampling. The basic Collapsed Gibbs Sampling (CGS) algorithm requires O(NZ) computations to learn an LDA model with Z topics from a corpus containing N tokens. Existing approaches that improve the complexity of CGS focus on reducing the factor Z. In this work, we propose a novel and general Sub-Gibbs Sampling (SGS) strategy that speeds up Gibbs-sampling computation by shrinking the sample space. This strategy aims to reduce the factor N by sampling only a subset of the whole corpus. The design of the SGS strategy is based on two properties that we observe: (i) the topic distributions of tokens are skewed, and (ii) a subset of documents can approximately represent the semantics of the whole corpus. We prove that the SGS strategy achieves comparable effectiveness (with bounded errors) while significantly reducing the complexity of existing Gibbs sampling algorithms. Extensive experiments on large real-world data sets show that the proposed SGS strategy is much faster than several state-of-the-art fast Gibbs sampling algorithms and learns LDA models of quality comparable to those produced by other Gibbs sampling algorithms.
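To make the O(NZ) cost concrete, the sketch below shows a minimal collapsed Gibbs sweep for LDA in Python with an optional token-level sub-sampling knob. It is only an illustration of the general setting under stated assumptions: the function name, the `token_subsample` parameter, and the uniform skipping rule are hypothetical and do not implement the paper's SGS strategy, which chooses the subset based on the skewed topic distributions of tokens and on a representative subset of documents.

```python
import numpy as np

def collapsed_gibbs_lda(docs, Z, alpha=0.1, beta=0.01, iters=50,
                        token_subsample=1.0, rng=None):
    """Minimal collapsed Gibbs sampler for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids.
    Z: number of topics.
    token_subsample: fraction of tokens resampled per sweep.
        1.0 reproduces plain CGS; values < 1.0 mimic the idea of
        sampling only a subset of the corpus (here chosen uniformly
        at random, unlike the paper's SGS strategy).
    """
    rng = rng or np.random.default_rng(0)
    V = 1 + max(w for doc in docs for w in doc)   # vocabulary size

    # Random initial topic assignment per token, plus count matrices.
    z = [rng.integers(Z, size=len(doc)) for doc in docs]
    ndk = np.zeros((len(docs), Z))                # document-topic counts
    nkw = np.zeros((Z, V))                        # topic-word counts
    nk = np.zeros(Z)                              # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Sub-sampling: skip a fraction of tokens this sweep,
                # so each sweep touches roughly token_subsample * N tokens.
                if rng.random() > token_subsample:
                    continue
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Standard collapsed conditional p(z_i = k | rest),
                # evaluated for all Z topics -- the source of the Z factor.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(Z, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw
```

Each full sweep costs O(NZ) because every retained token evaluates all Z topic probabilities; reducing the number of tokens visited per sweep (the factor N) is the lever the SGS strategy exploits, while prior work focuses on reducing the per-token factor Z.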