Sub-Gibbs Sampling: A New Strategy for Inferring LDA

Chuan Hu, H. Cao, Qixu Gong
{"title":"Sub-Gibbs Sampling: A New Strategy for Inferring LDA","authors":"Chuan Hu, H. Cao, Qixu Gong","doi":"10.1109/ICDM.2017.113","DOIUrl":null,"url":null,"abstract":"Latent Dirichlet Allocation (LDA) has been widely used in text mining to discover topics from documents. One major approach to learn LDA is Gibbs sampling. The basic Collapsed Gibbs Sampling (CGS) algorithm requires O(NZ) computations to learn an LDA model with Z topics from a corpus containing N tokens. Existing approaches that improve the complexity of CGS focus on reducing the factor Z. In this work, we propose a novel and general Sub-Gibbs Sampling (SGS) strategy to improve the Gibbs-Sampling computation by reducing the sample space. This new strategy targets at reducing the factor N by sampling only a subset of the whole corpus. The design of the SGS strategy is based on two properties that we observe: (i) topic distributions of tokens are skewed and (ii) a subset of documents can approximately represent the semantics of the whole corpus. We prove that the SGS strategy can achieve comparable effectiveness (with bounded errors) and significantly reduce the complexity of existing Gibbs sampling algorithms. Extensive experiments on large real-world data sets show that the proposed SGS strategy is much faster than several state-of-the-art fast Gibbs sampling algorithms and the proposed SGS strategy can learn comparable LDA models as other Gibbs sampling algorithms.","PeriodicalId":254086,"journal":{"name":"2017 IEEE International Conference on Data Mining (ICDM)","volume":"298 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Conference on Data Mining (ICDM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM.2017.113","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Latent Dirichlet Allocation (LDA) has been widely used in text mining to discover topics in documents. One major approach to learning LDA is Gibbs sampling. The basic Collapsed Gibbs Sampling (CGS) algorithm requires O(NZ) computations to learn an LDA model with Z topics from a corpus containing N tokens. Existing approaches to reducing the complexity of CGS focus on shrinking the factor Z. In this work, we propose a novel and general Sub-Gibbs Sampling (SGS) strategy that speeds up Gibbs-sampling computation by reducing the sample space: it targets the factor N by sampling only a subset of the whole corpus. The design of the SGS strategy is based on two properties that we observe: (i) the topic distributions of tokens are skewed, and (ii) a subset of documents can approximately represent the semantics of the whole corpus. We prove that the SGS strategy achieves comparable effectiveness (with bounded errors) while significantly reducing the complexity of existing Gibbs sampling algorithms. Extensive experiments on large real-world data sets show that SGS is much faster than several state-of-the-art fast Gibbs sampling algorithms and learns LDA models comparable to those learned by other Gibbs sampling algorithms.
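To make the O(NZ) complexity claim concrete, below is a minimal collapsed Gibbs sampler for LDA in Python. This is a sketch, not the authors' implementation: the `subsample` parameter is a hypothetical stand-in that re-samples only a uniform fraction of tokens per sweep to illustrate how shrinking the factor N cuts per-sweep cost, whereas the paper's SGS chooses its subset using the skewed token-topic distributions and a representative subset of documents.

```python
import numpy as np


def collapsed_gibbs_lda(docs, Z, alpha=0.1, beta=0.01, iters=50,
                        vocab_size=None, subsample=1.0, rng=None):
    """Minimal collapsed Gibbs sampler for LDA.

    docs      : list of documents, each a list of integer word ids.
    Z         : number of topics.
    subsample : fraction of tokens re-sampled per sweep. 1.0 gives plain
                CGS, which costs O(N*Z) per sweep over N tokens; values
                below 1.0 shrink the factor N, the quantity SGS targets
                (here via uniform sampling, a deliberate simplification).
    """
    rng = rng or np.random.default_rng(0)
    V = vocab_size or (max(max(d) for d in docs) + 1)

    # Flatten the corpus into (doc_id, word_id) pairs, one per token.
    tokens = [(d, w) for d, doc in enumerate(docs) for w in doc]
    N = len(tokens)

    z = rng.integers(0, Z, size=N)                   # topic of each token
    ndz = np.zeros((len(docs), Z), dtype=np.int64)   # doc-topic counts
    nzw = np.zeros((Z, V), dtype=np.int64)           # topic-word counts
    nz = np.zeros(Z, dtype=np.int64)                 # tokens per topic
    for i, (d, w) in enumerate(tokens):
        ndz[d, z[i]] += 1
        nzw[z[i], w] += 1
        nz[z[i]] += 1

    for _ in range(iters):
        # Plain CGS visits all N tokens; visiting only a subset is the
        # core of the SGS idea of reducing the sample space.
        idx = rng.choice(N, size=max(1, int(subsample * N)), replace=False)
        for i in idx:
            d, w = tokens[i]
            k = z[i]
            # Remove the token's current assignment from the counts.
            ndz[d, k] -= 1
            nzw[k, w] -= 1
            nz[k] -= 1
            # Collapsed full conditional p(z_i = k | z_-i, w), up to a constant.
            p = (ndz[d] + alpha) * (nzw[:, w] + beta) / (nz + V * beta)
            k = rng.choice(Z, p=p / p.sum())
            z[i] = k
            ndz[d, k] += 1
            nzw[k, w] += 1
            nz[k] += 1
    return ndz, nzw


# Toy usage: three tiny documents over a 4-word vocabulary, 2 topics.
docs = [[0, 0, 1], [2, 3, 3], [0, 1, 2]]
ndz, nzw = collapsed_gibbs_lda(docs, Z=2, subsample=0.5)
print(ndz)  # per-document topic counts after sampling
```

With `subsample=0.5`, each sweep touches half the tokens, so per-sweep cost drops roughly in half; the paper's contribution is showing that a carefully chosen (rather than uniform) subset keeps the learned model's error bounded.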