Stochastic Variational Optimization of a Hierarchical Dirichlet Process Latent Beta-Liouville Topic Model

Koffi Eddy Ihou, Manar Amayri, N. Bouguila
ACM Transactions on Knowledge Discovery from Data (TKDD) · DOI: 10.1145/3502727 · Published 2022-03-10 · Journal Article
Citations: 1

Abstract

In topic models, a collection is organized as a set of documents, each arising as a mixture over latent clusters called topics, where a topic is a distribution over the vocabulary. In large-scale applications, parametric or finite topic mixture models such as LDA (latent Dirichlet allocation) and its variants are severely restricted in performance by their reduced hypothesis space. In this article, we address the problems of model selection and of sharing topics across multiple documents that affect standard parametric topic models. As an alternative, we propose a BNP (Bayesian nonparametric) topic model in which an HDP (hierarchical Dirichlet process) prior models document topic mixtures through their multinomials on the infinite simplex. We further propose the asymmetric BL (Beta-Liouville) distribution as a diffuse base measure for the corpus-level DP (Dirichlet process) over a measurable space. This step reflects the highly heterogeneous structure of the set of all topics that describes the corpus probability measure. For consistency in posterior inference and predictive distributions, we efficiently characterize random probability measures whose limits are the global and local DPs, approximating the HDP through the stick-breaking formulation with GEM (Griffiths-Engen-McCloskey) random variables. Because the diffuse BL prior is conjugate to the count data distribution, we obtain an improved version of the standard HDP, which is usually based on the symmetric Dirichlet (Dir). In addition, to improve on the coordinate ascent framework while retaining its deterministic nature, our model implements an online optimization method based on stochastic variational inference at the document level, using the natural gradient to accommodate fast topic learning when processing large collections of text documents.
The high per-document predictive likelihood obtained, compared with the performance of its competitors, is also consistent with the robustness of our fully asymmetric BL-based HDP. While ensuring the predictive accuracy of the model via the probability of held-out documents, we also add a combination of metrics such as topic coherence and topic diversity to improve the quality and interpretability of the discovered topics. We compare the performance of our model on these metrics against the standard symmetric LDA. We show that the performance of the online HDP-LBLA (latent BL allocation) is an asymptote for parametric topic models. The accuracy of the results (improved predictive distributions on the held-out documents) stems from the model's ability to efficiently characterize dependencies between documents (topic correlation), as they can now easily share topics, resulting in a much more robust and realistic compression algorithm for information modeling.
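The stick-breaking formulation with GEM random variables mentioned in the abstract can be illustrated concretely. The sketch below (not the authors' implementation; the function name and the truncation scheme are assumptions for this example) draws truncated stick-breaking weights that approximate a Dirichlet process: each beta_k ~ Beta(1, alpha), and the k-th atom receives the fraction beta_k of the stick length remaining after the first k-1 breaks.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Draw truncated GEM(alpha) stick-breaking weights.

    Each beta_k ~ Beta(1, alpha); the k-th weight is
    beta_k * prod_{j<k} (1 - beta_j).  Truncating at a finite
    `truncation` approximates the infinite DP mixture.
    """
    betas = rng.beta(1.0, alpha, size=truncation)
    # Length of stick remaining before each break: 1, (1-b1), (1-b1)(1-b2), ...
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining
    # Assign the leftover mass to the last atom so the weights sum to 1.
    weights[-1] += 1.0 - weights.sum()
    return weights

rng = np.random.default_rng(0)
w = stick_breaking_weights(alpha=5.0, truncation=50, rng=rng)
```

A larger concentration `alpha` spreads mass over more atoms, which is why the HDP can let the data decide the effective number of topics instead of fixing it in advance as LDA does.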
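The abstract evaluates topics with a combination of held-out predictive likelihood, topic coherence, and topic diversity. As a sketch of the diversity side only, a commonly used definition is the fraction of unique words among the top-n words of all topics; the helper below is illustrative and not taken from the paper:

```python
def topic_diversity(topics, top_n=25):
    """Fraction of unique words among the top-n words of all topics.

    `topics` is a list of per-topic word lists, each ranked by
    within-topic probability.  A value near 1.0 indicates topics
    that rarely share their most probable words.
    """
    top_words = [w for topic in topics for w in topic[:top_n]]
    return len(set(top_words)) / len(top_words)
```

For example, two topics with disjoint top words score 1.0, while two identical topics score 0.5, so the metric directly penalizes redundant topics.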