Big topic modeling based on a two-level hierarchical latent Beta-Liouville allocation for large-scale data and parameter streaming

IF 3.7 4区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Analysis and Applications Pub Date : 2024-02-28 DOI:10.1007/s10044-024-01213-y

Koffi Eddy Ihou, Nizar Bouguila

{"title":"Big topic modeling based on a two-level hierarchical latent Beta-Liouville allocation for large-scale data and parameter streaming","authors":"Koffi Eddy Ihou, Nizar Bouguila","doi":"10.1007/s10044-024-01213-y","DOIUrl":null,"url":null,"abstract":"<p>As an extension to the standard symmetric latent Dirichlet allocation topic model, we implement asymmetric Beta-Liouville as a conjugate prior to the multinomial and therefore propose the maximum a posteriori for latent Beta-Liouville allocation as an alternative to maximum likelihood estimator for models such as probabilistic latent semantic indexing, unigrams, and mixture of unigrams. Since most Bayesian posteriors, for complex models, are intractable in general, we propose a point estimate (the mode) that offers a much tractable solution. The maximum a posteriori hypotheses using point estimates are much easier than full Bayesian analysis that integrates over the entire parameter space. We show that the proposed maximum a posteriori reduces the three-level hierarchical latent Beta-Liouville allocation to two-level topic mixture as we marginalize out the latent variables. In each document, the maximum a posteriori provides a soft assignment and constructs dense expectation–maximization probabilities over each word (responsibilities) for accurate estimates. For simplicity, we present a stochastic at word-level online expectation–maximization algorithm as an optimization method for maximum a posteriori latent Beta-Liouville allocation estimation whose unnormalized reparameterization is equivalent to a stochastic collapsed variational Bayes. This implicit connection between the collapsed space and expectation–maximization-based maximum a posteriori latent Beta-Liouville allocation shows its flexibility and helps in providing alternative to model selection. We characterize efficiency in the proposed approach for its ability to simultaneously stream both large-scale data and parameters seamlessly. The performance of the model using predictive perplexities as evaluation method shows the robustness of the proposed technique with text document datasets.</p>","PeriodicalId":54639,"journal":{"name":"Pattern Analysis and Applications","volume":"80 1","pages":""},"PeriodicalIF":3.7000,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Analysis and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10044-024-01213-y","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

As an extension to the standard symmetric latent Dirichlet allocation topic model, we implement asymmetric Beta-Liouville as a conjugate prior to the multinomial and therefore propose the maximum a posteriori for latent Beta-Liouville allocation as an alternative to maximum likelihood estimator for models such as probabilistic latent semantic indexing, unigrams, and mixture of unigrams. Since most Bayesian posteriors, for complex models, are intractable in general, we propose a point estimate (the mode) that offers a much tractable solution. The maximum a posteriori hypotheses using point estimates are much easier than full Bayesian analysis that integrates over the entire parameter space. We show that the proposed maximum a posteriori reduces the three-level hierarchical latent Beta-Liouville allocation to two-level topic mixture as we marginalize out the latent variables. In each document, the maximum a posteriori provides a soft assignment and constructs dense expectation–maximization probabilities over each word (responsibilities) for accurate estimates. For simplicity, we present a stochastic at word-level online expectation–maximization algorithm as an optimization method for maximum a posteriori latent Beta-Liouville allocation estimation whose unnormalized reparameterization is equivalent to a stochastic collapsed variational Bayes. This implicit connection between the collapsed space and expectation–maximization-based maximum a posteriori latent Beta-Liouville allocation shows its flexibility and helps in providing alternative to model selection. We characterize efficiency in the proposed approach for its ability to simultaneously stream both large-scale data and parameters seamlessly. The performance of the model using predictive perplexities as evaluation method shows the robustness of the proposed technique with text document datasets.

Abstract Image

查看原文本刊更多论文

基于两级潜在 Beta-Liouville 分配的大主题建模，适用于大规模数据和参数流

作为标准对称潜狄利克特分配主题模型的扩展，我们将非对称贝塔-利乌维尔作为多项式的共轭先验来实现，并因此提出了潜贝塔-利乌维尔分配的最大后验，作为概率潜语义索引、单词和单词混合等模型的最大似然估计的替代方法。由于对于复杂模型来说，大多数贝叶斯后验一般都很难处理，因此我们提出了一种点估计（模式），提供了一种更易处理的解决方案。使用点估计的最大后验假设比整合整个参数空间的完全贝叶斯分析要容易得多。我们的研究表明，当我们将潜在变量边际化时，所提出的最大后验假设将三级分层潜在 Beta-Liouville 分配减少为两级主题混合。在每篇文档中，最大后验法都会提供软分配，并在每个词（责任）上构建密集的期望最大化概率，以获得准确的估计值。为简单起见，我们提出了一种词级随机在线期望最大化算法，作为最大后验潜变量 Beta-Liouville 分配估计的优化方法，其非规范化重参数化等同于随机坍塌变分贝叶斯。坍缩空间与基于期望最大化的最大后验潜在 Beta-Liouville 分配之间的这种隐含联系显示了它的灵活性，并有助于提供模型选择的替代方法。我们对所提出的方法的效率进行了描述，因为它能够同时无缝流式处理大规模数据和参数。使用预测复杂度作为评估方法的模型性能表明了所提技术在文本文档数据集上的鲁棒性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Pattern Analysis and Applications 工程技术-计算机：人工智能

CiteScore

7.40

自引率

2.60%

发文量

审稿时长

13.5 months

期刊介绍： The journal publishes high quality articles in areas of fundamental research in intelligent pattern analysis and applications in computer science and engineering. It aims to provide a forum for original research which describes novel pattern analysis techniques and industrial applications of the current technology. In addition, the journal will also publish articles on pattern analysis applications in medical imaging. The journal solicits articles that detail new technology and methods for pattern recognition and analysis in applied domains including, but not limited to, computer vision and image processing, speech analysis, robotics, multimedia, document analysis, character recognition, knowledge engineering for pattern recognition, fractal analysis, and intelligent control. The journal publishes articles on the use of advanced pattern recognition and analysis methods including statistical techniques, neural networks, genetic algorithms, fuzzy pattern recognition, machine learning, and hardware implementations which are either relevant to the development of pattern analysis as a research area or detail novel pattern analysis applications. Papers proposing new classifier systems or their development, pattern analysis systems for real-time applications, fuzzy and temporal pattern recognition and uncertainty management in applied pattern recognition are particularly solicited.