Bayesian Folding-In with Dirichlet Kernels for PLSI

Seventh IEEE International Conference on Data Mining (ICDM 2007). Pub Date: 2007-10-28. DOI: 10.1109/ICDM.2007.15

Alexander Hinneburg, H. Gabriel, André Gohr

Citations: 12

Abstract

Probabilistic latent semantic indexing (PLSI) represents the documents of a collection as mixture proportions of latent topics, which are learned from the collection by an expectation-maximization (EM) algorithm. New documents or queries must be folded into the latent topic space by a simplified version of the EM algorithm. During PLSI folding-in of a new document, the topic mixtures of the known documents are ignored, which may lead to a suboptimal model of the extended collection. Our new approach incorporates the topic mixtures of the known documents in a Bayesian way during folding-in. That knowledge is modeled as a prior distribution over the topic simplex using a kernel density estimate with Dirichlet kernels. We demonstrate the advantages of the new Bayesian folding-in using real text data.
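To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. `fold_in` shows standard PLSI folding-in: EM in which the topic-word distributions stay fixed and only the new document's topic mixture is updated. `log_kde_prior` evaluates a kernel density estimate over the topic simplex built from Dirichlet kernels centered on the known documents' mixtures. The kernel parameterization `scale * theta_i + 1` and the `scale` parameter are assumptions made for illustration; the abstract does not specify them, nor how the prior enters the EM update.

```python
import math

def fold_in(word_counts, p_w_given_z, n_iter=50):
    """Standard PLSI folding-in: p(w|z) is fixed, only p(z|d_new) is re-estimated.

    word_counts: dict mapping word index -> count in the new document
    p_w_given_z: list of per-topic word distributions (each sums to 1)
    """
    k = len(p_w_given_z)
    theta = [1.0 / k] * k  # start from a uniform topic mixture
    for _ in range(n_iter):
        expected = [0.0] * k
        for w, n in word_counts.items():
            # E-step: posterior over topics for this word
            joint = [theta[z] * p_w_given_z[z][w] for z in range(k)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # word has zero probability under every topic
            for z in range(k):
                expected[z] += n * joint[z] / norm
        total = sum(expected)
        if total == 0.0:
            break
        # M-step: renormalize expected topic counts
        theta = [e / total for e in expected]
    return theta

def log_dirichlet(x, alpha):
    """Log density of Dirichlet(alpha) at x (x strictly inside the simplex)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))

def log_kde_prior(x, known_mixtures, scale=20.0):
    """Log of the average of Dirichlet kernels, one per known document.

    Assumption: each kernel is Dirichlet(scale * theta_i + 1), whose mode is
    theta_i; `scale` is a hypothetical bandwidth-like parameter.
    """
    logs = [log_dirichlet(x, [scale * t + 1.0 for t in theta_i])
            for theta_i in known_mixtures]
    m = max(logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in logs)) - math.log(len(logs))

# Toy usage: two topics, three-word vocabulary (made-up numbers)
topics = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
theta_new = fold_in({0: 10}, topics)
known = [[0.9, 0.1], [0.8, 0.2]]
print(theta_new, log_kde_prior([0.85, 0.15], known))
```

Plain folding-in never looks at `known`; a Bayesian variant would combine the document likelihood with a prior such as `log_kde_prior` when estimating the new mixture.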

