Dirichlet Mixture Allocation for Multiclass Document Collections Modeling

2009 Ninth IEEE International Conference on Data Mining
Pub Date: 2009-12-06
DOI: 10.1109/ICDM.2009.102

Wei Bian, D. Tao

Citations: 5

Abstract

The topic model Latent Dirichlet Allocation (LDA) is an effective tool for the statistical analysis of large document collections. In LDA, each document is modeled as a mixture of topics, and the topic proportions are generated from a unimodal Dirichlet prior. When a collection of documents is drawn from multiple classes, this unimodal prior is insufficient to fit the data. To solve this problem, we exploit a multimodal Dirichlet mixture prior and propose Dirichlet mixture allocation (DMA). We report experiments on the popular TDT2 corpus demonstrating that DMA models a collection of documents more precisely than LDA when the documents are obtained from multiple classes.
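The contrast between the two priors can be illustrated with a minimal sketch. The abstract does not spell out DMA's full generative process, so the code below is an assumption: it takes the standard LDA document generator and replaces the single Dirichlet prior with a finite mixture of Dirichlets. All names (`sample_topic_proportions`, `generate_document`, `weights`, `alphas`, `topics`) and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np


def sample_topic_proportions(rng, weights, alphas):
    """Draw topic proportions from a Dirichlet mixture prior (assumed form).

    weights: mixing weights over the M mixture components, shape (M,)
    alphas:  Dirichlet parameters for each component, shape (M, K)
    """
    # Pick one mixture component; each component is one mode of the prior.
    component = rng.choice(len(weights), p=weights)
    # Draw the document's topic proportions from that component's Dirichlet.
    theta = rng.dirichlet(alphas[component])
    return component, theta


def generate_document(rng, weights, alphas, topics, n_words):
    """Generate word ids for one document, LDA-style, given theta."""
    component, theta = sample_topic_proportions(rng, weights, alphas)
    # Assign a topic to every word position according to theta.
    z = rng.choice(len(theta), size=n_words, p=theta)
    # Draw each word from the word distribution of its assigned topic.
    vocab_size = topics.shape[1]
    words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
    return component, theta, words


rng = np.random.default_rng(0)
# Two hypothetical document classes, each favoring a different topic.
weights = np.array([0.5, 0.5])
alphas = np.array([[8.0, 1.0, 1.0],
                   [1.0, 1.0, 8.0]])
# Three hypothetical topics over a 20-word vocabulary.
topics = rng.dirichlet(np.ones(20), size=3)

component, theta, words = generate_document(
    rng, weights, alphas, topics, n_words=50
)
```

With a single mixture component (`weights = np.array([1.0])`), this sketch reduces to drawing topic proportions from one Dirichlet, which is the unimodal prior the abstract attributes to LDA.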



