Dirichlet Mixture Allocation for Multiclass Document Collections Modeling

2009 Ninth IEEE International Conference on Data Mining. Pub Date: 2009-12-06. DOI: 10.1109/ICDM.2009.102

Wei Bian, D. Tao


Citations: 5

Abstract

Latent Dirichlet Allocation (LDA), a topic model, is an effective tool for the statistical analysis of large document collections. In LDA, each document is modeled as a mixture of topics, and the topic proportions are generated from a unimodal Dirichlet prior. When a collection of documents is drawn from multiple classes, this unimodal prior is insufficient to fit the data. To solve this problem, we exploit a multimodal Dirichlet mixture prior and propose Dirichlet mixture allocation (DMA). We report experiments on the popular TDT2 corpus demonstrating that DMA models a collection of documents more precisely than LDA when the documents come from multiple classes.
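The contrast between the two priors can be illustrated with a minimal sampling sketch. It only draws topic proportions from a single Dirichlet and from a Dirichlet mixture; it is not the authors' model or inference procedure, which the abstract does not detail. The function names, the three-topic setup, and all parameter values (`alphas`, `weights`) are hypothetical and chosen purely for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_lda_theta(alpha, rng):
    """Topic proportions under a single (unimodal) Dirichlet prior, as in LDA."""
    return sample_dirichlet(alpha, rng)


def sample_mixture_theta(weights, alphas, rng):
    """Topic proportions under a Dirichlet mixture prior.

    First pick a mixture component, then draw from that component's Dirichlet.
    Returns the component index together with the sampled proportions.
    """
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return k, sample_dirichlet(alphas[k], rng)


rng = random.Random(0)

# Hypothetical setup: three document classes, each concentrated on one topic.
alphas = [[20.0, 1.0, 1.0], [1.0, 20.0, 1.0], [1.0, 1.0, 20.0]]
weights = [0.3, 0.3, 0.4]

# Mixture prior: samples cluster around three different corners of the simplex.
mixture_draws = [sample_mixture_theta(weights, alphas, rng) for _ in range(500)]

# Single Dirichlet prior: samples form one cluster around the simplex center.
lda_draws = [sample_lda_theta([2.0, 2.0, 2.0], rng) for _ in range(500)]
```

Under these assumed parameters, the mixture draws form several well-separated groups of topic proportions, one per class, which a single Dirichlet cannot reproduce with one mode.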
