PTM: probabilistic topic mapping model for mining parallel document collections

Proceedings of the 19th ACM International Conference on Information and Knowledge Management. Pub Date: 2010-10-26. DOI: 10.1145/1871437.1871696

Duo Zhang, Jimeng Sun, ChengXiang Zhai, A. Bose, Nikos Anerousis

Citations: 7

Abstract

Many applications generate a large volume of parallel document collections. A parallel document collection consists of two sets of documents in which each document in one set corresponds to a document in the other, so that the two form a semantic pair (e.g., pairs of problem and solution descriptions in a help-desk setting). Although much work has been done on text mining, little previous work has attempted to mine this novel kind of text data. In this paper, we propose a new probabilistic topic model, called the Probabilistic Topic Mapping (PTM) model, which mines parallel document collections to simultaneously discover the latent topics in both sets of documents and the mapping from topics in one set to topics in the other. We evaluate the PTM model on one real parallel document collection in the IT service domain. We show that PTM can effectively discover meaningful topics and their mappings, and that it is also useful for improving text matching and retrieval when there is a vocabulary gap.



