Modeling multiword phrases with constrained phrase trees for improved topic modeling of conversational speech

2012 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2012-12-01. DOI: 10.1109/SLT.2012.6424226

Timothy J. Hazen, Fred Richardson

Citations: 5

Abstract

Latent topic modeling has proven to be an effective means of learning the underlying semantic content of document collections. It has traditionally been applied to bag-of-words representations, which ignore word sequence information that can aid semantic understanding. In this work we introduce a method for efficiently incorporating arbitrarily long word sequences into a topic modeling approach. The method iteratively constructs a constrained set of phrase trees from a document collection in an unsupervised fashion, using weighted pointwise mutual information statistics to guide the process. In experiments on the Fisher Corpus of conversational speech, incorporating the learned phrases into a latent topic model yielded significant improvements in the unsupervised discovery of the known topics present in the data.
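The abstract does not give the exact scoring formula or the phrase tree constraints, so the following is only a rough sketch of the general idea. It assumes one common form of weighted pointwise mutual information, p(x,y) * log(p(x,y) / (p(x) p(y))), and substitutes a simple greedy merge of the best-scoring adjacent pair for the paper's constrained phrase trees. The function names, the `min_count` threshold, and the underscore joining are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def weighted_pmi_bigrams(docs, min_count=2):
    """Score adjacent token pairs by an assumed weighted PMI:
    p(x,y) * log(p(x,y) / (p(x) * p(y))). Not the paper's exact statistic."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = p_xy * math.log(p_xy / (p_x * p_y))
    return scores


def merge_best_phrase(docs, min_count=2):
    """Merge the single best-scoring pair into one token (hypothetical
    greedy step). Returns the rewritten documents and the merged pair,
    or the documents unchanged and None if no pair scores above zero."""
    scores = weighted_pmi_bigrams(docs, min_count)
    if not scores:
        return docs, None
    best = max(scores, key=scores.get)
    if scores[best] <= 0:
        return docs, None
    x, y = best
    merged_docs = []
    for doc in docs:
        tokens = doc.lower().split()
        out = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == x and tokens[i + 1] == y:
                out.append(x + "_" + y)  # the pair becomes one phrase token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged_docs.append(" ".join(out))
    return merged_docs, best


docs = [
    "i live in new york city",
    "new york city is big",
    "she likes new york",
    "the city is big",
]
merged_docs, best_pair = merge_best_phrase(docs)
```

Because a merged phrase becomes an ordinary token, calling `merge_best_phrase` repeatedly on its own output can grow phrases of arbitrary length, which is the property the abstract highlights. The resulting phrase tokens could then be fed to a topic model alongside single words.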

