Latent Dirichlet learning for hierarchical segmentation

2012 IEEE International Workshop on Machine Learning for Signal Processing
Publication date: 2012-11-12
DOI: 10.1109/MLSP.2012.6349772

Jen-Tzung Chien, C. Chueh

Abstract

A topic model can be built by using Dirichlet distributions as priors to characterize latent topics in natural language. However, topics in real-world stream data are non-stationary, which makes training a reliable topic model challenging. Furthermore, word usage varies across the paragraphs of a document because of differences in composition style. This study presents a hierarchical segmentation model that compensates for heterogeneous topics at the stream level and heterogeneous words at the document level. The topic similarity between sentences is calculated to form a beta prior for stream-level segmentation, and this segmentation prior is used to group topic-coherent sentences into a document. Within each such pseudo-document, a Markov chain is incorporated to detect stylistic segments. The words in a segment are generated by the same composition style. The new model is inferred by a variational Bayesian EM procedure. Experimental results show the benefits of the proposed model in terms of perplexity and F-measure.
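The stream-level step described in the abstract (topic similarity between sentences feeding a beta prior that groups topic-coherent sentences) can be illustrated with a small sketch. This is not the authors' formulation: the abstract does not give the similarity measure, the mapping to beta parameters, or the decision rule, so the cosine similarity, the `strength` pseudo-count, the `threshold`, and all function names below are assumptions chosen for illustration. The paper infers segmentation with variational Bayesian EM, whereas this sketch simply thresholds the prior mean.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic-proportion vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def beta_prior_from_similarity(sim, strength=10.0, eps=1e-3):
    """Hypothetical mapping from a similarity in [0, 1] to Beta(a, b) parameters.

    The Beta distribution is placed over the probability that two adjacent
    sentences belong to the same document; `strength` acts as a pseudo-count.
    """
    sim = min(max(sim, eps), 1.0 - eps)  # keep both parameters strictly positive
    return strength * sim, strength * (1.0 - sim)


def segment_stream(topic_vectors, threshold=0.5):
    """Group sentence indices into pseudo-documents.

    A new pseudo-document starts whenever the prior mean a / (a + b) of
    "same document" for an adjacent sentence pair falls below `threshold`.
    """
    if not topic_vectors:
        return []
    segments = [[0]]
    for i in range(1, len(topic_vectors)):
        sim = cosine_similarity(topic_vectors[i - 1], topic_vectors[i])
        a, b = beta_prior_from_similarity(sim)
        if a / (a + b) >= threshold:
            segments[-1].append(i)  # topic-coherent: extend current pseudo-document
        else:
            segments.append([i])    # topic shift: open a new pseudo-document
    return segments


# Toy stream: two sentences dominated by topic 0, then two dominated by topic 1.
stream = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(segment_stream(stream))
```

In a full model the beta prior would be combined with the likelihood of the words under each candidate segmentation rather than thresholded directly; the sketch only shows how sentence-level topic similarity can be turned into a prior belief about document boundaries.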
