层次分割的潜在狄利克雷学习

2012 IEEE International Workshop on Machine Learning for Signal Processing Pub Date : 2012-11-12 DOI:10.1109/MLSP.2012.6349772

Jen-Tzung Chien, C. Chueh

{"title":"层次分割的潜在狄利克雷学习","authors":"Jen-Tzung Chien, C. Chueh","doi":"10.1109/MLSP.2012.6349772","DOIUrl":null,"url":null,"abstract":"Topic model can be established by using Dirichlet distributions as the prior model to characterize latent topics in natural language. However, topics in real-world stream data are non-stationary. Training a reliable topic model is a challenging study. Further, the usage of words in different paragraphs within a document is varied due to different composition styles. This study presents a hierarchical segmentation model by compensating the heterogeneous topics in stream level and the heterogeneous words in document level. The topic similarity between sentences is calculated to form a beta prior for stream-level segmentation. This segmentation prior is adopted to group topic-coherent sentences into a document. For each pseudo-document, we incorporate a Markov chain to detect stylistic segments within a document. The words in a segment are generated by identical composition style. This new model is inferred by a variational Bayesian EM procedure. Experimental results show benefits by using the proposed model in terms of perplexity and F measure.","PeriodicalId":262601,"journal":{"name":"2012 IEEE International Workshop on Machine Learning for Signal Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Latent Dirichlet learning for hierarchical segmentation\",\"authors\":\"Jen-Tzung Chien, C. Chueh\",\"doi\":\"10.1109/MLSP.2012.6349772\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Topic model can be established by using Dirichlet distributions as the prior model to characterize latent topics in natural language. However, topics in real-world stream data are non-stationary. Training a reliable topic model is a challenging study. Further, the usage of words in different paragraphs within a document is varied due to different composition styles. This study presents a hierarchical segmentation model by compensating the heterogeneous topics in stream level and the heterogeneous words in document level. The topic similarity between sentences is calculated to form a beta prior for stream-level segmentation. This segmentation prior is adopted to group topic-coherent sentences into a document. For each pseudo-document, we incorporate a Markov chain to detect stylistic segments within a document. The words in a segment are generated by identical composition style. This new model is inferred by a variational Bayesian EM procedure. Experimental results show benefits by using the proposed model in terms of perplexity and F measure.\",\"PeriodicalId\":262601,\"journal\":{\"name\":\"2012 IEEE International Workshop on Machine Learning for Signal Processing\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-11-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE International Workshop on Machine Learning for Signal Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MLSP.2012.6349772\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE International Workshop on Machine Learning for Signal Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MLSP.2012.6349772","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

利用Dirichlet分布作为自然语言中潜在主题的先验模型，可以建立主题模型。然而，现实世界流数据中的主题是非平稳的。训练一个可靠的主题模型是一项具有挑战性的研究。此外，由于不同的组合风格，文档中不同段落中单词的用法也有所不同。本文提出了一种层次化的分词模型，在流级对异构主题进行补偿，在文档级对异构词进行补偿。计算句子之间的主题相似度，形成流级分割的beta先验。采用这种先验分割方法将主题连贯的句子分组到一个文档中。对于每个伪文档，我们结合一个马尔可夫链来检测文档中的风格片段。一个片段中的单词是由相同的组合样式生成的。这个新模型是由变分贝叶斯EM过程推断出来的。实验结果表明，该模型在模糊度和F测度方面具有较好的效果。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Latent Dirichlet learning for hierarchical segmentation

Topic model can be established by using Dirichlet distributions as the prior model to characterize latent topics in natural language. However, topics in real-world stream data are non-stationary. Training a reliable topic model is a challenging study. Further, the usage of words in different paragraphs within a document is varied due to different composition styles. This study presents a hierarchical segmentation model by compensating the heterogeneous topics in stream level and the heterogeneous words in document level. The topic similarity between sentences is calculated to form a beta prior for stream-level segmentation. This segmentation prior is adopted to group topic-coherent sentences into a document. For each pseudo-document, we incorporate a Markov chain to detect stylistic segments within a document. The words in a segment are generated by identical composition style. This new model is inferred by a variational Bayesian EM procedure. Experimental results show benefits by using the proposed model in terms of perplexity and F measure.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2012 IEEE International Workshop on Machine Learning for Signal Processing

自引率

0.00%

发文量