{"title":"使用主题模型的口语文档检索","authors":"Xinhui Hu, R. Isotani, Satoshi Nakamura","doi":"10.1145/1667780.1667862","DOIUrl":null,"url":null,"abstract":"In this paper, we propose a document topic model (DTM) based on the non-negative matrix factorization (NMF) approach to explore spontaneous spoken document retrieval. The model uses latent semantic indexing to detect underlying semantic relationships within documents. Each document is interpreted as a generative topic model belonging to many topics. The relevance of a document to a query is expressed by the probability of a query being generated by the model. The term-document matrix used for NMF is built stochastically from the speech recognition N-best results, so that multiple recognition hypotheses can be utilized to compensate for the word recognition errors. Using this approach, experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ), with 39 queries for over 600 hours of spontaneous Japanese speech. The retrieval performance of this model is proved to be superior to the conventional vector space model (VSM) when the dimension or topic number exceeds a certain threshold. Moreover, whether from the viewpoint of retrieval performance or the ability of topic expression, the NMF-based topic model is verified to surpass another latent indexing method that is based on the singular value decomposition (SVD). The extent to which this topic model can resist speech recognition error, which is a special problem of spoken document retrieval, is also investigated.","PeriodicalId":103128,"journal":{"name":"Proceedings of the 3rd International Universal Communication Symposium","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Spoken document retrieval using topic models\",\"authors\":\"Xinhui Hu, R. Isotani, Satoshi Nakamura\",\"doi\":\"10.1145/1667780.1667862\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we propose a document topic model (DTM) based on the non-negative matrix factorization (NMF) approach to explore spontaneous spoken document retrieval. The model uses latent semantic indexing to detect underlying semantic relationships within documents. Each document is interpreted as a generative topic model belonging to many topics. The relevance of a document to a query is expressed by the probability of a query being generated by the model. The term-document matrix used for NMF is built stochastically from the speech recognition N-best results, so that multiple recognition hypotheses can be utilized to compensate for the word recognition errors. Using this approach, experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ), with 39 queries for over 600 hours of spontaneous Japanese speech. The retrieval performance of this model is proved to be superior to the conventional vector space model (VSM) when the dimension or topic number exceeds a certain threshold. Moreover, whether from the viewpoint of retrieval performance or the ability of topic expression, the NMF-based topic model is verified to surpass another latent indexing method that is based on the singular value decomposition (SVD). 
Abstract: In this paper, we propose a document topic model (DTM) based on non-negative matrix factorization (NMF) to explore spontaneous spoken document retrieval. The model uses latent semantic indexing to detect underlying semantic relationships within documents. Each document is interpreted as a generative topic model spanning many topics, and the relevance of a document to a query is expressed by the probability that the query is generated by that document's model. The term-document matrix used for NMF is built stochastically from the N-best speech recognition results, so that multiple recognition hypotheses can be used to compensate for word recognition errors. Using this approach, experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ), with 39 queries over more than 600 hours of spontaneous Japanese speech. The retrieval performance of this model is shown to be superior to that of the conventional vector space model (VSM) once the number of topics (the factorization dimension) exceeds a certain threshold. Moreover, in terms of both retrieval performance and topic expressiveness, the NMF-based topic model is shown to outperform a latent indexing method based on singular value decomposition (SVD). We also investigate the extent to which the topic model can withstand speech recognition errors, a problem specific to spoken document retrieval.
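To make the pipeline described in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of its two central ideas: a term-document matrix built stochastically from weighted N-best recognition hypotheses, and an NMF factorization in which a query is scored by the probability of being generated by a document's topic mixture. The toy corpus, the posterior weights, the weighting scheme (expected counts), and the scoring details are assumptions for illustration; the factorization uses scikit-learn's NMF.

```python
# Illustrative sketch only: NMF-based latent-topic retrieval in the spirit of
# the abstract. Corpus, N-best posteriors, and scoring are assumed for demo.
import numpy as np
from sklearn.decomposition import NMF

# --- 1. Term-document matrix from N-best hypotheses (assumed scheme). ---
# Each document is a list of (hypothesis_word_list, posterior_weight) pairs;
# a term's count in a document is its posterior-weighted expected count, so
# words from lower-ranked hypotheses still contribute fractional mass.
docs_nbest = [
    [(["topic", "model", "retrieval"], 0.6), (["topic", "model", "retrievals"], 0.4)],
    [(["speech", "recognition", "error"], 0.7), (["speech", "recognition", "errors"], 0.3)],
    [(["spoken", "document", "retrieval", "model"], 1.0)],
]

vocab = sorted({w for doc in docs_nbest for hyp, _ in doc for w in hyp})
word_id = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(docs_nbest), len(vocab)))          # documents x terms
for d, doc in enumerate(docs_nbest):
    for hyp, weight in doc:
        for w in hyp:
            X[d, word_id[w]] += weight               # expected count

# --- 2. Factorize: X ~ W (doc-topic) x H (topic-term). ---
K = 2                                                # number of latent topics
nmf = NMF(n_components=K, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                             # shape (n_docs, K)
H = nmf.components_                                  # shape (K, n_terms)

# --- 3. Rank documents by the probability of generating the query. ---
# p(w | d) is read off the normalized reconstructed row of document d.
def rank(query_words):
    recon = W @ H                                    # reconstructed counts
    p_w_given_d = recon / (recon.sum(axis=1, keepdims=True) + 1e-12)
    scores = []
    for d in range(len(docs_nbest)):
        logp = sum(np.log(p_w_given_d[d, word_id[w]] + 1e-12)
                   for w in query_words if w in word_id)
        scores.append(logp)
    return np.argsort(scores)[::-1]                  # document indices, best first

print(rank(["topic", "retrieval"]))                  # e.g. ranks document 0 first
```

The expected-count construction is one plausible way to realize the "stochastically built" term-document matrix: recognition errors in the 1-best hypothesis are softened because alternative hypotheses keep the correct word in the matrix with a fractional weight.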