使用主题模型的口语文档检索

Xinhui Hu, R. Isotani, Satoshi Nakamura
{"title":"使用主题模型的口语文档检索","authors":"Xinhui Hu, R. Isotani, Satoshi Nakamura","doi":"10.1145/1667780.1667862","DOIUrl":null,"url":null,"abstract":"In this paper, we propose a document topic model (DTM) based on the non-negative matrix factorization (NMF) approach to explore spontaneous spoken document retrieval. The model uses latent semantic indexing to detect underlying semantic relationships within documents. Each document is interpreted as a generative topic model belonging to many topics. The relevance of a document to a query is expressed by the probability of a query being generated by the model. The term-document matrix used for NMF is built stochastically from the speech recognition N-best results, so that multiple recognition hypotheses can be utilized to compensate for the word recognition errors. Using this approach, experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ), with 39 queries for over 600 hours of spontaneous Japanese speech. The retrieval performance of this model is proved to be superior to the conventional vector space model (VSM) when the dimension or topic number exceeds a certain threshold. Moreover, whether from the viewpoint of retrieval performance or the ability of topic expression, the NMF-based topic model is verified to surpass another latent indexing method that is based on the singular value decomposition (SVD). The extent to which this topic model can resist speech recognition error, which is a special problem of spoken document retrieval, is also investigated.","PeriodicalId":103128,"journal":{"name":"Proceedings of the 3rd International Universal Communication Symposium","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Spoken document retrieval using topic models\",\"authors\":\"Xinhui Hu, R. Isotani, Satoshi Nakamura\",\"doi\":\"10.1145/1667780.1667862\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we propose a document topic model (DTM) based on the non-negative matrix factorization (NMF) approach to explore spontaneous spoken document retrieval. The model uses latent semantic indexing to detect underlying semantic relationships within documents. Each document is interpreted as a generative topic model belonging to many topics. The relevance of a document to a query is expressed by the probability of a query being generated by the model. The term-document matrix used for NMF is built stochastically from the speech recognition N-best results, so that multiple recognition hypotheses can be utilized to compensate for the word recognition errors. Using this approach, experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ), with 39 queries for over 600 hours of spontaneous Japanese speech. The retrieval performance of this model is proved to be superior to the conventional vector space model (VSM) when the dimension or topic number exceeds a certain threshold. Moreover, whether from the viewpoint of retrieval performance or the ability of topic expression, the NMF-based topic model is verified to surpass another latent indexing method that is based on the singular value decomposition (SVD). The extent to which this topic model can resist speech recognition error, which is a special problem of spoken document retrieval, is also investigated.\",\"PeriodicalId\":103128,\"journal\":{\"name\":\"Proceedings of the 3rd International Universal Communication Symposium\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-12-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 3rd International Universal Communication Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1667780.1667862\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 3rd International Universal Communication Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1667780.1667862","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

摘要

本文提出了一种基于非负矩阵分解(NMF)方法的文档主题模型(DTM)来探索自发语音文档检索。该模型使用潜在语义索引来检测文档中的底层语义关系。每个文档被解释为属于多个主题的生成主题模型。文档与查询的相关性由模型生成查询的概率表示。用于NMF的词-文档矩阵是由语音识别的n个最佳结果随机构建的,因此可以利用多个识别假设来补偿词识别误差。使用这种方法,在自发日语语料库(CSJ)的测试集上进行了实验,其中包含600多个小时的自发日语语音的39个查询。当维度或主题数超过一定阈值时,该模型的检索性能优于传统向量空间模型(VSM)。此外,无论是从检索性能还是主题表达能力方面,都验证了基于nmf的主题模型优于另一种基于奇异值分解(SVD)的潜在索引方法。本文还研究了该主题模型在多大程度上能够抵抗语音识别错误,这是语音文档检索中的一个特殊问题。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Spoken document retrieval using topic models
In this paper, we propose a document topic model (DTM) based on the non-negative matrix factorization (NMF) approach to explore spontaneous spoken document retrieval. The model uses latent semantic indexing to detect underlying semantic relationships within documents. Each document is interpreted as a generative topic model belonging to many topics. The relevance of a document to a query is expressed by the probability of a query being generated by the model. The term-document matrix used for NMF is built stochastically from the speech recognition N-best results, so that multiple recognition hypotheses can be utilized to compensate for the word recognition errors. Using this approach, experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ), with 39 queries for over 600 hours of spontaneous Japanese speech. The retrieval performance of this model is proved to be superior to the conventional vector space model (VSM) when the dimension or topic number exceeds a certain threshold. Moreover, whether from the viewpoint of retrieval performance or the ability of topic expression, the NMF-based topic model is verified to surpass another latent indexing method that is based on the singular value decomposition (SVD). The extent to which this topic model can resist speech recognition error, which is a special problem of spoken document retrieval, is also investigated.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信