Chun-an Chan, Lin-Shan Lee
IEEE Transactions on Audio Speech and Language Processing, July 2013
DOI: 10.1109/TASL.2013.2248714
Model-Based Unsupervised Spoken Term Detection with Spoken Queries
We present a set of model-based approaches for unsupervised spoken term detection (STD) with spoken queries that require neither speech recognition nor annotated data. This work demonstrates the possibility of migrating from DTW-based to model-based approaches for unsupervised STD. The proposed approach consists of three components: self-organizing models, query matching, and query modeling. To construct the self-organizing models, repeated patterns are captured and modeled using acoustic segment models (ASMs). In the query matching phase, a document state matching (DSM) approach is proposed to represent documents as ASM sequences, which are matched to the query frames. In this way, not only do the ASMs better model the signal distributions and time trajectories of speech, but the much smaller number of states than frames for the documents leads to a much lower computational load. A novel duration-constrained Viterbi (DC-Vite) algorithm is further proposed for the above matching process to handle the speaking rate distortion problem. In the query modeling phase, a pseudo likelihood ratio (PLR) approach is proposed in the pseudo relevance feedback (PRF) framework. A likelihood ratio evaluated with query/anti-query HMMs trained with pseudo relevant/irrelevant examples is used to verify the detected spoken term hypotheses. The proposed framework demonstrates the usefulness of ASMs for STD in zero-resource settings and the potential of an instantly responding STD system using ASM indexing. The best performance is achieved by integrating DTW-based approaches into the rescoring steps in the proposed framework. Experimental results show an absolute improvement of 14.2% in mean average precision, with a 77% reduction in CPU time, compared with the segmental DTW approach on a Mandarin broadcast news corpus. Consistent improvements were found on the TIMIT and MediaEval 2011 Spoken Web Search corpora.
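The duration-constrained matching described in the abstract can be illustrated with a minimal dynamic-programming sketch. This is not the paper's DC-Vite algorithm, only an assumed simplification of the idea: query frames are aligned left-to-right against a document's ASM state sequence, and every state must absorb between `d_min` and `d_max` frames, which bounds how far speaking-rate differences can stretch or compress the alignment. The function name `dc_viterbi` and the per-frame negative-squared-distance scoring proxy are illustrative assumptions; a real system would use full ASM emission likelihoods.

```python
import numpy as np

def dc_viterbi(frames, state_means, d_min=2, d_max=6):
    """Align query frames to a document's ASM state sequence, requiring
    each state to cover between d_min and d_max consecutive frames.

    frames      : (T, F) array of query feature vectors
    state_means : (S, F) array, one mean vector per ASM state
    Returns the best total log-score, or -inf if no valid alignment exists.
    """
    T, S = len(frames), len(state_means)
    # Per-frame log-score proxy: negative squared distance to each state mean.
    logp = -((frames[:, None, :] - state_means[None, :, :]) ** 2).sum(-1)

    NEG = float("-inf")
    # dp[t][s]: best score aligning frames[:t] to states[:s],
    # with state s's segment ending exactly at frame t.
    dp = np.full((T + 1, S + 1), NEG)
    dp[0][0] = 0.0
    # Cumulative per-state frame scores, so each segment sum is O(1).
    cum = np.vstack([np.zeros((1, S)), np.cumsum(logp, axis=0)])
    for s in range(1, S + 1):
        # s states need at least s * d_min frames.
        for t in range(s * d_min, T + 1):
            for d in range(d_min, min(d_max, t) + 1):
                prev = dp[t - d][s - 1]
                if prev > NEG:
                    seg = cum[t, s - 1] - cum[t - d, s - 1]
                    dp[t][s] = max(dp[t][s], prev + seg)
    return dp[T][S]
```

Because the document side is a short state sequence rather than raw frames, the inner loops run over S states instead of the document's frame count, which reflects the computational advantage the abstract attributes to DSM over frame-level DTW.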
Journal introduction:
The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.