Unsupervised discovery of linguistic structure including two-level acoustic patterns using three cascaded stages of iterative optimization

2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Pub Date: 2013-05-26. DOI: 10.1109/ICASSP.2013.6639239

Cheng-Tao Chung, Chun-an Chan, Lin-Shan Lee


Citations: 28

Abstract

Techniques for unsupervised discovery of acoustic patterns are becoming increasingly attractive because huge quantities of speech data are becoming available while manual annotations remain hard to acquire. In this paper, we propose an approach for unsupervised discovery of the linguistic structure of a target spoken language given only raw speech data. This linguistic structure includes two-level (subword-like and word-like) acoustic patterns, a lexicon that expresses word-like patterns in terms of subword-like patterns, and an N-gram language model based on word-like patterns. All patterns, models, and parameters are learned automatically from the unlabelled speech corpus. This is achieved by an initialization step followed by three cascaded stages of acoustic, linguistic, and lexical iterative optimization. The lexicon of word-like patterns defines the allowed consecutive sequences of HMMs for subword-like patterns. In each iteration, model training and decoding produce updated labels, from which the lexicon and HMMs can be further updated. In this way, model parameters and decoded labels are optimized in turn in each iteration, and knowledge about the linguistic structure is learned gradually, layer after layer. The proposed approach was tested in preliminary experiments on a corpus of Mandarin broadcast news, including a spoken term detection task whose performance was compared with a parallel test using models trained in a supervised way. Results show that the proposed system not only yields reasonable performance on its own, but is also complementary to existing large vocabulary ASR systems.
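The alternation the abstract describes, in which models are re-trained from the current labels and the data is then re-decoded to refresh those labels, can be illustrated with a deliberately small sketch. This is not the authors' system: as a stated assumption, each pattern is modelled by a single scalar mean rather than an HMM, frames are one-dimensional numbers, and no lexicon or N-gram language model is involved. The names `train`, `decode`, and `alternate` are hypothetical and chosen only for this illustration.

```python
from statistics import mean


def train(frames, labels, old_means):
    """Re-estimate one mean per pattern from the frames currently assigned to it."""
    new_means = []
    for k, old in enumerate(old_means):
        members = [x for x, z in zip(frames, labels) if z == k]
        # Assumption: a pattern that lost all its frames keeps its previous mean.
        new_means.append(mean(members) if members else old)
    return new_means


def decode(frames, means):
    """Relabel each frame with the index of the closest pattern mean."""
    return [
        min(range(len(means)), key=lambda k: (x - means[k]) ** 2)
        for x in frames
    ]


def alternate(frames, init_means, max_iters=20):
    """Alternate training and decoding until the labels stop changing."""
    means = list(init_means)
    labels = decode(frames, means)  # initialization step
    for _ in range(max_iters):
        means = train(frames, labels, means)   # update model parameters
        new_labels = decode(frames, means)     # update decoded labels
        if new_labels == labels:
            break
        labels = new_labels
    return means, labels


# Toy data: two clusters of scalar "frames" (hypothetical values).
frames = [0.9, 1.1, 1.0, 5.2, 4.8, 5.0]
means, labels = alternate(frames, init_means=[0.0, 3.0])
print(labels)
```

In the approach summarized above, the same train-then-decode loop would operate on HMMs, with the lexicon and language model constraining which label sequences the decoder may output; the sketch only shows the alternating structure.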

