通过说话人选择和环境变换实现鲁棒语音识别

2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Pub Date : 2012-03-25 DOI:10.1109/ICASSP.2012.6288878

Raghavendra Bilgi, Vikas Joshi, S. Umesh, Luz García, M. C. Benítez

{"title":"通过说话人选择和环境变换实现鲁棒语音识别","authors":"Raghavendra Bilgi, Vikas Joshi, S. Umesh, Luz García, M. C. Benítez","doi":"10.1109/ICASSP.2012.6288878","DOIUrl":null,"url":null,"abstract":"In this paper, we address the problem of robustness to both noise and speaker-variability in automatic speech recognition (ASR). We propose the use of pre-computed Noise and Speaker transforms, and an optimal combination of these two transforms are chosen during test using maximum-likelihood (ML) criterion. These pre-computed transforms are obtained during training by using data obtained from different noise conditions that are usually encountered for that particular ASR task. The environment transforms are obtained during training using constrained-MLLR (CMLLR) framework, while for speaker-transforms we use the analytically determined linear-VTLN matrices. Even though the exact noise environment may not be encountered during test, the ML-based choice of the closest Environment transform provides “sufficient” cleaning and this is corroborated by experimental results with performance comparable to histogram equalization or Vector Taylor Series approaches on Aurora-2 task. The proposed method is simple since it involves only the choice of pre-computed environment and speaker transforms and therefore, can be applied with very little test data unlike many other speaker and noise-compensation methods.","PeriodicalId":6443,"journal":{"name":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"36 1","pages":"4333-4336"},"PeriodicalIF":0.0000,"publicationDate":"2012-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Robust speech recognition through selection of speaker and environment transforms\",\"authors\":\"Raghavendra Bilgi, Vikas Joshi, S. Umesh, Luz García, M. C. Benítez\",\"doi\":\"10.1109/ICASSP.2012.6288878\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we address the problem of robustness to both noise and speaker-variability in automatic speech recognition (ASR). We propose the use of pre-computed Noise and Speaker transforms, and an optimal combination of these two transforms are chosen during test using maximum-likelihood (ML) criterion. These pre-computed transforms are obtained during training by using data obtained from different noise conditions that are usually encountered for that particular ASR task. The environment transforms are obtained during training using constrained-MLLR (CMLLR) framework, while for speaker-transforms we use the analytically determined linear-VTLN matrices. Even though the exact noise environment may not be encountered during test, the ML-based choice of the closest Environment transform provides “sufficient” cleaning and this is corroborated by experimental results with performance comparable to histogram equalization or Vector Taylor Series approaches on Aurora-2 task. The proposed method is simple since it involves only the choice of pre-computed environment and speaker transforms and therefore, can be applied with very little test data unlike many other speaker and noise-compensation methods.\",\"PeriodicalId\":6443,\"journal\":{\"name\":\"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)\",\"volume\":\"36 1\",\"pages\":\"4333-4336\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-03-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICASSP.2012.6288878\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICASSP.2012.6288878","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

在本文中，我们讨论了自动语音识别(ASR)中对噪声和说话人变异性的鲁棒性问题。我们建议使用预先计算的噪声和扬声器变换，并在测试中使用最大似然(ML)标准选择这两个变换的最佳组合。这些预先计算的变换是在训练过程中通过使用从特定ASR任务通常遇到的不同噪声条件中获得的数据获得的。环境变换在训练过程中使用约束- mllr (cllr)框架，而说话人变换则使用解析确定的线性- vtln矩阵。尽管在测试过程中可能不会遇到确切的噪声环境，但基于ml的最接近环境变换的选择提供了“充分”的清洁，并且实验结果证实了这一点，其性能可与Aurora-2任务上的直方图均衡或矢量泰勒级数方法相媲美。该方法简单，因为它只涉及预先计算的环境和扬声器变换的选择，因此，与许多其他扬声器和噪声补偿方法不同，它可以应用于很少的测试数据。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Robust speech recognition through selection of speaker and environment transforms

In this paper, we address the problem of robustness to both noise and speaker-variability in automatic speech recognition (ASR). We propose the use of pre-computed Noise and Speaker transforms, and an optimal combination of these two transforms are chosen during test using maximum-likelihood (ML) criterion. These pre-computed transforms are obtained during training by using data obtained from different noise conditions that are usually encountered for that particular ASR task. The environment transforms are obtained during training using constrained-MLLR (CMLLR) framework, while for speaker-transforms we use the analytically determined linear-VTLN matrices. Even though the exact noise environment may not be encountered during test, the ML-based choice of the closest Environment transform provides “sufficient” cleaning and this is corroborated by experimental results with performance comparable to histogram equalization or Vector Taylor Series approaches on Aurora-2 task. The proposed method is simple since it involves only the choice of pre-computed environment and speaker transforms and therefore, can be applied with very little test data unlike many other speaker and noise-compensation methods.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

自引率

0.00%

发文量