Multi-task deep neural network acoustic models with model adaptation using discriminative speaker identity for whisper recognition

Jingjie Li, I. Mcloughlin, Cong Liu, Shaofei Xue, Si Wei
DOI: 10.1109/ICASSP.2015.7178916
Published in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015-04-19
Citations: 3

Abstract

This paper presents a study on whisper large vocabulary continuous speech recognition (wLVCSR). wLVCSR makes it possible to use ASR equipment in public places without concern for disturbing others or leaking private information. However, the task of wLVCSR is much more challenging than normal LVCSR due to the absence of pitch, which not only causes the signal-to-noise ratio (SNR) of whispers to be much lower than that of normal speech but also leads to flatness and formant shifts in whisper spectra. Furthermore, the amount of whisper data available for training is much smaller than for normal speech. In this paper, multi-task deep neural network (DNN) acoustic models are deployed to address these problems. Moreover, model adaptation is performed on the multi-task DNN to normalize speaker and environmental variability in whispers based on discriminative speaker identity information. On a Mandarin whisper dictation task with 55 hours of whisper data, the proposed SI multi-task DNN model achieves a 56.7% character error rate (CER) improvement over a baseline Gaussian Mixture Model (GMM) discriminatively trained on the whisper data alone. In addition, the CER of the proposed model on normal speech reaches 15.2%, close to the performance of a state-of-the-art DNN trained with one thousand hours of speech data. From this baseline, the model-adapted DNN gains a further 10.9% CER reduction over the generic model.
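The abstract does not give implementation details, but the multi-task architecture it describes can be sketched as shared hidden layers feeding two task-specific softmax output layers (one over whisper senones, one over normal-speech senones), with a speaker identity vector concatenated to the acoustic input for adaptation. The following is a minimal forward-pass sketch under those assumptions; all dimensions, layer counts, and names are illustrative, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MultiTaskDNN:
    """Sketch of a multi-task acoustic model: two shared ReLU hidden
    layers, then separate softmax heads for whisper and normal-speech
    senone posteriors. The speaker identity vector is appended to the
    acoustic features (hypothetical adaptation scheme)."""

    def __init__(self, feat_dim, spk_dim, hidden_dim,
                 n_senones_whisper, n_senones_normal):
        in_dim = feat_dim + spk_dim
        self.W1 = rng.standard_normal((in_dim, hidden_dim)) * 0.01
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.standard_normal((hidden_dim, hidden_dim)) * 0.01
        self.b2 = np.zeros(hidden_dim)
        # Task-specific output layers.
        self.W_wh = rng.standard_normal((hidden_dim, n_senones_whisper)) * 0.01
        self.b_wh = np.zeros(n_senones_whisper)
        self.W_nm = rng.standard_normal((hidden_dim, n_senones_normal)) * 0.01
        self.b_nm = np.zeros(n_senones_normal)

    def forward(self, feats, spk_id, task):
        # Speaker identity vector is concatenated to the acoustic input.
        x = np.concatenate([feats, spk_id], axis=-1)
        h = np.maximum(0.0, x @ self.W1 + self.b1)  # shared ReLU layer 1
        h = np.maximum(0.0, h @ self.W2 + self.b2)  # shared ReLU layer 2
        if task == "whisper":
            return softmax(h @ self.W_wh + self.b_wh)
        return softmax(h @ self.W_nm + self.b_nm)

# Illustrative usage: one frame of 40-d features plus a 100-d speaker vector.
net = MultiTaskDNN(feat_dim=40, spk_dim=100, hidden_dim=256,
                   n_senones_whisper=500, n_senones_normal=500)
probs = net.forward(rng.standard_normal(40), rng.standard_normal(100), "whisper")
print(probs.shape)
```

During training, both heads would share gradients through the common layers, which is what lets the scarce whisper data benefit from the larger normal-speech corpus; adaptation would then update (or condition on) the speaker-dependent part per speaker.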