Multi-task deep neural network acoustic models with model adaptation using discriminative speaker identity for whisper recognition

Jingjie Li, I. Mcloughlin, Cong Liu, Shaofei Xue, Si Wei
DOI: 10.1109/ICASSP.2015.7178916
Published in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015-04-19
Citations: 3

Abstract

This paper presents a study on whisper large vocabulary continuous speech recognition (wLVCSR). wLVCSR makes it possible to use ASR equipment in public places without concern for disturbing others or leaking private information. However, the task of wLVCSR is much more challenging than normal LVCSR due to the absence of pitch, which not only causes the signal-to-noise ratio (SNR) of whispers to be much lower than that of normal speech but also leads to flatness and formant shifts in whisper spectra. Furthermore, the amount of whisper data available for training is much smaller than for normal speech. In this paper, multi-task deep neural network (DNN) acoustic models are deployed to address these problems. Moreover, model adaptation is performed on the multi-task DNN to normalize speaker and environmental variability in whispers based on discriminative speaker identity information. On a Mandarin whisper dictation task with 55 hours of whisper data, the proposed SI multi-task DNN model achieves a 56.7% character error rate (CER) improvement over a baseline Gaussian Mixture Model (GMM) discriminatively trained on the whisper data alone. In addition, the CER of the proposed model on normal speech reaches 15.2%, close to the performance of a state-of-the-art DNN trained with one thousand hours of speech data. From this baseline, the model-adapted DNN gains a further 10.9% CER reduction over the generic model.
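The abstract does not give implementation details, but the multi-task architecture it describes can be sketched as shared hidden layers feeding two task-specific softmax output layers (one over whisper senones, one over normal-speech senones), with a speaker identity vector concatenated to the acoustic input for adaptation. The following is a minimal forward-pass sketch under those assumptions; all dimensions, layer counts, and names are illustrative, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MultiTaskDNN:
    """Sketch of a multi-task acoustic model: two shared ReLU hidden
    layers, then separate softmax heads for whisper and normal-speech
    senone posteriors. The speaker identity vector is appended to the
    acoustic features (hypothetical adaptation scheme)."""

    def __init__(self, feat_dim, spk_dim, hidden_dim,
                 n_senones_whisper, n_senones_normal):
        in_dim = feat_dim + spk_dim
        self.W1 = rng.standard_normal((in_dim, hidden_dim)) * 0.01
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.standard_normal((hidden_dim, hidden_dim)) * 0.01
        self.b2 = np.zeros(hidden_dim)
        # Task-specific output layers.
        self.W_wh = rng.standard_normal((hidden_dim, n_senones_whisper)) * 0.01
        self.b_wh = np.zeros(n_senones_whisper)
        self.W_nm = rng.standard_normal((hidden_dim, n_senones_normal)) * 0.01
        self.b_nm = np.zeros(n_senones_normal)

    def forward(self, feats, spk_id, task):
        # Speaker identity vector is concatenated to the acoustic input.
        x = np.concatenate([feats, spk_id], axis=-1)
        h = np.maximum(0.0, x @ self.W1 + self.b1)  # shared ReLU layer 1
        h = np.maximum(0.0, h @ self.W2 + self.b2)  # shared ReLU layer 2
        if task == "whisper":
            return softmax(h @ self.W_wh + self.b_wh)
        return softmax(h @ self.W_nm + self.b_nm)

# Illustrative usage: one frame of 40-d features plus a 100-d speaker vector.
net = MultiTaskDNN(feat_dim=40, spk_dim=100, hidden_dim=256,
                   n_senones_whisper=500, n_senones_normal=500)
probs = net.forward(rng.standard_normal(40), rng.standard_normal(100), "whisper")
print(probs.shape)
```

During training, both heads would share gradients through the common layers, which is what lets the scarce whisper data benefit from the larger normal-speech corpus; adaptation would then update (or condition on) the speaker-dependent part per speaker.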