Cross-domain speech recognition using nonparallel corpora with cycle-consistent adversarial networks

M. Mimura, S. Sakai, Tatsuya Kawahara
{"title":"Cross-domain speech recognition using nonparallel corpora with cycle-consistent adversarial networks","authors":"M. Mimura, S. Sakai, Tatsuya Kawahara","doi":"10.1109/ASRU.2017.8268927","DOIUrl":null,"url":null,"abstract":"Automatic speech recognition (ASR) systems often does not perform well when it is used in a different acoustic domain from the training time, such as utterances spoken in noisy environments or in different speaking styles. We propose a novel approach to cross-domain speech recognition based on acoustic feature mappings provided by a deep neural network, which is trained using nonparallel speech corpora from two different domains and using no phone labels. For training a target domain acoustic model, we generate “fake” target speech features from the labeleld source domain features using a mapping Gf. We can also generate “fake” source features for testing from the target features using the backward mapping Gb which has been learned simultaneously with G f. The mappings G f and Gb are trained as adversarial networks using a conventional adversarial loss and a cycle-consistency loss criterion that encourages the backward mapping to bring the translated feature back to the original as much as possible such that Gb(Gf (x)) ≈ x. In a highly challenging task of model adaptation only using domain speech features, our method achieved up to 16 % relative improvements in WER in the evaluation using the CHiME3 real test data. 
The backward mapping was also confirmed to be effective with a speaking style adaptation task.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"27","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU.2017.8268927","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 27

Abstract

Automatic speech recognition (ASR) systems often do not perform well when used in an acoustic domain different from that of the training data, such as utterances spoken in noisy environments or in different speaking styles. We propose a novel approach to cross-domain speech recognition based on acoustic feature mappings provided by a deep neural network, which is trained using nonparallel speech corpora from two different domains and no phone labels. For training a target-domain acoustic model, we generate “fake” target speech features from the labeled source-domain features using a mapping Gf. We can also generate “fake” source features for testing from the target features using the backward mapping Gb, which is learned simultaneously with Gf. The mappings Gf and Gb are trained as adversarial networks using a conventional adversarial loss and a cycle-consistency loss criterion that encourages the backward mapping to bring a translated feature back to the original as closely as possible, such that Gb(Gf(x)) ≈ x. In a highly challenging task of model adaptation using only domain speech features, our method achieved up to 16% relative improvement in WER in the evaluation using the CHiME3 real test data. The backward mapping was also confirmed to be effective in a speaking style adaptation task.
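The cycle-consistency constraint Gb(Gf(x)) ≈ x described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the paper uses deep networks for Gf and Gb, whereas here they are toy linear maps, and the feature dimension, batch size, and weighting constant `lam` are all assumptions made purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the forward/backward feature mappings Gf and Gb.
# In the paper these are deep networks trained on nonparallel corpora;
# single linear layers are used here only for illustration.
W_f = rng.normal(size=(4, 4))   # Gf: source-domain -> target-domain features
W_b = rng.normal(size=(4, 4))   # Gb: target-domain -> source-domain features

def Gf(x):
    return x @ W_f.T

def Gb(y):
    return y @ W_b.T

def cycle_consistency_loss(x_src, y_tgt):
    """L1 penalty encouraging Gb(Gf(x)) ≈ x and Gf(Gb(y)) ≈ y."""
    return (np.abs(Gb(Gf(x_src)) - x_src).mean()
            + np.abs(Gf(Gb(y_tgt)) - y_tgt).mean())

x = rng.normal(size=(8, 4))     # a batch of source-domain acoustic features
y = rng.normal(size=(8, 4))     # a batch of target-domain acoustic features

lam = 10.0  # assumed weighting; the full objective also adds the
            # adversarial losses for Gf and Gb (not shown here)
loss_cyc = lam * cycle_consistency_loss(x, y)
```

In the full CycleGAN-style objective, this term is added to the two adversarial losses, so that Gf and Gb produce domain-plausible features while remaining (approximately) inverses of each other.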