Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition
M. Mimura, Sei Ueno, H. Inaguma, S. Sakai, Tatsuya Kawahara
2018 IEEE Spoken Language Technology Workshop (SLT), December 2018
DOI: 10.1109/SLT.2018.8639589
Cited by: 41
Abstract
Encoder-decoder models for acoustic-to-word (A2W) automatic speech recognition (ASR) are attractive for their architectural simplicity and low run-time latency while achieving state-of-the-art performance. However, word-based models commonly suffer from the out-of-vocabulary (OOV) word problem. They also cannot leverage text data to improve their language modeling capability. Recently, sequence-to-sequence neural speech synthesis models trainable from corpora have been developed and shown to achieve naturalness comparable to recorded human speech. In this paper, we explore how current speech synthesis technology can be leveraged to tailor an ASR system to a target domain by preparing only a relevant text corpus. From a set of target-domain texts, we generate speech features using a sequence-to-sequence speech synthesizer. These artificial speech features, together with real speech features from conventional speech corpora, are used to train an attention-based A2W model. Experimental results show that the proposed approach significantly improves word accuracy compared to the baseline trained only on real speech, even though the synthetic portion of the training data comes from a single female speaker's voice.
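To make the training recipe concrete, below is a minimal PyTorch sketch of the data-mixing idea the abstract describes: TTS-generated feature/word pairs are pooled with real ones and used to train a single attention-based A2W encoder-decoder. All names (MixedSpeechDataset, A2WModel) and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import random
import torch
import torch.nn as nn
from torch.utils.data import Dataset

class MixedSpeechDataset(Dataset):
    """Pools (acoustic-feature, word-id) pairs from a real speech corpus
    and a synthetic corpus produced by a seq2seq TTS model (assumed API)."""
    def __init__(self, real_pairs, synthetic_pairs):
        self.pairs = list(real_pairs) + list(synthetic_pairs)
        random.shuffle(self.pairs)  # mix real and synthetic examples

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]

class A2WModel(nn.Module):
    """Toy attention-based encoder-decoder mapping acoustic frames directly
    to word IDs (acoustic-to-word); sizes are guesses, not from the paper."""
    def __init__(self, feat_dim=80, hidden=256, vocab_size=10000):
        super().__init__()
        self.hidden = hidden
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTMCell(hidden + 2 * hidden, hidden)
        self.attn_proj = nn.Linear(hidden, 2 * hidden)
        self.out = nn.Linear(hidden + 2 * hidden, vocab_size)

    def forward(self, feats, prev_words):
        # feats: (B, T, feat_dim); prev_words: (B, U) teacher-forced history
        enc, _ = self.encoder(feats)                      # (B, T, 2H)
        batch = feats.size(0)
        h = feats.new_zeros(batch, self.hidden)
        c = feats.new_zeros(batch, self.hidden)
        logits = []
        for u in range(prev_words.size(1)):
            # Dot-product attention of the decoder state over encoder frames.
            query = self.attn_proj(h).unsqueeze(2)        # (B, 2H, 1)
            weights = torch.bmm(enc, query).squeeze(2).softmax(dim=1)
            context = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)
            step_in = torch.cat([self.embed(prev_words[:, u]), context], dim=1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.out(torch.cat([h, context], dim=1)))
        return torch.stack(logits, dim=1)                 # (B, U, vocab)
```

Training would minimize cross-entropy between the logits and the next-word targets over batches drawn from MixedSpeechDataset, so the model sees real and synthetic features in the same training stream, which is the core of the approach the abstract outlines.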