Fengyu Yang, Shan Yang, Pengcheng Zhu, Pengju Yan, Lei Xie
Improving Mandarin End-to-End Speech Synthesis by Self-Attention and Learnable Gaussian Bias
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003949
Citations: 14
Abstract
Compared to conventional speech synthesis, end-to-end speech synthesis achieves much better naturalness with a much simpler system-building pipeline. An end-to-end framework can generate natural speech directly from characters for English, but for other languages such as Chinese, recent studies have indicated that extra engineered features, e.g., word boundaries and prosody boundaries, are still needed for model robustness and naturalness, which makes the front-end pipeline as complicated as in the traditional approach. To maintain the naturalness of the generated speech while discarding language-specific expertise as much as possible, we introduce a novel self-attention-based encoder with a learnable Gaussian bias into Tacotron for Mandarin TTS. We evaluate systems with and without complex prosody information, and the results show that the proposed approach generates stable and natural speech with minimal language-dependent front-end modules.
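The paper's exact formulation is not reproduced in this abstract; the core idea of adding a learnable Gaussian bias to self-attention can be sketched as follows. This is a minimal NumPy illustration under assumed details: a single head, a scalar learnable width `sigma`, and a bias centered on each query's own position so that attention is softly localized.

```python
import numpy as np

def gaussian_biased_attention(Q, K, V, sigma):
    """Scaled dot-product self-attention with a Gaussian locality bias.

    Q, K, V: (T, d) arrays for a length-T sequence.
    sigma:   assumed scalar learnable width; smaller sigma means a
             narrower attention window around each query position.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (T, T) attention logits

    # Gaussian bias: penalize key positions far from the query position.
    pos = np.arange(T)
    dist2 = (pos[:, None] - pos[None, :]) ** 2         # squared position distance
    scores = scores - dist2 / (2.0 * sigma ** 2)

    # Numerically stable softmax over key positions.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (T, d) context vectors

# Illustrative usage with random inputs.
rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
out = gaussian_biased_attention(Q, K, V, sigma=2.0)
```

In a trained model `sigma` would be a parameter updated by gradient descent; here it is fixed only to show how the bias reshapes the attention distribution toward local context.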