FastTacotron: A Fast, Robust and Controllable Method for Speech Synthesis

D. V. Sang, Lam Thu
DOI: 10.1109/MAPR53640.2021.9585267
Published in: 2021 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), October 2021
Citations: 2

Abstract

Recent state-of-the-art neural text-to-speech models have significantly improved the quality of synthesized speech. However, previous methods still suffer from several problems: autoregressive models are slow at inference, while non-autoregressive models usually require a complicated, time- and memory-consuming training pipeline. This paper proposes a novel model called FastTacotron, an improved text-to-speech method based on ForwardTacotron. The proposed model keeps the recurrent Tacotron architecture but replaces its autoregressive attentive part with a single forward pass to accelerate inference. Specifically, the attention mechanism in Tacotron is replaced with a length regulator, like the one in FastSpeech, for parallel mel-spectrogram generation. Moreover, we introduce additional prosodic information (e.g., pitch, energy, and more accurate duration) as conditional inputs to make the duration predictor more accurate. Experiments show that our model matches state-of-the-art models in speech quality and inference speed, nearly eliminates word skipping and repetition in particularly hard cases, and allows control over the speed and pitch of the generated utterance. More importantly, our model converges in just a few hours of training, up to 11.2x faster than existing methods. Furthermore, the memory requirement of our model grows linearly with sequence length, making it possible to synthesize complete articles in one pass. Audio samples can be found at https://bit.ly/3xguaCW.
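The length regulator mentioned in the abstract is the mechanism (introduced by FastSpeech) that makes parallel mel-spectrogram generation and speed control possible: each encoder output is repeated according to its predicted duration, and scaling the durations by a factor changes the speaking rate. The sketch below is an illustrative NumPy version of this general idea, not the paper's actual implementation; the function name and the `alpha` speed-control parameter are assumptions for the example.

```python
import numpy as np

def length_regulator(hidden, durations, alpha=1.0):
    """Expand per-phoneme hidden states to frame level by predicted durations.

    hidden:    (T_phoneme, D) array of encoder outputs
    durations: (T_phoneme,) predicted mel-frame counts per phoneme
    alpha:     speed control factor (alpha < 1 gives fewer frames, i.e. faster speech)
    """
    # Scale and round durations; clip at zero so no phoneme gets a negative count.
    reps = np.maximum(np.round(np.asarray(durations) * alpha), 0).astype(int)
    # Repeat each phoneme's hidden vector for its number of frames.
    return np.repeat(hidden, reps, axis=0)

# Example: 3 phonemes with 2-dim states and durations [2, 1, 3] expand to 6 frames.
h = np.arange(6, dtype=float).reshape(3, 2)
out = length_regulator(h, [2, 1, 3])
assert out.shape == (6, 2)
```

Because the expansion is a single deterministic pass, total memory grows linearly with the output length, consistent with the abstract's claim about synthesizing complete articles in one pass.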