FastTacotron: A Fast, Robust and Controllable Method for Speech Synthesis

D. V. Sang, Lam Thu
DOI: 10.1109/MAPR53640.2021.9585267
Published in: 2021 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), October 2021
Citations: 2

Abstract

Recent state-of-the-art neural text-to-speech models have significantly improved the quality of synthesized speech. However, previous methods still suffer from several problems: autoregressive models are slow at inference, while non-autoregressive models usually require a complicated, time- and memory-consuming training pipeline. This paper proposes a novel model called FastTacotron, an improved text-to-speech method based on ForwardTacotron. The proposed model keeps the recurrent Tacotron architecture but replaces its autoregressive attentive part with a single forward pass to accelerate inference. Specifically, the attention mechanism in Tacotron is replaced with a length regulator, like the one in FastSpeech, for parallel mel-spectrogram generation. Moreover, we introduce additional prosodic information (e.g., pitch, energy, and more accurate duration) as conditional inputs to make the duration predictor more accurate. Experiments show that our model matches state-of-the-art models in speech quality and inference speed, nearly eliminates word skipping and repetition in particularly hard cases, and allows control over the speed and pitch of the generated utterance. More importantly, our model converges in just a few hours of training, up to 11.2x faster than existing methods. Furthermore, the memory requirement of our model grows linearly with sequence length, making it possible to synthesize complete articles in one pass. Audio samples can be found at https://bit.ly/3xguaCW.
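The length regulator mentioned in the abstract is the mechanism (introduced by FastSpeech) that makes parallel mel-spectrogram generation and speed control possible: each encoder output is repeated according to its predicted duration, and scaling the durations by a factor changes the speaking rate. The sketch below is an illustrative NumPy version of this general idea, not the paper's actual implementation; the function name and the `alpha` speed-control parameter are assumptions for the example.

```python
import numpy as np

def length_regulator(hidden, durations, alpha=1.0):
    """Expand per-phoneme hidden states to frame level by predicted durations.

    hidden:    (T_phoneme, D) array of encoder outputs
    durations: (T_phoneme,) predicted mel-frame counts per phoneme
    alpha:     speed control factor (alpha < 1 gives fewer frames, i.e. faster speech)
    """
    # Scale and round durations; clip at zero so no phoneme gets a negative count.
    reps = np.maximum(np.round(np.asarray(durations) * alpha), 0).astype(int)
    # Repeat each phoneme's hidden vector for its number of frames.
    return np.repeat(hidden, reps, axis=0)

# Example: 3 phonemes with 2-dim states and durations [2, 1, 3] expand to 6 frames.
h = np.arange(6, dtype=float).reshape(3, 2)
out = length_regulator(h, [2, 1, 3])
assert out.shape == (6, 2)
```

Because the expansion is a single deterministic pass, total memory grows linearly with the output length, consistent with the abstract's claim about synthesizing complete articles in one pass.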