Tacotron-Based Acoustic Model Using Phoneme Alignment for Practical Neural Text-to-Speech Systems

T. Okamoto, T. Toda, Y. Shiga, H. Kawai
{"title":"Tacotron-Based Acoustic Model Using Phoneme Alignment for Practical Neural Text-to-Speech Systems","authors":"T. Okamoto, T. Toda, Y. Shiga, H. Kawai","doi":"10.1109/ASRU46091.2019.9003956","DOIUrl":null,"url":null,"abstract":"Although sequence-to-sequence (seq2seq) models with attention mechanism in neural text-to-speech (TTS) systems, such as Tacotron 2, can jointly optimize duration and acoustic models, and realize high-fidelity synthesis compared with conventional duration-acoustic pipeline models, these involve a risk that speech samples cannot be sometimes successfully synthesized due to the attention prediction errors. Therefore, these seq2seq models cannot be directly introduced in practical TTS systems. On the other hand, the conventional pipeline models are broadly used in practical TTS systems since there are few crucial prediction errors in the duration model. For realizing high-quality practical TTS systems without attention prediction errors, this paper investigates Tacotron-based acoustic models with phoneme alignment instead of attention. The phoneme durations are first obtained from HMM-based forced alignment and the duration model is a simple bidirectional LSTM-based network. Then, a seq2seq model with forced alignment instead of attention is investigated and an alternative model with Tacotron decoder and phoneme duration is proposed. The results of experiments with full-context label input using WaveGlow vocoder indicate that the proposed model can realize a high-fidelity TTS system for Japanese with a real-time factor of 0.13 using a GPU without attention prediction errors compared with the seq2seq models.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU46091.2019.9003956","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 17

Abstract

Although sequence-to-sequence (seq2seq) models with an attention mechanism in neural text-to-speech (TTS) systems, such as Tacotron 2, can jointly optimize the duration and acoustic models and achieve higher-fidelity synthesis than conventional duration-acoustic pipeline models, they carry the risk that speech samples sometimes cannot be successfully synthesized because of attention prediction errors. Consequently, such seq2seq models cannot be directly introduced into practical TTS systems. Conventional pipeline models, on the other hand, are widely used in practical TTS systems because the duration model rarely makes critical prediction errors. To realize high-quality practical TTS systems free of attention prediction errors, this paper investigates Tacotron-based acoustic models that use phoneme alignment instead of attention. Phoneme durations are first obtained from HMM-based forced alignment, and the duration model is a simple bidirectional LSTM-based network. A seq2seq model with forced alignment instead of attention is then investigated, and an alternative model combining the Tacotron decoder with phoneme durations is proposed. Experiments with full-context label input and a WaveGlow vocoder indicate that, unlike the seq2seq models, the proposed model realizes a high-fidelity Japanese TTS system with no attention prediction errors and a real-time factor of 0.13 on a GPU.
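The abstract describes two components: a bidirectional-LSTM duration model trained on HMM forced-alignment durations, and a Tacotron-style decoder conditioned by duration-based upsampling of phoneme encodings in place of attention. The sketch below is a minimal illustration of those two pieces, not the authors' implementation; it assumes PyTorch, and all names and dimensions (DurationModel, upsample_by_duration, the 512-dim encodings) are hypothetical.

```python
# Minimal sketch (not the paper's code) of a BLSTM duration model and of
# replacing attention with duration-based upsampling of phoneme encodings.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class DurationModel(nn.Module):
    """Predicts a per-phoneme duration (in acoustic frames) with a BLSTM."""

    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)

    def forward(self, phoneme_feats: torch.Tensor) -> torch.Tensor:
        # phoneme_feats: (batch, n_phonemes, in_dim) linguistic features
        h, _ = self.blstm(phoneme_feats)
        return self.proj(h).squeeze(-1)  # (batch, n_phonemes) frame counts


def upsample_by_duration(enc: torch.Tensor,
                         durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme encoding for its duration, replacing attention.

    enc:       (n_phonemes, enc_dim) encoder outputs for one utterance
    durations: (n_phonemes,) integer frame counts (from forced alignment at
               training time, from the duration model at synthesis time)
    returns:   (total_frames, enc_dim) frame-level decoder conditioning
    """
    return torch.repeat_interleave(enc, durations, dim=0)


if __name__ == "__main__":
    enc = torch.randn(5, 512)            # 5 phonemes, 512-dim encodings
    dur = torch.tensor([3, 7, 4, 6, 2])  # frames per phoneme
    frames = upsample_by_duration(enc, dur)
    print(frames.shape)                  # torch.Size([22, 512])
```

Because every decoder frame is tied to a specific phoneme by the alignment, there is no attention step that can fail, which is the property the paper relies on for practical TTS deployment.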