Phoneme-level speaking rate variation on waveform generation using GAN-TTS

2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA). Pub Date: 2019-10-01. DOI: 10.1109/O-COCOSDA46868.2019.9060845

Mayuko Okamoto, S. Sakti, Satoshi Nakamura

Citations: 2

Abstract

Text-to-speech synthesis (TTS) systems continue to advance, and the naturalness of their generated speech has improved significantly. However, most TTS systems now learn from data using a deep learning framework and generate output at a monotonous speaking rate. In contrast, humans vary their speaking rate and tend to slow down to emphasize words, marking the elements of focus in an utterance.
