Modeling vowel duration for Japanese text-to-speech synthesis

5th International Conference on Spoken Language Processing (ICSLP 1998). Publication date: 1998-11-30. DOI: 10.21437/ICSLP.1998-46

J. Venditti, J. van Santen

Citations: 11

Abstract

Accurate estimation of segmental durations is crucial for natural-sounding text-to-speech (TTS) synthesis. This paper presents a model of vowel duration used in the Bell Labs Japanese TTS system. We describe the constraints on vowel devoicing, and the effects of factors such as phone identity, surrounding phone identities, accentuation, syllabic structure, and phrasal position on the duration of both long and short vowels. A Sum-of-Products approach is used to model key interactions observed in the data, and to predict values of factor combinations not found in the speech database. We report root mean squared deviations between observed and predicted durations ranging from 8 to 15 ms, and an overall correlation of 0.89.
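To make the Sum-of-Products idea concrete, the following is a minimal sketch, not the model from the paper. It assumes the general form in which a predicted duration is a sum of terms, each term being a product of per-factor scale values. The factor names, the grouping into terms, and every number in `SCALES` are hypothetical and chosen only for illustration. The two helper functions show how a root mean squared deviation and a correlation between observed and predicted durations could be computed.

```python
import math

# Hypothetical scale tables (illustrative values only, not from the paper).
# Each dict is one product term; the model sums the terms.
SCALES = [
    {   # term 0: intrinsic vowel duration (ms) scaled by phrasal position
        "vowel": {"a": 80.0, "i": 60.0, "u": 55.0},
        "position": {"medial": 1.0, "final": 1.3},
    },
    {   # term 1: additive offset (ms) for accentuation
        "accent": {"accented": 5.0, "unaccented": 0.0},
    },
]


def predict_duration(factors, scales=SCALES):
    """Return sum over terms of the product of each factor's scale value."""
    total = 0.0
    for term in scales:
        product = 1.0
        for name, table in term.items():
            product *= table[factors[name]]
        total += product
    return total


def rms_deviation(observed, predicted):
    """Root mean squared deviation between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# A factor combination can be predicted even if it never occurred in the
# training data, because each factor contributes through its own scale.
example = {"vowel": "a", "position": "final", "accent": "accented"}
print(predict_duration(example))
```

Because each factor enters through its own scale table, a combination such as an accented phrase-final vowel receives a prediction as long as each factor level was seen somewhere, which is the property the abstract refers to when it mentions factor combinations not found in the speech database.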
