Better human computer interaction by enhancing the quality of text-to-speech synthesis

2012 4th International Conference on Intelligent Human Computer Interaction (IHCI) Pub Date : 2012-12-01 DOI:10.1109/IHCI.2012.6481857

V. R. Reddy, K. S. Rao

Citations: 5

Abstract

In this paper, we propose high-quality prosody models that enhance the quality of text-to-speech (TTS) synthesis and thereby provide better human-computer interaction. In this study, prosody refers to the duration and intonation patterns of a sequence of syllables. The prosody models are developed using feedforward neural networks, and prosodic information is predicted from the linguistic and production constraints of syllables. The prediction accuracy of the proposed neural network based prosody models is compared objectively with that of the Classification and Regression Tree (CART) based prosody models used by Festival. Subjective listening tests are also performed to evaluate the quality of the speech synthesized by incorporating the predicted prosodic features. The evaluation studies show that the neural network models predict prosody more accurately than the CART-based models.
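The abstract does not state which objective measures are used to compare the prosody models. As a minimal sketch, assuming the common choices of mean absolute error, root mean squared error, and Pearson correlation between actual and predicted syllable durations, such a comparison could look like the following. The function name `objective_scores`, the labels `model_a_ms` and `model_b_ms`, and all duration values are hypothetical and made up for illustration; none of them come from the paper.

```python
import math


def objective_scores(actual, predicted):
    """Return MAE, RMSE, and Pearson correlation for two equal-length sequences.

    These three measures are an assumption; the abstract does not name its metrics.
    """
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    # Correlation is undefined when either sequence is constant.
    if var_a == 0 or var_p == 0:
        corr = float("nan")
    else:
        corr = cov / math.sqrt(var_a * var_p)
    return {"mae": mae, "rmse": rmse, "corr": corr}


# Made-up syllable durations in milliseconds (illustration only, not paper data).
actual_ms = [100.0, 200.0, 300.0, 400.0]
model_a_ms = [110.0, 190.0, 310.0, 390.0]  # hypothetical predictions, model A
model_b_ms = [130.0, 170.0, 330.0, 370.0]  # hypothetical predictions, model B

scores_a = objective_scores(actual_ms, model_a_ms)
scores_b = objective_scores(actual_ms, model_b_ms)
print("model A:", scores_a)
print("model B:", scores_b)
```

Lower error and higher correlation indicate predictions that track the reference durations more closely. The same function applies unchanged to intonation targets such as F0 values in Hz.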

