Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems

arXiv - EE - Audio and Speech Processing. Pub Date: 2024-09-18. DOI: arxiv-2409.11915

Anusha Prakash, Hema A Murthy

Abstract

Sentences in Indian languages are generally longer than those in English. Indian languages are also considered to be phrase-based, wherein semantically complete phrases are concatenated to make up sentences. Long utterances lead to poor training of text-to-speech models and result in poor prosody during synthesis. In this work, we explore an inter-pausal unit (IPU) based approach in the end-to-end (E2E) framework, focusing on synthesising conversational-style text. We consider both autoregressive Tacotron2 and non-autoregressive FastSpeech2 architectures in our study and perform experiments with three Indian languages, namely, Hindi, Tamil and Telugu. With the IPU-based Tacotron2 approach, we see a reduction in insertion and deletion errors in the synthesised audio, providing an alternative approach to the FastSpeech(2) network in terms of error reduction. The IPU-based approach requires fewer computational resources and produces prosodically richer synthesis compared to conventional sentence-based systems.
