Cross-Utterance Conditioned VAE for Speech Generation

IF 4.1 · CAS Tier 2 (Computer Science) · JCR Q1 (Acoustics)
Yang Li;Cheng Yu;Guangzhi Sun;Weiqin Zu;Zheng Tian;Ying Wen;Wei Pan;Chao Zhang;Jun Wang;Yang Yang;Fanglei Sun
{"title":"Cross-Utterance Conditioned VAE for Speech Generation","authors":"Yang Li;Cheng Yu;Guangzhi Sun;Weiqin Zu;Zheng Tian;Ying Wen;Wei Pan;Chao Zhang;Jun Wang;Yang Yang;Fanglei Sun","doi":"10.1109/TASLP.2024.3453598","DOIUrl":null,"url":null,"abstract":"Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":null,"pages":null},"PeriodicalIF":4.1000,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10699460/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
Citations: 0

Abstract

Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible text-based speech editing such as deletion, insertion, and replacement. Experimental results on the LibriTTS dataset demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.
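
To make the abstract's core mechanism concrete: the cross-utterance CVAE is, at heart, a conditional VAE whose encoder and decoder are both conditioned on embeddings of the neighbouring utterances (e.g. from a pre-trained language model) and on a speaker representation. The sketch below is a minimal, hypothetical illustration of that structure in PyTorch, not the authors' implementation; the class name CrossUtteranceCVAE, the feature dimensions, and the use of a simple MLP over utterance-level prosody features are assumptions made for brevity, whereas the actual CUC-VAE S2 operates on mel spectrograms inside a full TTS/editing pipeline.

```python
# Minimal, hypothetical sketch of a cross-utterance conditioned VAE.
# NOT the authors' implementation: class/argument names and the choice of
# utterance-level prosody features are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossUtteranceCVAE(nn.Module):
    def __init__(self, prosody_dim=4, ctx_dim=768, spk_dim=256,
                 latent_dim=32, hidden_dim=256):
        super().__init__()
        self.latent_dim = latent_dim
        cond_dim = ctx_dim + spk_dim
        # Posterior encoder q(z | prosody, context, speaker), used in training.
        self.encoder = nn.Sequential(
            nn.Linear(prosody_dim + cond_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),  # -> [mu, log_var]
        )
        # Decoder p(prosody | z, context, speaker).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, prosody_dim),
        )

    def forward(self, prosody, ctx_emb, spk_emb):
        cond = torch.cat([ctx_emb, spk_emb], dim=-1)
        mu, log_var = self.encoder(
            torch.cat([prosody, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterise
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        recon_loss = F.mse_loss(recon, prosody)
        # KL to a standard normal prior; a conditional prior network is another
        # common design and closer in spirit to cross-utterance conditioning.
        kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
        return recon, recon_loss + kl

    @torch.no_grad()
    def sample(self, ctx_emb, spk_emb):
        # Inference: no ground-truth prosody, so draw z from the prior.
        cond = torch.cat([ctx_emb, spk_emb], dim=-1)
        z = torch.randn(cond.shape[0], self.latent_dim, device=cond.device)
        return self.decoder(torch.cat([z, cond], dim=-1))
```

At synthesis time the posterior encoder is discarded: z is drawn from the prior and decoded together with the cross-utterance and speaker embeddings, which is how the generated prosody comes to depend on the surrounding context rather than on the current sentence alone.
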
Source journal
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Categories: ACOUSTICS; ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 11.30
Self-citation rate: 11.10%
Articles published: 217
Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.