Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

Impact factor 1.9 | Zone 3, Computer Science | JCR Q2, Acoustics

Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-05-28 DOI:10.1186/s13636-024-00351-9

Zhiyong Chen, Zhiqi Ai, Youxuan Ma, Xinnuo Li, Shugong Xu

Citations: 0

Abstract

Modern text-to-speech (TTS) systems can generate high-fidelity, human-like speech from a reference utterance, and voice cloning (VC), also called zero-shot TTS (ZS-TTS), stands out as an important subtask. A primary challenge in VC is maintaining speech quality and speaker similarity when only limited reference data is available for a specific speaker. Existing VC systems often rely on naive combinations of embedded speaker vectors for speaker control, which compromises the capture of speaking style, voice print, and semantic accuracy. To overcome this, we introduce the Two-branch Speaker Control Module (TSCM), a novel and highly adaptable voice cloning module designed to precisely process speaker or style control for a target speaker. Our method fuses local-level features from a Gated Convolutional Network (GCN) with utterance-level features from a gated recurrent unit (GRU) to enhance speaker control. We demonstrate the effectiveness of TSCM by integrating it into advanced TTS systems such as the FastSpeech 2 and VITS architectures, significantly improving their performance. Experimental results show that TSCM enables accurate voice cloning for a target speaker with minimal data, through either zero-shot adaptation or few-shot fine-tuning of pretrained TTS models. Furthermore, our TSCM-based VITS (TSCM-VITS) outperforms existing state-of-the-art VC systems in zero-shot scenarios, even with basic dataset configurations. These results are validated through comprehensive subjective and objective evaluations. A demonstration of our system is available at https://great-research.github.io/tsct-tts-demo/.
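The two-branch idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not say how the two branches are fused, so the concatenation step, the layer sizes, the random weights, and every function name below (`gated_conv1d`, `gru_last_state`, `two_branch_fusion`) are assumptions made only for illustration. The sketch runs a GLU-style gated convolution over reference frames for local-level features, a single-layer GRU for one utterance-level vector, and joins them per frame.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_conv1d(x, w_lin, w_gate):
    """GLU-style gated 1-D convolution over time with zero 'same' padding.
    x: (T, D_in); w_lin, w_gate: (K, D_in, D_out) with odd K. Returns (T, D_out)."""
    k = w_lin.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # One window of K consecutive frames per output frame: (T, K, D_in).
    windows = np.stack([xp[t:t + k] for t in range(x.shape[0])])
    lin = np.einsum("tkd,kdo->to", windows, w_lin)
    gate = np.einsum("tkd,kdo->to", windows, w_gate)
    return lin * sigmoid(gate)  # the gate scales each linear output

def gru_last_state(x, w, u, b):
    """Single-layer GRU; returns the final hidden state as an utterance vector.
    w: (3, D_in, H), u: (3, H, H), b: (3, H) for update, reset, candidate."""
    h = np.zeros(u.shape[-1])
    for x_t in x:
        z = sigmoid(x_t @ w[0] + h @ u[0] + b[0])              # update gate
        r = sigmoid(x_t @ w[1] + h @ u[1] + b[1])              # reset gate
        h_new = np.tanh(x_t @ w[2] + (r * h) @ u[2] + b[2])    # candidate state
        h = (1.0 - z) * h + z * h_new
    return h

def two_branch_fusion(ref_frames, p):
    """Assumed fusion: tile the utterance vector over time and concatenate."""
    local = gated_conv1d(ref_frames, p["w_lin"], p["w_gate"])    # (T, C)
    utt = gru_last_state(ref_frames, p["w"], p["u"], p["b"])     # (H,)
    utt_tiled = np.broadcast_to(utt, (local.shape[0], utt.shape[0]))
    return np.concatenate([local, utt_tiled], axis=-1)           # (T, C + H)

# Hypothetical sizes: 50 reference frames of 80-dim features.
rng = np.random.default_rng(0)
T, D, C, H, K = 50, 80, 16, 8, 3
params = {
    "w_lin": rng.normal(scale=0.1, size=(K, D, C)),
    "w_gate": rng.normal(scale=0.1, size=(K, D, C)),
    "w": rng.normal(scale=0.1, size=(3, D, H)),
    "u": rng.normal(scale=0.1, size=(3, H, H)),
    "b": np.zeros((3, H)),
}
fused = two_branch_fusion(rng.normal(size=(T, D)), params)
print(fused.shape)  # (50, 24)
```

In this sketch the first 16 channels vary from frame to frame (local detail), while the last 8 are identical across frames (a global speaker summary). A trained system would learn these weights and might well use a different fusion operator.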

Source journal

Eurasip Journal on Audio Speech and Music Processing (Acoustics; Engineering, Electrical & Electronic)

CiteScore: 4.10

Self-citation rate: 4.20%

Articles published: not listed

Review time: 12 months

About the journal: The aim of the EURASIP Journal on Audio, Speech, and Music Processing is to bring together researchers, scientists, and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. It is intended as an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.