Ye Tao, Jiawang Liu, Chaofeng Lu, Meng Liu, Xiugong Qin, Yunlong Tian, Yongjie Du
Neural Networks, Volume 188, Article 107432 (published 2025-04-12). DOI: 10.1016/j.neunet.2025.107432
CMDF-TTS: Text-to-speech method with limited target speaker corpus
While end-to-end Text-to-Speech (TTS) methods can generate high-quality speech from a limited target speaker corpus, they often require a non-target speaker corpus (auxiliary corpus) containing a substantial number of <text, speech> pairs to train the model, significantly increasing training costs. In this work, we propose a fast, high-quality speech synthesis approach that requires only a few target speaker recordings. Based on statistics, we analyze the roles of phonemes, function words, and utterance target domains in the corpus and propose a Statistical-based Compression Auxiliary Corpus algorithm (SCAC). It significantly improves model training speed without a noticeable decrease in speech naturalness. Next, we use the compressed corpus to train the proposed non-autoregressive model, CMDF-TTS, which uses a multi-level prosody modeling module to capture richer prosodic information and Denoising Diffusion Probabilistic Models (DDPMs) to generate mel-spectrograms. In addition, we fine-tune the model on the target speaker corpus to embed the speaker's characteristics, and apply a Conditional Variational Auto-Encoder Generative Adversarial Network (CVAE-GAN) to further enhance the quality of the synthesized speech. Experimental results on multiple Mandarin and English corpora demonstrate that the CMDF-TTS model, enhanced by the SCAC algorithm, effectively balances training speed and synthesized speech quality. Overall, its performance surpasses that of state-of-the-art models.
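The abstract does not give the details of SCAC, but the general idea of compressing an auxiliary corpus by statistical coverage can be illustrated with a minimal sketch. The code below is a hypothetical greedy phoneme-coverage selector, not the authors' actual SCAC algorithm: it keeps only the utterances needed to cover the phoneme inventory of the full corpus, which is one plausible way a statistics-driven compression could shrink training data without losing phonetic diversity. All function names and the toy corpus are invented for illustration.

```python
# Hypothetical sketch of statistics-based corpus compression (NOT the
# paper's SCAC algorithm): greedily select utterances until every phoneme
# observed in the full auxiliary corpus is covered.

def compress_corpus(utterances, coverage_target=1.0):
    """Pick a subset of (text, phoneme-list) pairs that covers
    `coverage_target` of the corpus phoneme inventory, preferring
    utterances that contribute the most not-yet-covered phonemes."""
    all_phonemes = {p for _, phones in utterances for p in phones}
    needed = int(len(all_phonemes) * coverage_target)
    covered, selected = set(), []
    remaining = list(utterances)
    while len(covered) < needed and remaining:
        # Choose the utterance adding the most uncovered phonemes.
        best = max(remaining, key=lambda u: len(set(u[1]) - covered))
        if not set(best[1]) - covered:
            break  # nothing left adds new coverage
        selected.append(best)
        covered |= set(best[1])
        remaining.remove(best)
    return selected

# Toy corpus: three utterances over a five-phoneme inventory.
corpus = [
    ("text a", ["a", "b", "c"]),
    ("text b", ["a", "b"]),
    ("text c", ["d", "e"]),
]
subset = compress_corpus(corpus)
print(len(subset))  # 2 -- "text a" and "text c" already cover all five phonemes
```

In this toy example the second utterance is redundant (its phonemes are a subset of the first), so the compressed corpus drops it; the actual SCAC algorithm additionally weighs function words and utterance target domains, per the abstract.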
Journal introduction:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.