StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models.

Advances in Neural Information Processing Systems. Pub Date: 2023-12-01. Epub Date: 2023-12-10.

Yinghao Aaron Li, Cong Han, Vinay S Raghavan, Gavin Mischler, Nima Mesgarani

Volume 36, pages 19594-19621. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11759097/pdf/


Abstract

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset, as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single-speaker and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.
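The idea of treating the style as a latent random variable sampled by a diffusion model, conditioned on the text alone, can be illustrated with a toy sketch. This is not the StyleTTS 2 implementation: the DDPM-style ancestral sampler, the linear noise schedule, the dimensions, the projection `W`, and the closed-form `toy_noise_predictor` (which stands in for a trained network by assuming styles follow a Gaussian centred on `W @ text_emb`) are all assumptions made for illustration, and the paper's actual sampler and networks may differ.

```python
import numpy as np

STYLE_DIM = 4      # hypothetical style-vector size
TEXT_DIM = 3       # hypothetical text-embedding size
NUM_STEPS = 100    # number of reverse diffusion steps (assumption)
TARGET_STD = 0.5   # spread of the toy "true" style distribution

# Linear noise schedule (an assumption; not taken from the paper).
betas = np.linspace(1e-3, 0.2, NUM_STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Hypothetical fixed projection from a text embedding to the mean style.
W = np.random.default_rng(0).normal(size=(STYLE_DIM, TEXT_DIM))


def toy_noise_predictor(x_t, t, text_emb):
    """Closed-form E[eps | x_t] when styles ~ N(W @ text_emb, TARGET_STD^2 I).

    A trained neural network would play this role in a real system.
    """
    mu = W @ text_emb
    a_bar = alpha_bars[t]
    var_t = a_bar * TARGET_STD**2 + (1.0 - a_bar)
    return np.sqrt(1.0 - a_bar) * (x_t - np.sqrt(a_bar) * mu) / var_t


def sample_styles(text_emb, n, seed):
    """Draw n style vectors for one text, with no reference speech involved."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, STYLE_DIM))  # start from pure Gaussian noise
    for t in range(NUM_STEPS - 1, -1, -1):
        eps_hat = toy_noise_predictor(x, t, text_emb)
        # DDPM posterior-mean update.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Inject fresh noise on every step except the last.
            x = x + np.sqrt(betas[t]) * rng.normal(size=x.shape)
    return x


text_emb = np.array([1.0, -0.5, 0.25])  # made-up text embedding
styles = sample_styles(text_emb, n=2000, seed=1)
```

Each row of `styles` is a different plausible style for the same text, which mirrors the diversity the abstract attributes to diffusion-based sampling; because only a small fixed-size vector is diffused rather than a waveform or spectrogram, each sampling step is cheap.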

