HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

arXiv - CS - Sound | Pub Date: 2024-04-06 | DOI: arxiv-2404.04645

Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria



Abstract

Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While TTS architectures that train and test on the same set of speakers have improved significantly, performance on out-of-domain speakers remains severely limited. Domain adaptation to a new set of speakers can be achieved by fine-tuning the whole model for each new domain, which is parameter-inefficient. Adapters address this problem by providing a parameter-efficient alternative for domain adaptation. Although Adapters are popular in NLP, speech synthesis has not seen much improvement from them. In this work, we present HyperTTS, which comprises a small learnable network, a "hypernetwork", that generates the parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and make them dynamic. Extensive evaluations in two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS against baselines in different studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.
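The core idea in the abstract, a small hypernetwork that emits Adapter weights from a speaker representation, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The class name `HyperAdapter`, the purely linear hypernetwork, the ReLU bottleneck adapter with a residual connection, and all dimensions are assumptions chosen for illustration.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class HyperAdapter:
    """Hypothetical bottleneck adapter whose weights come from a hypernetwork.

    Assumption: the hypernetwork is one linear map per generated matrix,
    taking a speaker embedding and returning the flattened adapter weights.
    """

    def __init__(self, hidden_dim, bottleneck_dim, speaker_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.bottleneck_dim = bottleneck_dim
        # Only these hypernetwork matrices would be trained; the TTS
        # backbone (not modeled here) would stay frozen.
        self.gen_down = rand_matrix(bottleneck_dim * hidden_dim, speaker_dim, rng)
        self.gen_up = rand_matrix(hidden_dim * bottleneck_dim, speaker_dim, rng)

    def generate(self, speaker_emb):
        """Produce speaker-specific down- and up-projection matrices."""
        h, b = self.hidden_dim, self.bottleneck_dim
        flat_down = matvec(self.gen_down, speaker_emb)
        flat_up = matvec(self.gen_up, speaker_emb)
        w_down = [flat_down[i * h:(i + 1) * h] for i in range(b)]  # b x h
        w_up = [flat_up[i * b:(i + 1) * b] for i in range(h)]      # h x b
        return w_down, w_up

    def forward(self, hidden, speaker_emb):
        """Apply the generated adapter with a residual connection."""
        w_down, w_up = self.generate(speaker_emb)
        bottleneck = [max(0.0, v) for v in matvec(w_down, hidden)]  # ReLU
        delta = matvec(w_up, bottleneck)
        return [x + d for x, d in zip(hidden, delta)]


# Usage: the same hidden state is adapted with weights generated
# from two different (made-up) speaker embeddings.
adapter = HyperAdapter(hidden_dim=8, bottleneck_dim=2, speaker_dim=4)
hidden = [0.5] * 8
out_a = adapter.forward(hidden, [1.0, 0.0, 0.0, 0.0])
out_b = adapter.forward(hidden, [0.0, 1.0, 0.0, 0.0])
out_zero = adapter.forward(hidden, [0.0, 0.0, 0.0, 0.0])
w_down, w_up = adapter.generate([1.0, 0.0, 0.0, 0.0])
```

In this sketch the adapter weights are recomputed from the speaker embedding on every call, so one set of hypernetwork parameters serves all speakers. A static adapter would instead store a separate fixed pair of projection matrices for each domain.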

