HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria
arXiv - CS - Sound, published 2024-04-06. DOI: arxiv-2404.04645 (https://doi.org/arxiv-2404.04645)

Abstract

Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. TTS architectures that train and test on the same set of speakers have improved significantly, but performance on out-of-domain speakers remains severely limited. Adapting to a new set of speakers can be achieved by fine-tuning the whole model for each new domain, which is parameter-inefficient. Adapters offer a parameter-efficient alternative for domain adaptation. Although Adapters are popular in NLP, speech synthesis has not seen much improvement from them. In this work, we present HyperTTS, which comprises a small learnable network, a "hypernetwork", that generates the parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and make them dynamic. Extensive evaluations in two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS against baselines in separate studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.
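
The core idea in the abstract, a hypernetwork that emits Adapter weights from a speaker representation, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `HyperAdapter`, the single linear hypernetwork, the ReLU bottleneck, the residual connection, and all dimensions are assumptions chosen for illustration, and the abstract does not specify any of them. The released code at the repository above is the authoritative reference.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, value) for value in vector]


class HyperAdapter:
    """Hypothetical bottleneck adapter whose weights come from a hypernetwork.

    Only the hypernetwork holds learnable parameters here; the adapter's
    down- and up-projection matrices are regenerated for every speaker.
    """

    def __init__(self, hidden_dim, bottleneck_dim, speaker_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.bottleneck_dim = bottleneck_dim
        # Assumed hypernetwork: one bias-free linear map from the speaker
        # embedding to the flattened down- and up-projection weights.
        n_outputs = 2 * hidden_dim * bottleneck_dim
        self.hyper = [
            [rng.gauss(0.0, 0.1) for _ in range(speaker_dim)]
            for _ in range(n_outputs)
        ]

    def generate(self, speaker_emb):
        """Return (down, up) adapter matrices for one speaker embedding."""
        flat = matvec(self.hyper, speaker_emb)
        h, b = self.hidden_dim, self.bottleneck_dim
        down = [flat[i * h:(i + 1) * h] for i in range(b)]  # b x h
        offset = b * h
        up = [flat[offset + i * b:offset + (i + 1) * b] for i in range(h)]  # h x b
        return down, up

    def forward(self, hidden, speaker_emb):
        """Apply the speaker-conditioned adapter with a residual connection."""
        down, up = self.generate(speaker_emb)
        bottleneck = relu(matvec(down, hidden))
        delta = matvec(up, bottleneck)
        return [x + d for x, d in zip(hidden, delta)]


# Toy usage: the same hidden state passed through adapters generated
# for two different (made-up) speaker embeddings.
adapter = HyperAdapter(hidden_dim=4, bottleneck_dim=2, speaker_dim=3)
hidden = [0.5, -1.0, 0.25, 2.0]
out_a = adapter.forward(hidden, [1.0, 0.0, 0.0])
out_b = adapter.forward(hidden, [0.0, 1.0, 0.0])
```

Because the adapter weights are a function of the speaker embedding, one shared hypernetwork can serve many speakers, whereas a static Adapter would need its weights retrained or stored per domain. In this sketch a zero speaker embedding yields zero adapter weights, so the layer reduces to the identity.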