USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Wenbin Wang, Yang Song, Sanjay Jha
arXiv - CS - Sound · Published 2024-04-28 · doi:arxiv-2404.18094
Citations: 0

Abstract

Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot or few-shot speaker-adaptive TTS approaches have been explored, they have many limitations. Zero-shot approaches tend to suffer from insufficient generalization performance to reproduce the voice of speakers with heavy accents. While few-shot methods can reproduce highly varying accents, they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches only provide either zero-shot or few-shot adaptation, constraining their utility across varied real-world scenarios with different demands. Besides, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting a vast portion of non-native speakers with diverse accents. Our proposed framework unifies both zero-shot and few-shot speaker adaptation strategies, which we term as "instant" and "fine-grained" adaptations based on their merits. To alleviate the insufficient generalization performance observed in zero-shot speaker adaptation, we designed two innovative discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce storage implications for few-shot speaker adaptation, we designed two adapters and a unique adaptation procedure.
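The few-shot ("fine-grained") branch uses small adapters so that only a few parameters are stored and trained per speaker, which is how the framework limits storage cost and the risk of catastrophic forgetting. The abstract does not give implementation details, so the following is a minimal, hypothetical sketch of a generic residual bottleneck adapter (not USAT's actual module, and in plain Python rather than a deep-learning framework) that illustrates the parameter-count argument: the frozen backbone is untouched, and the per-speaker state is two small matrices.

```python
import random

def linear(x, W, b):
    # y = W x + b, with W stored as a list of rows.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, u) for u in v]

def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class BottleneckAdapter:
    """Residual adapter: h + up(relu(down(h))).

    Only these two small projections would be trained and stored per
    speaker; the backbone TTS model stays frozen and shared.
    """
    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = random.Random(seed)
        self.W_down = make_matrix(d_bottleneck, d_model, rng)
        self.b_down = [0.0] * d_bottleneck
        self.W_up = make_matrix(d_model, d_bottleneck, rng)
        self.b_up = [0.0] * d_model

    def __call__(self, h):
        z = relu(linear(h, self.W_down, self.b_down))
        # Residual connection keeps the frozen backbone's output intact.
        return [hi + ui for hi, ui in zip(h, linear(z, self.W_up, self.b_up))]

    def num_params(self):
        return (len(self.W_down) * len(self.W_down[0]) + len(self.b_down)
                + len(self.W_up) * len(self.W_up[0]) + len(self.b_up))

# Illustrative sizes (hypothetical, not taken from the paper).
d_model, d_bottleneck = 256, 16
adapter = BottleneckAdapter(d_model, d_bottleneck)
h = [0.5] * d_model
out = adapter(h)

full_layer = d_model * d_model + d_model  # one dense layer of the backbone
print(len(out))                           # → 256 (same dimensionality as input)
print(adapter.num_params())               # → 8464 (256*16 + 16 + 16*256 + 256)
print(full_layer)                         # → 65792
```

With a bottleneck of 16 against a model width of 256, the per-speaker adapter holds roughly 13% of the parameters of even a single dense backbone layer, which is the storage advantage the abstract refers to; freezing the backbone is likewise the standard way adapter methods avoid catastrophic forgetting.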