{"title":"USAT: A Universal Speaker-Adaptive Text-to-Speech Approach","authors":"Wenbin Wang, Yang Song, Sanjay Jha","doi":"arxiv-2404.18094","DOIUrl":null,"url":null,"abstract":"Conventional text-to-speech (TTS) research has predominantly focused on\nenhancing the quality of synthesized speech for speakers in the training\ndataset. The challenge of synthesizing lifelike speech for unseen,\nout-of-dataset speakers, especially those with limited reference data, remains\na significant and unresolved problem. While zero-shot or few-shot\nspeaker-adaptive TTS approaches have been explored, they have many limitations.\nZero-shot approaches tend to suffer from insufficient generalization\nperformance to reproduce the voice of speakers with heavy accents. While\nfew-shot methods can reproduce highly varying accents, they bring a significant\nstorage burden and the risk of overfitting and catastrophic forgetting. In\naddition, prior approaches only provide either zero-shot or few-shot\nadaptation, constraining their utility across varied real-world scenarios with\ndifferent demands. Besides, most current evaluations of speaker-adaptive TTS\nare conducted only on datasets of native speakers, inadvertently neglecting a\nvast portion of non-native speakers with diverse accents. Our proposed\nframework unifies both zero-shot and few-shot speaker adaptation strategies,\nwhich we term as \"instant\" and \"fine-grained\" adaptations based on their\nmerits. To alleviate the insufficient generalization performance observed in\nzero-shot speaker adaptation, we designed two innovative discriminators and\nintroduced a memory mechanism for the speech decoder. 
To prevent catastrophic\nforgetting and reduce storage implications for few-shot speaker adaptation, we\ndesigned two adapters and a unique adaptation procedure.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2404.18094","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Conventional text-to-speech (TTS) research has predominantly focused on
enhancing the quality of synthesized speech for speakers in the training
dataset. The challenge of synthesizing lifelike speech for unseen,
out-of-dataset speakers, especially those with limited reference data, remains
a significant and unresolved problem. While zero-shot or few-shot
speaker-adaptive TTS approaches have been explored, each has significant limitations.
Zero-shot approaches often generalize poorly, failing to reproduce the
voices of speakers with heavy accents. Although
few-shot methods can reproduce highly varying accents, they bring a significant
storage burden and the risk of overfitting and catastrophic forgetting. In
addition, prior approaches only provide either zero-shot or few-shot
adaptation, constraining their utility across varied real-world scenarios with
different demands. Moreover, most existing evaluations of speaker-adaptive TTS
are conducted only on datasets of native speakers, inadvertently neglecting a
vast portion of non-native speakers with diverse accents. Our proposed
framework unifies both zero-shot and few-shot speaker adaptation strategies,
which we term as "instant" and "fine-grained" adaptations based on their
merits. To alleviate the insufficient generalization performance observed in
zero-shot speaker adaptation, we designed two innovative discriminators and
introduced a memory mechanism for the speech decoder. To prevent catastrophic
forgetting and reduce the storage overhead of few-shot speaker adaptation, we
designed two adapters and a unique adaptation procedure.
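The abstract does not specify how the decoder's memory mechanism works. A minimal sketch of one common realization, attention over a bank of learned memory slots whose readout is folded back into the decoder state, is shown below; the function name, dimensions, and residual combination are all illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(h, memory):
    """Attend from a decoder hidden state h (d,) over a memory bank
    memory (m, d) and fold the readout back into h via a residual."""
    scores = memory @ h / np.sqrt(h.shape[0])  # (m,) scaled similarity scores
    weights = softmax(scores)                  # attention distribution over slots
    readout = weights @ memory                 # (d,) convex combination of slots
    return h + readout                         # residual enrichment of the state

rng = np.random.default_rng(0)
h = rng.standard_normal(16)            # one decoder timestep's hidden state
memory = rng.standard_normal((8, 16))  # 8 learned memory slots (frozen here)
out = memory_read(h, memory)
```

In a trained system the memory bank would be a learned parameter shared across speakers, letting the decoder retrieve generic speech patterns that a single short reference utterance cannot supply.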
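The adapters themselves are likewise not detailed in the abstract. A widely used pattern they plausibly resemble is the bottleneck adapter: a small down-project/up-project module inserted into a frozen backbone, so that only the adapter weights, not a full model copy, are stored per speaker. The sketch below is a generic illustration under that assumption; all sizes and names are hypothetical.

```python
import numpy as np

class BottleneckAdapter:
    """Low-rank residual module inserted into a frozen TTS backbone layer.
    Per speaker, only ~2*d*r adapter parameters are stored, rather than
    a full fine-tuned copy of the model."""
    def __init__(self, d, r, rng):
        self.w_down = rng.standard_normal((d, r)) * 0.01  # (d, r) down-projection
        self.w_up = np.zeros((r, d))  # zero init: adapter starts as the identity

    def __call__(self, h):
        # frozen-layer output h plus a low-rank, speaker-specific correction
        return h + np.maximum(h @ self.w_down, 0.0) @ self.w_up

rng = np.random.default_rng(0)
adapter = BottleneckAdapter(d=256, r=8, rng=rng)
h = rng.standard_normal((4, 256))  # a batch of frozen-backbone activations
out = adapter(h)                   # identical to h until the adapter is trained
```

Because the backbone stays frozen and the up-projection starts at zero, training the adapter cannot erase the base model's behavior at initialization, which is one standard way such designs mitigate catastrophic forgetting.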