Adversarial and Sequential Training for Cross-lingual Prosody Transfer TTS
Min-Kyung Kim, Joon-Hyuk Chang
Interspeech 2022, pages 4556-4560, published 2022-09-18. DOI: 10.21437/interspeech.2022-865
This study presents a method for improving the performance of a text-to-speech (TTS) model by using three global speech-style representations: language, speaker, and prosody. This makes it possible to synthesize speech in a speaker's voice in languages and prosodic styles other than the speaker's own. To construct the embedding of each representation conditioned on in the TTS model such that it is independent of the other representations, we propose an adversarial training method applicable to the general architecture of TTS models. Furthermore, we introduce a sequential training method that includes rehearsal-based continual learning, which allows training on complex and small amounts of data without forgetting previously learned information. The experimental results show that the proposed method generates good-quality speech with high speaker and prosody similarity, even for style combinations that a given speaker's data in the dataset does not contain.
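The adversarial training described in the abstract aims to make each style embedding (e.g. the speaker embedding) uninformative about the other factors (e.g. language). The paper does not specify the mechanism in the abstract; a common realization, sketched below under that assumption, is a gradient-reversal scheme: an auxiliary classifier tries to predict the language from the speaker embedding, while the embedding is updated with the reversed gradient so that language information is removed. All names and the numpy setup here are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def adversary_loss_and_grads(W, e, y):
    """Cross-entropy of a linear language classifier on embedding e.
    Returns (loss, grad wrt classifier W, grad wrt embedding e)."""
    p = softmax(W @ e)            # predicted language distribution
    loss = -np.log(p[y])
    d = p.copy()
    d[y] -= 1.0                   # dL/dlogits = p - onehot(y)
    return loss, np.outer(d, e), W.T @ d

dim, n_lang = 16, 3
W = rng.normal(scale=0.1, size=(n_lang, dim))  # adversarial classifier
e = rng.normal(size=dim)                        # speaker embedding
y = 1                                           # true language label

for _ in range(50):
    loss, gW, ge = adversary_loss_and_grads(W, e, y)
    W -= 0.1 * gW   # adversary descends: learn to predict the language
    e += 0.1 * ge   # embedding ascends (reversed gradient): hide it
```

In a full TTS model the ascent step on `e` would instead flow through a gradient-reversal layer into the embedding table, and the same game would be played for each pair of entangled factors.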
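The sequential training stage relies on rehearsal-based continual learning: a small memory of examples from earlier training stages is replayed alongside each new batch so that previously learned speakers and languages are not forgotten. The abstract does not state the buffer policy; the sketch below assumes reservoir sampling (a standard choice) and hypothetical names throughout.

```python
import random

class RehearsalBuffer:
    """Fixed-size memory of past training examples for rehearsal-based
    continual learning. Reservoir sampling keeps a uniform random
    subset of everything seen so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.memory = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.memory) < self.capacity:
            self.memory.append(example)
        else:
            # Keep the new example with probability capacity/seen,
            # so every example ever seen is equally likely to remain.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.memory[j] = example

    def sample(self, k):
        """Draw a rehearsal mini-batch to mix into a new-task batch."""
        return self.rng.sample(self.memory, min(k, len(self.memory)))


def sequential_train(tasks, buffer, train_step, replay_size=4):
    """Train on tasks in sequence (e.g. per-language datasets),
    mixing replayed old examples into every new batch."""
    for task in tasks:
        for batch in task:
            replay = buffer.sample(replay_size)
            train_step(batch + replay)   # old and new data together
            for ex in batch:
                buffer.add(ex)
```

Each `train_step` call would correspond to one optimizer update of the TTS model; the replayed examples provide the "rehearsal" that counters catastrophic forgetting across stages.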