{"title":"基于Tacotron2的端到端文本到语音合成中有限数据的说话人自适应实验","authors":"A. Mandeel, M. Al-Radhi, T. Csapó","doi":"10.36244/icj.2022.3.7","DOIUrl":null,"url":null,"abstract":"Speech synthesis has the aim of generating humanlike speech from text. Nowadays, with end-to-end systems, highly natural synthesized speech can be achieved if a large enough dataset is available from the target speaker. However, often it would be necessary to adapt to a target speaker for whom only a few training samples are available. Limited data speaker adaptation might be a difficult problem due to the overly few training samples. Issues might appear with a limited speaker dataset, such as the irregular allocation of linguistic tokens (i.e., some speech sounds are left out from the synthesized speech). To build lightweight systems, measuring the number of minimum data samples and training epochs is crucial to acquire a reasonable quality. We conducted detailed experiments with four target speakers for adaptive speaker text-to-speech (TTS) synthesis to show the performance of the end-to-end Tacotron2 model and the WaveGlow neural vocoder with an English dataset at several training data samples and training lengths. According to our investigation of objective and subjective evaluations, the Tacotron2 model exhibits good performance in terms of speech quality and similarity for unseen target speakers at 100 sentences of data (pair of text and audio) with a relatively low training time.","PeriodicalId":0,"journal":{"name":"","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Speaker Adaptation Experiments with Limited Data for End-to-End Text-To-Speech Synthesis using Tacotron2\",\"authors\":\"A. Mandeel, M. Al-Radhi, T. Csapó\",\"doi\":\"10.36244/icj.2022.3.7\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Speech synthesis has the aim of generating humanlike speech from text. Nowadays, with end-to-end systems, highly natural synthesized speech can be achieved if a large enough dataset is available from the target speaker. However, often it would be necessary to adapt to a target speaker for whom only a few training samples are available. Limited data speaker adaptation might be a difficult problem due to the overly few training samples. Issues might appear with a limited speaker dataset, such as the irregular allocation of linguistic tokens (i.e., some speech sounds are left out from the synthesized speech). To build lightweight systems, measuring the number of minimum data samples and training epochs is crucial to acquire a reasonable quality. We conducted detailed experiments with four target speakers for adaptive speaker text-to-speech (TTS) synthesis to show the performance of the end-to-end Tacotron2 model and the WaveGlow neural vocoder with an English dataset at several training data samples and training lengths. 
According to our investigation of objective and subjective evaluations, the Tacotron2 model exhibits good performance in terms of speech quality and similarity for unseen target speakers at 100 sentences of data (pair of text and audio) with a relatively low training time.\",\"PeriodicalId\":0,\"journal\":{\"name\":\"\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0,\"publicationDate\":\"2022-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.36244/icj.2022.3.7\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.36244/icj.2022.3.7","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Speaker Adaptation Experiments with Limited Data for End-to-End Text-To-Speech Synthesis using Tacotron2
Speech synthesis aims to generate human-like speech from text. Nowadays, end-to-end systems can produce highly natural synthesized speech if a large enough dataset is available from the target speaker. Often, however, it is necessary to adapt to a target speaker for whom only a few training samples are available. Speaker adaptation with limited data is a difficult problem because of the scarcity of training samples. Issues can arise with a limited speaker dataset, such as irregular coverage of linguistic tokens (i.e., some speech sounds being left out of the synthesized speech). To build lightweight systems, determining the minimum number of data samples and training epochs is crucial for acquiring reasonable quality. We conducted detailed experiments with four target speakers for adaptive text-to-speech (TTS) synthesis to show the performance of the end-to-end Tacotron2 model and the WaveGlow neural vocoder on an English dataset at several training data sizes and training lengths. According to our objective and subjective evaluations, the Tacotron2 model exhibits good performance in terms of speech quality and similarity for unseen target speakers with 100 sentences of data (text-audio pairs) and a relatively short training time.
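The adaptation workflow summarized above can be illustrated with a minimal sketch: a pretrained Tacotron2 acoustic model is fine-tuned on roughly 100 text-audio pairs from the target speaker (the fine-tuning itself would be run with the reference implementation's training script), and the adapted checkpoint is then used for inference together with a pretrained WaveGlow vocoder. This sketch uses NVIDIA's public torch.hub entry points rather than the authors' exact setup; the checkpoint filename, the example sentence, and the output path are illustrative assumptions.

import os

import torch
from scipy.io.wavfile import write

# Pretrained Tacotron2 acoustic model and WaveGlow vocoder from NVIDIA's
# public torch.hub entry points (a CUDA-capable GPU is assumed).
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                           'nvidia_tacotron2', model_math='fp16')
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp16')

# Hypothetical checkpoint obtained by fine-tuning Tacotron2 on ~100
# text-audio pairs of the target speaker; the adaptation step itself is
# performed separately with the training script of the implementation used.
ADAPTED_CKPT = 'tacotron2_adapted_100_sentences.pt'
if os.path.exists(ADAPTED_CKPT):
    state = torch.load(ADAPTED_CKPT, map_location='cpu')
    tacotron2.load_state_dict(state.get('state_dict', state))

tacotron2 = tacotron2.to('cuda').eval()
waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()

# Text preprocessing helpers shipped alongside the hub models.
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence(
    ["Speaker adaptation with one hundred sentences."])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> mel-spectrogram
    audio = waveglow.infer(mel)                      # mel -> waveform

# Tacotron2/WaveGlow operate at a 22.05 kHz sampling rate.
write('adapted_speaker.wav', 22050, audio[0].float().cpu().numpy())

In this sketch only the Tacotron2 weights are adapted, while WaveGlow remains a speaker-independent pretrained vocoder; whether the vocoder is also adapted is not specified in the abstract and would depend on the experimental setup.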