David Guennec, Lily Wadoux, A. Sini, N. Barbot, Damien Lolive
12th ISCA Speech Synthesis Workshop (SSW2023), published 2023-08-26. DOI: https://doi.org/10.21437/ssw.2023-27
Voice Cloning: Training Speaker Selection with Limited Multi-Speaker Corpus
Text-to-speech synthesis with little data is a challenging task, particularly when the target speaker cannot be chosen. Voice cloning is a popular method to alleviate these issues using only a few minutes of target speech. To do this, the model must first be trained on a large corpus of thousands of hours and hundreds of speakers. In this paper, we tackle the challenge of cloning voices with a much smaller corpus, using both the speaker adaptation and speaker encoding methods. We study the impact of selecting our training speakers based on their similarity to the targets: from a pool of 14 speakers, we train models using only the training speakers closest to, or farthest from, our targets in terms of speaker similarity. We show that the selection of speakers in the training set affects the similarity to the target speaker, and that the effect is more prominent for speaker encoding than for adaptation. For naturalness, however, the picture remains nuanced.
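The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how speaker similarity is measured, so this example assumes each speaker is represented by a fixed-size embedding and that similarity is the cosine similarity averaged over all targets. The function names, the embedding dimension, and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_training_speakers(pool, targets):
    """Return pool speaker ids sorted from most to least similar to the targets.

    pool and targets map speaker id -> embedding (list of floats).
    Similarity is averaged over all targets (an assumption of this sketch).
    """
    def mean_similarity(embedding):
        sims = [cosine_similarity(embedding, t) for t in targets.values()]
        return sum(sims) / len(sims)

    return sorted(pool, key=lambda spk: mean_similarity(pool[spk]), reverse=True)


def select_speakers(pool, targets, k, mode="closest"):
    """Pick the k training speakers closest to or farthest from the targets."""
    if not 1 <= k <= len(pool):
        raise ValueError("k must be between 1 and the pool size")
    ranked = rank_training_speakers(pool, targets)
    if mode == "closest":
        return ranked[:k]
    if mode == "farthest":
        return ranked[-k:]
    raise ValueError("mode must be 'closest' or 'farthest'")


# Toy data: 14 hypothetical training speakers with random 8-dimensional embeddings.
rng = random.Random(0)
pool = {f"spk{i:02d}": [rng.gauss(0, 1) for _ in range(8)] for i in range(14)}
targets = {"target_a": [rng.gauss(0, 1) for _ in range(8)]}

closest = select_speakers(pool, targets, k=4, mode="closest")
farthest = select_speakers(pool, targets, k=4, mode="farthest")
print("closest:", closest)
print("farthest:", farthest)
```

In a setup like this, one model would be trained on the `closest` subset and another on the `farthest` subset, and the cloned voices would then be compared for similarity to the target and for naturalness. The subset size `k=4` is an arbitrary choice for illustration.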