{"title":"利用语音特征进行资源稀缺的文本到语音的跨语言迁移","authors":"J. A. Louw","doi":"10.21437/ssw.2023-9","DOIUrl":null,"url":null,"abstract":"In this work, we explore the use of phonological features in cross-lingual transfer within resource-scarce settings. We modify the architecture of VITS to accept a phonological feature vector as input, instead of phonemes or characters. Subsequently, we train multispeaker base models using data from LibriTTS and then fine-tune them on single-speaker Afrikaans and isiXhosa datasets of varying sizes, representing the resourcescarce setting. We evaluate the synthetic speech both objectively and subjectively and compare it to models trained with the same data using the standard VITS architecture. In our experiments, the proposed system utilizing phonological features as input converges significantly faster and requires less data than the base system. We demonstrate that the model employing phonological features is capable of producing sounds in the target language that were unseen in the source language, even in languages with significant linguistic differences, and with only 5 minutes of data in the target language.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cross-lingual transfer using phonological features for resource-scarce text-to-speech\",\"authors\":\"J. A. Louw\",\"doi\":\"10.21437/ssw.2023-9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we explore the use of phonological features in cross-lingual transfer within resource-scarce settings. We modify the architecture of VITS to accept a phonological feature vector as input, instead of phonemes or characters. Subsequently, we train multispeaker base models using data from LibriTTS and then fine-tune them on single-speaker Afrikaans and isiXhosa datasets of varying sizes, representing the resourcescarce setting. We evaluate the synthetic speech both objectively and subjectively and compare it to models trained with the same data using the standard VITS architecture. In our experiments, the proposed system utilizing phonological features as input converges significantly faster and requires less data than the base system. 
We demonstrate that the model employing phonological features is capable of producing sounds in the target language that were unseen in the source language, even in languages with significant linguistic differences, and with only 5 minutes of data in the target language.\",\"PeriodicalId\":346639,\"journal\":{\"name\":\"12th ISCA Speech Synthesis Workshop (SSW2023)\",\"volume\":\"55 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-08-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"12th ISCA Speech Synthesis Workshop (SSW2023)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/ssw.2023-9\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"12th ISCA Speech Synthesis Workshop (SSW2023)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/ssw.2023-9","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cross-lingual transfer using phonological features for resource-scarce text-to-speech
In this work, we explore the use of phonological features in cross-lingual transfer within resource-scarce settings. We modify the architecture of VITS to accept a phonological feature vector as input, instead of phonemes or characters. Subsequently, we train multi-speaker base models using data from LibriTTS and then fine-tune them on single-speaker Afrikaans and isiXhosa datasets of varying sizes, representing the resource-scarce setting. We evaluate the synthetic speech both objectively and subjectively and compare it to models trained with the same data using the standard VITS architecture. In our experiments, the proposed system utilizing phonological features as input converges significantly faster and requires less data than the base system. We demonstrate that the model employing phonological features is capable of producing sounds in the target language that were unseen in the source language, even in languages with significant linguistic differences, and with only 5 minutes of data in the target language.
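The abstract does not include an implementation, but the core architectural change it describes, feeding a phonological feature vector to the text encoder instead of a phoneme or character ID, can be sketched roughly as follows in Python/PyTorch. The toy feature table, feature set, class names, and dimensions below are illustrative assumptions, not the authors' code; in practice the feature vectors would come from a full phonological feature inventory covering both source and target languages.

    # Illustrative sketch only: converts a phoneme sequence to phonological
    # feature vectors and projects them into the model's hidden space with a
    # linear layer instead of an embedding lookup. Feature table, feature set,
    # and dimensions are hypothetical, not taken from the paper.
    import torch
    import torch.nn as nn

    # Toy binary feature table (hypothetical): [syllabic, voiced, nasal, labial, high]
    FEATURE_TABLE = {
        "a": [1, 1, 0, 0, 0],
        "i": [1, 1, 0, 0, 1],
        "m": [0, 1, 1, 1, 0],
        "p": [0, 0, 0, 1, 0],
    }
    NUM_FEATURES = 5
    HIDDEN_DIM = 192  # hidden size used here for illustration

    def phonemes_to_features(phonemes):
        """Convert a phoneme sequence into a (T, NUM_FEATURES) float tensor."""
        return torch.tensor([FEATURE_TABLE[p] for p in phonemes], dtype=torch.float32)

    class FeatureInputProjection(nn.Module):
        """Stands in for an nn.Embedding(num_phonemes, hidden) input layer:
        projects phonological feature vectors into the encoder's hidden space,
        so phonemes unseen in the source language can still be represented
        through the features they share with seen phonemes."""
        def __init__(self, num_features=NUM_FEATURES, hidden_dim=HIDDEN_DIM):
            super().__init__()
            self.proj = nn.Linear(num_features, hidden_dim)

        def forward(self, feature_vectors):   # (batch, T, num_features)
            return self.proj(feature_vectors) # (batch, T, hidden_dim)

    if __name__ == "__main__":
        feats = phonemes_to_features(["m", "a", "p", "i"]).unsqueeze(0)  # (1, 4, 5)
        layer = FeatureInputProjection()
        print(layer(feats).shape)  # torch.Size([1, 4, 192])

Because unseen target-language phonemes still decompose into features shared with the source language, such a projection can produce a usable representation for them without adding new embedding entries, which is the property the paper exploits for cross-lingual transfer with very little target-language data.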