Multi-Feature Cross-Lingual Transfer Learning Approach for Low-Resource Vietnamese Speech Synthesis
Authors: Zhi Qiao, Jian Yang, Zhan Wang
Published in: Proceedings of the 2023 3rd International Conference on Artificial Intelligence, Automation and Algorithms
Publication date: 2023-07-21
DOI: 10.1145/3611450.3611476 (https://doi.org/10.1145/3611450.3611476)
Abstract: Neural end-to-end speech synthesis systems can produce high-quality speech when sufficient training data is available, but languages with small datasets struggle to reach comparable quality and naturalness. Vietnamese is a tonal language of the Vietic branch of the Austroasiatic language family and is written with a spelling-based (alphabetic) writing system. To improve the quality and naturalness of synthesis under limited data, we first apply transfer learning to the Vietnamese acoustic model, leveraging the similarities in pronunciation and grammar between Mandarin Chinese and Vietnamese. Second, based on the prosodic characteristics of Vietnamese, we use a "speech-text" alignment tool to extract prosodic boundary information and add it to the training text sequences. Using FastSpeech2 as the experimental baseline, we also design and add a prosody embedding layer. Experimental results show that the model trained on text with prosodic markers expresses prosody better than the model trained on the original text. Furthermore, compared with the baseline, adding the prosody embedding layer improves the prosody of the synthesized speech and eliminates the need for marked text at synthesis time.
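
To make the prosodic-boundary step more concrete, the following is a minimal sketch of how pause information from a speech-text alignment could be turned into boundary markers in the training text. It is not the authors' implementation: the abstract does not name the alignment tool, the marker symbols, or any thresholds, so the `Interval` type, the `#1`/`#2` markers, and the pause thresholds below are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interval:
    """One aligned segment: a syllable, or a silence when label is empty.

    Hypothetical structure; a real aligner's output format may differ.
    """
    label: str
    start: float  # seconds
    end: float    # seconds


def add_prosody_markers(
    intervals: List[Interval],
    minor_pause: float = 0.05,   # assumed threshold, not from the paper
    major_pause: float = 0.25,   # assumed threshold, not from the paper
    minor_mark: str = "#1",      # assumed marker symbol
    major_mark: str = "#2",      # assumed marker symbol
) -> str:
    """Insert a boundary marker after each silence that is long enough."""
    tokens: List[str] = []
    for iv in intervals:
        if iv.label:
            tokens.append(iv.label)
            continue
        if not tokens:
            # Leading silence carries no boundary information.
            continue
        duration = iv.end - iv.start
        if duration >= major_pause:
            tokens.append(major_mark)
        elif duration >= minor_pause:
            tokens.append(minor_mark)
    return " ".join(tokens)


if __name__ == "__main__":
    aligned = [
        Interval("", 0.00, 0.10),
        Interval("xin", 0.10, 0.35),
        Interval("chào", 0.35, 0.70),
        Interval("", 0.70, 0.80),   # short pause -> minor boundary
        Interval("các", 0.80, 1.00),
        Interval("bạn", 1.00, 1.30),
        Interval("", 1.30, 1.80),   # long pause -> major boundary
    ]
    print(add_prosody_markers(aligned))
```

In this sketch the marked text would serve only as training input; per the abstract, the prosody embedding layer is what removes the need for marked text at synthesis time.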