A Speaker-Adaptive HMM-based Vietnamese Text-to-Speech System

2019 11th International Conference on Knowledge and Systems Engineering (KSE). Pub Date: 2019-10-01. DOI: 10.1109/KSE.2019.8919326

Duy Khanh Ninh


Citations: 4

Abstract

This paper describes the first attempt at developing a Vietnamese HMM-based text-to-speech system using the speaker-adaptive approach. Although speaker-dependent systems have been widely built, no speaker-adaptive system has been developed for Vietnamese so far. We collected speech data from several native Vietnamese speakers and employed state-of-the-art speech analysis, model training, and speaker adaptation techniques to develop the system. We also performed perceptual experiments to compare the quality of speaker-adapted (SA) voices built on the average voice model with that of speaker-dependent (SD) voices built on SD models, and to confirm the effects of contextual features, including word boundary (WB) and part-of-speech (POS), on the quality of synthetic speech. Evaluation results show that SA voices have significantly higher naturalness than SD voices when the same limited contextual feature set, excluding WB and POS, is used. In addition, SA voices trained with limited contextual features excluding WB and POS still have better quality than SD voices trained with full contextual features including WB and POS. These results show the robustness of the speaker-adaptive approach over the speaker-dependent approach for Vietnamese statistical parametric speech synthesis.

