Vietnamese Speech Synthesis with End-to-End Model and Text Normalization

2020 7th NAFOSTED Conference on Information and Computer Science (NICS) Pub Date : 2020-11-26 DOI:10.1109/NICS51282.2020.9335905

D. Nhan, Nguyen Minh Tri, Cao Xuan Nam


Citations: 1

Abstract



Speech synthesis systems are becoming smarter and more natural thanks to the power of deep neural networks. However, each language has its own phonological and contextual characteristics, so we conducted experiments, gathered statistics, and applied Vietnamese phonetics to improve speech synthesis systems based on the Tacotron2 neural network. Our methods achieve 97% accuracy on the text normalization task, and the synthesized speech reaches a mean opinion score (MOS) of 3.97, approaching the 4.43 obtained by directly recorded voices. We also provide Vinorm, a library for normalizing Vietnamese text, and Viphoneme, a package that converts text into a phonetic format. This phonetic output is used as the input to the end-to-end neural network, which makes synthesis faster, more intelligent, and more natural than using character inputs.
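To make the idea of text normalization concrete, here is a minimal Python sketch of what such a step might do before text reaches a synthesizer. It is not Vinorm's API or the authors' method: the names `DIGIT_WORDS`, `spell_digits`, and `normalize_text` are hypothetical, and reading every number digit by digit is a simplifying assumption. A real normalizer would also handle quantities, dates, abbreviations, and context.

```python
import re

# Hypothetical digit-to-word table for illustration; not taken from Vinorm.
DIGIT_WORDS = {
    "0": "không", "1": "một", "2": "hai", "3": "ba", "4": "bốn",
    "5": "năm", "6": "sáu", "7": "bảy", "8": "tám", "9": "chín",
}


def spell_digits(match) -> str:
    """Read a run of ASCII digits one digit at a time (toy rule)."""
    return " ".join(DIGIT_WORDS[d] for d in match.group(0))


def normalize_text(text: str) -> str:
    """Toy normalizer: lowercase, spell out digits, collapse whitespace."""
    text = text.lower()
    # Only ASCII digits are matched, so every match is a key in DIGIT_WORDS.
    text = re.sub(r"[0-9]+", spell_digits, text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_text("Phòng  205"))  # prints "phòng hai không năm"
```

In a pipeline like the one the abstract describes, the normalized string would then be converted to phonemes and passed to the acoustic model; those later stages are outside the scope of this sketch.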

