USS Directed E2E Speech Synthesis for Indian Languages

Sudhanshu Srivastava, H. Murthy
Published in: 2022 IEEE International Conference on Signal Processing and Communications (SPCOM), 2022-07-11
DOI: 10.1109/SPCOM55316.2022.9840801 (https://doi.org/10.1109/SPCOM55316.2022.9840801)
Citations: 2

Abstract

State-of-the-art end-to-end (E2E) text-to-speech synthesis systems produce highly intelligible speech, but they lack the timbre of Unit Selection Synthesis (USS) and do not perform well in low-resource environments. Moreover, the high synthesis quality of E2E is limited to read speech; for conversational speech synthesis, we observe missing words and artifacts. USS, on the other hand, not only produces exactly the speech specified by the text but also preserves the timbre. Combining the advantages of USS with the continuity property of E2E, this paper proposes a technique that combines classical USS with a neural-network-based E2E system to develop a hybrid model for Indian languages. The proposed system guides the USS system using the E2E system. Syllable-based USS and character-based E2E TTS systems are built. Mel-spectrograms of syllable-like units generated in the USS and E2E frameworks are compared, and the mel-spectrogram of the better unit is passed to the WaveGlow vocoder. A dataset of 5 Indian languages is used for the experiments. DMOS scores are obtained with the hybrid system for conversational speech utterances that were improperly synthesized in the vanilla E2E and USS frameworks, and an average absolute improvement of 0.3 is observed over the E2E system.
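The per-unit comparison described in the abstract can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the abstract does not say how the "better" unit is decided, so the scoring function here (`mean_frame_delta`, lower is better) is a placeholder assumption. The names `select_units`, `uss_units`, and `e2e_units` are likewise hypothetical. Each unit is a list of frames, each frame a list of mel-bin values, and the two systems are assumed to produce the same number of syllable-like units for the utterance.

```python
from typing import Callable, List, Tuple

Mel = List[List[float]]  # one unit: frames x mel bins


def mean_frame_delta(mel: Mel) -> float:
    """Placeholder score (assumption, not the paper's criterion).

    Mean absolute change between consecutive frames; lower is
    treated as better purely so that the sketch can run.
    """
    total, count = 0.0, 0
    for prev, cur in zip(mel, mel[1:]):
        for a, b in zip(prev, cur):
            total += abs(b - a)
            count += 1
    return total / count if count else 0.0


def select_units(
    uss_units: List[Mel],
    e2e_units: List[Mel],
    score: Callable[[Mel], float] = mean_frame_delta,
) -> Tuple[Mel, List[str]]:
    """Pick, unit by unit, the mel-spectrogram with the lower score.

    Returns the concatenated frames of the chosen units and a list
    recording which system each unit came from.
    """
    if len(uss_units) != len(e2e_units):
        raise ValueError("both systems must yield the same number of units")
    chosen: Mel = []
    sources: List[str] = []
    for uss, e2e in zip(uss_units, e2e_units):
        if score(uss) <= score(e2e):
            chosen.extend(uss)
            sources.append("USS")
        else:
            chosen.extend(e2e)
            sources.append("E2E")
    return chosen, sources


if __name__ == "__main__":
    # Toy data: two units, two frames each, two mel bins per frame.
    uss_units = [[[0.0, 0.0], [0.1, 0.1]], [[0.0, 0.0], [2.0, 2.0]]]
    e2e_units = [[[0.0, 0.0], [1.0, 1.0]], [[0.5, 0.5], [0.6, 0.6]]]
    mel, sources = select_units(uss_units, e2e_units)
    print(sources, len(mel))
```

In a full pipeline the concatenated frames would then be handed to a vocoder (WaveGlow in the paper). That step is omitted here because it depends on a trained model.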