{"title":"美国指导的E2E语音合成印度语言","authors":"Sudhanshu Srivastava, H. Murthy","doi":"10.1109/SPCOM55316.2022.9840801","DOIUrl":null,"url":null,"abstract":"The state-of-the-art end-to-end (E2E) text-to-speech synthesis systems produce highly intelligible speech. But they lack the timbre of Unit Selection Synthesis (USS) and do not perform well in a low-resource environment. Moreover, the high synthesis quality of E2E is limited to read speech. But for conversational speech synthesis, we observe the problem of missing words and the creation of artifacts. On the other hand, USS not only produces the exact speech according to the text but also preserves the timbre. Combining the advantages of USS and the continuity property of E2E, this paper proposes a technique to combine the classical USS with the neural-network-based E2E system to develop a hybrid model for Indian languages.The proposed system guides the USS system using the E2E system. Syllable-based USS and character-based E2E TTS systems are built. Mel spectrograms of syllable-like units generated in the USS and E2E frameworks are compared, and the mel-spectrogram of the better unit is used in the waveglow vocoder. A dataset of 5 Indian languages is used for the experiments. DMOS scores are obtained for conversational speech utterances improperly synthesized in the vanilla E2E and USS frameworks using the Hybrid system and an average absolute improvement of 0.3 is observed over the E2E system.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"USS Directed E2E Speech Synthesis For Indian Languages\",\"authors\":\"Sudhanshu Srivastava, H. Murthy\",\"doi\":\"10.1109/SPCOM55316.2022.9840801\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The state-of-the-art end-to-end (E2E) text-to-speech synthesis systems produce highly intelligible speech. But they lack the timbre of Unit Selection Synthesis (USS) and do not perform well in a low-resource environment. Moreover, the high synthesis quality of E2E is limited to read speech. But for conversational speech synthesis, we observe the problem of missing words and the creation of artifacts. On the other hand, USS not only produces the exact speech according to the text but also preserves the timbre. Combining the advantages of USS and the continuity property of E2E, this paper proposes a technique to combine the classical USS with the neural-network-based E2E system to develop a hybrid model for Indian languages.The proposed system guides the USS system using the E2E system. Syllable-based USS and character-based E2E TTS systems are built. Mel spectrograms of syllable-like units generated in the USS and E2E frameworks are compared, and the mel-spectrogram of the better unit is used in the waveglow vocoder. A dataset of 5 Indian languages is used for the experiments. 
DMOS scores are obtained for conversational speech utterances improperly synthesized in the vanilla E2E and USS frameworks using the Hybrid system and an average absolute improvement of 0.3 is observed over the E2E system.\",\"PeriodicalId\":246982,\"journal\":{\"name\":\"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)\",\"volume\":\"99 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SPCOM55316.2022.9840801\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPCOM55316.2022.9840801","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
USS Directed E2E Speech Synthesis For Indian Languages
State-of-the-art end-to-end (E2E) text-to-speech synthesis systems produce highly intelligible speech, but they lack the timbre of Unit Selection Synthesis (USS) and do not perform well in low-resource settings. Moreover, the high synthesis quality of E2E systems is largely limited to read speech; for conversational speech, we observe missing words and synthesis artifacts. USS, on the other hand, not only produces speech that follows the text exactly but also preserves the speaker's timbre. Combining the advantages of USS with the continuity property of E2E synthesis, this paper proposes a technique that combines classical USS with a neural-network-based E2E system to develop a hybrid model for Indian languages. The proposed system guides the USS system using the E2E system. Syllable-based USS and character-based E2E TTS systems are built. Mel spectrograms of the syllable-like units generated in the USS and E2E frameworks are compared, and the mel spectrogram of the better unit is passed to a WaveGlow vocoder. A dataset of five Indian languages is used for the experiments. DMOS scores are obtained with the hybrid system for conversational utterances that are improperly synthesized by the vanilla E2E and USS frameworks, and an average absolute improvement of 0.3 over the E2E system is observed.
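The per-unit comparison described in the abstract can be pictured with the following minimal Python/NumPy sketch: for each syllable-like unit, the USS and E2E mel spectrograms are compared and the preferred one is kept before vocoding. The selection criterion (a frame-wise spectral distance to a hypothetical reference mel spectrogram) and all function and variable names are assumptions made for illustration; the paper does not specify them here.

import numpy as np

def spectral_distance(mel_a: np.ndarray, mel_b: np.ndarray) -> float:
    """Mean frame-wise Euclidean distance between two mel spectrograms
    of shape (n_mels, n_frames), truncated to the shorter of the two.
    This distance is an assumed stand-in for the paper's comparison metric."""
    n = min(mel_a.shape[1], mel_b.shape[1])
    return float(np.mean(np.linalg.norm(mel_a[:, :n] - mel_b[:, :n], axis=0)))

def select_units(uss_mels, e2e_mels, reference_mels):
    """For each syllable-like unit, keep whichever candidate mel spectrogram
    (USS or E2E) is closer to a reference unit (hypothetical criterion),
    then concatenate the chosen units along the time axis."""
    chosen = []
    for uss_mel, e2e_mel, ref_mel in zip(uss_mels, e2e_mels, reference_mels):
        if spectral_distance(uss_mel, ref_mel) <= spectral_distance(e2e_mel, ref_mel):
            chosen.append(uss_mel)
        else:
            chosen.append(e2e_mel)
    # The concatenated mel spectrogram would then be handed to a neural
    # vocoder such as WaveGlow to generate the output waveform.
    return np.concatenate(chosen, axis=1)

In an actual hybrid pipeline the reference would come from the unit database or from a continuity/join cost rather than the placeholder used above; the sketch only illustrates the "compare per syllable-like unit, keep the better mel spectrogram, then vocode" flow.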