{"title":"利用语义依赖和局部卷积增强文本到语音合成的自然度和音调","authors":"","doi":"10.1016/j.neucom.2024.128430","DOIUrl":null,"url":null,"abstract":"<div><p>Self-attention-based networks have become increasingly popular due to their exceptional performance in parallel training and global context modeling. However, it may fall short of capturing local dependencies, particularly in datasets with strong local correlations. To address this challenge, we propose a novel method that utilizes semantic dependency to extract linguistic information from the original text. The semantic relationship between nodes serves as prior knowledge to refine the self-attention distribution. Additionally, to better fuse local contextual information, we introduce a one-dimensional convolution neural network to generate the query and value matrices in the self-attention mechanism, taking advantage of the strong correlation between input characters. We apply this variant of the self-attention network to text-to-speech tasks and propose a non-autoregressive neural text-to-speech model. To enhance pronunciation accuracy, we separate tones from phonemes as independent features in model training. Experimental results show that our model yields good performance in speech synthesis. Specifically, the proposed method significantly improves the processing of pause, stress, and intonation in speech.</p></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":null,"pages":null},"PeriodicalIF":5.5000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Semantic dependency and local convolution for enhancing naturalness and tone in text-to-speech synthesis\",\"authors\":\"\",\"doi\":\"10.1016/j.neucom.2024.128430\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Self-attention-based networks have become increasingly popular due to their exceptional performance in parallel training and global context modeling. However, it may fall short of capturing local dependencies, particularly in datasets with strong local correlations. To address this challenge, we propose a novel method that utilizes semantic dependency to extract linguistic information from the original text. The semantic relationship between nodes serves as prior knowledge to refine the self-attention distribution. Additionally, to better fuse local contextual information, we introduce a one-dimensional convolution neural network to generate the query and value matrices in the self-attention mechanism, taking advantage of the strong correlation between input characters. We apply this variant of the self-attention network to text-to-speech tasks and propose a non-autoregressive neural text-to-speech model. To enhance pronunciation accuracy, we separate tones from phonemes as independent features in model training. Experimental results show that our model yields good performance in speech synthesis. Specifically, the proposed method significantly improves the processing of pause, stress, and intonation in speech.</p></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2024-08-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231224012013\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224012013","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Semantic dependency and local convolution for enhancing naturalness and tone in text-to-speech synthesis
Self-attention-based networks have become increasingly popular due to their exceptional performance in parallel training and global context modeling. However, it may fall short of capturing local dependencies, particularly in datasets with strong local correlations. To address this challenge, we propose a novel method that utilizes semantic dependency to extract linguistic information from the original text. The semantic relationship between nodes serves as prior knowledge to refine the self-attention distribution. Additionally, to better fuse local contextual information, we introduce a one-dimensional convolution neural network to generate the query and value matrices in the self-attention mechanism, taking advantage of the strong correlation between input characters. We apply this variant of the self-attention network to text-to-speech tasks and propose a non-autoregressive neural text-to-speech model. To enhance pronunciation accuracy, we separate tones from phonemes as independent features in model training. Experimental results show that our model yields good performance in speech synthesis. Specifically, the proposed method significantly improves the processing of pause, stress, and intonation in speech.
期刊介绍:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.