Semantic dependency and local convolution for enhancing naturalness and tone in text-to-speech synthesis

IF 6.5 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neurocomputing Pub Date : 2024-08-22 DOI:10.1016/j.neucom.2024.128430

Chenglong Jiang, Ying Gao, Wing W.Y. Ng, Jiyong Zhou, Jinghui Zhong, Hongzhong Zhen, Xiping Hu

Volume 608, Article 128430. Full text: https://www.sciencedirect.com/science/article/pii/S0925231224012013

Citations: 0

Semantic dependency and local convolution for enhancing naturalness and tone in text-to-speech synthesis

Self-attention-based networks have become increasingly popular due to their exceptional performance in parallel training and global context modeling. However, they may fall short of capturing local dependencies, particularly in datasets with strong local correlations. To address this challenge, we propose a novel method that utilizes semantic dependency to extract linguistic information from the original text. The semantic relationship between nodes serves as prior knowledge to refine the self-attention distribution. Additionally, to better fuse local contextual information, we introduce a one-dimensional convolutional neural network to generate the query and value matrices in the self-attention mechanism, taking advantage of the strong correlation between input characters. We apply this variant of the self-attention network to text-to-speech tasks and propose a non-autoregressive neural text-to-speech model. To enhance pronunciation accuracy, we separate tones from phonemes as independent features in model training. Experimental results show that our model yields good performance in speech synthesis. Specifically, the proposed method significantly improves the processing of pause, stress, and intonation in speech.
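The abstract does not give the exact formulation, so the following is only a rough sketch of the two ideas it names: a one-dimensional convolution producing the query and value matrices, and semantic dependency links acting as a prior on the attention distribution. Everything specific here is an assumption made for illustration: the function names (`conv1d_same`, `dependency_biased_attention`), the single shared kernel per matrix, the use of the raw inputs as keys, and the choice to inject the prior as an additive bias on the attention scores before the softmax. The paper may well do each of these differently.

```python
import math

def conv1d_same(x, kernel):
    """Slide a 1-D kernel over the time axis with zero padding (output length = input length).
    x: list of T feature vectors of size D; kernel: list of K scalar weights, K odd.
    Assumption: one kernel shared across all feature dimensions."""
    T, D, half = len(x), len(x[0]), len(kernel) // 2
    out = []
    for t in range(T):
        vec = [0.0] * D
        for k, w in enumerate(kernel):
            s = t + k - half  # neighbouring position covered by this kernel tap
            if 0 <= s < T:
                for d in range(D):
                    vec[d] += w * x[s][d]
        out.append(vec)
    return out

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dependency_biased_attention(x, dep_edges, q_kernel, v_kernel, bias=1.0):
    """Single-head attention where Q and V come from local convolutions and
    dependency edges (i, j) raise the score between linked positions.
    Assumption: the prior is a symmetric additive bias; keys are the raw inputs."""
    T, D = len(x), len(x[0])
    q = conv1d_same(x, q_kernel)  # queries mix in local context
    v = conv1d_same(x, v_kernel)  # values mix in local context
    k = x
    prior = [[0.0] * T for _ in range(T)]
    for i, j in dep_edges:
        prior[i][j] = prior[j][i] = bias
    weights = []
    for i in range(T):
        scores = [
            sum(q[i][d] * k[j][d] for d in range(D)) / math.sqrt(D) + prior[i][j]
            for j in range(T)
        ]
        weights.append(softmax(scores))
    out = [[sum(weights[i][j] * v[j][d] for j in range(T)) for d in range(D)]
           for i in range(T)]
    return out, weights

# Usage: four tokens with 2-dimensional features; token 0 depends on token 3.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, attn = dependency_biased_attention(
    tokens, dep_edges=[(0, 3)], q_kernel=[0.25, 0.5, 0.25], v_kernel=[0.0, 1.0, 0.0]
)
```

An additive bias is used here because it leaves each attention row a valid probability distribution while shifting mass toward dependency-linked positions. A multiplicative mask or a learned gate would be equally plausible readings of "prior knowledge to refine the self-attention distribution".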

Source journal

Neurocomputing (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.10

Self-citation rate: 10.00%

Articles published: 1382

Review time: 70 days

Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.