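Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks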

arXiv - CS - Sound | Pub Date: 2024-03-26 | DOI: arxiv-2403.17378

Yang Ai, Zhen-Hua Ling

Citations: 0

Abstract

This paper presents a novel neural speech phase prediction model that predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture, the core module for direct wrapped phase prediction, consists of two parallel linear convolutional layers and a phase calculation formula. It imitates the process of calculating phase spectra from the real and imaginary parts of complex spectra and strictly restricts the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses, defined between the predicted wrapped phase spectra and the natural ones, by activating the instantaneous phase error, group delay error and instantaneous angular frequency error with an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions with knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that the proposed neural speech phase prediction model outperforms iterative phase estimation algorithms and other neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with the HiFi-GAN-based waveform reconstruction method, the proposed model also shows outstanding efficiency advantages while ensuring the quality of the synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra only via neural networks.
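
The abstract does not give the formulas, so the following is only a minimal sketch of the two ideas it names, written in plain Python under stated assumptions. It assumes the phase calculation formula behaves like the two-argument arctangent applied to the outputs of the two parallel layers, and it uses the distance to the nearest multiple of 2*pi as one possible anti-wrapping function that is even, 2*pi-periodic and monotonic on [0, pi]. The function names, the frames-by-bins list layout and the finite-difference approximations of group delay and instantaneous angular frequency are illustrative choices, not taken from the paper.

```python
import math


def phase_from_parts(pseudo_real, pseudo_imag):
    """Wrapped phase from two values standing in for the outputs of the two
    parallel layers (assumption: the formula acts like atan2).
    The result always lies in [-pi, pi]."""
    return math.atan2(pseudo_imag, pseudo_real)


def anti_wrap(x):
    """Assumed anti-wrapping function: distance from x to the nearest
    multiple of 2*pi. It is even, 2*pi-periodic and increasing on [0, pi]."""
    return abs(x - 2.0 * math.pi * round(x / (2.0 * math.pi)))


def diff_along_frequency(phase):
    """Finite difference across frequency bins (rows = frames, cols = bins).
    Group delay is the negative of this; the sign does not affect the loss
    because anti_wrap is even."""
    return [[row[k + 1] - row[k] for k in range(len(row) - 1)] for row in phase]


def diff_along_time(phase):
    """Finite difference across frames, a discrete stand-in for the
    instantaneous angular frequency."""
    bins = len(phase[0])
    return [[phase[t + 1][k] - phase[t][k] for k in range(bins)]
            for t in range(len(phase) - 1)]


def mean_anti_wrapped_error(pred, target):
    """Mean of anti_wrap(pred - target) over all time-frequency points."""
    errors = [anti_wrap(p - q)
              for pred_row, target_row in zip(pred, target)
              for p, q in zip(pred_row, target_row)]
    return sum(errors) / len(errors)


def anti_wrapping_losses(pred, target):
    """Return (instantaneous phase, group delay, instantaneous angular
    frequency) losses between predicted and reference wrapped phases."""
    ip = mean_anti_wrapped_error(pred, target)
    gd = mean_anti_wrapped_error(diff_along_frequency(pred),
                                 diff_along_frequency(target))
    iaf = mean_anti_wrapped_error(diff_along_time(pred),
                                  diff_along_time(target))
    return ip, gd, iaf


if __name__ == "__main__":
    # Two phases just on either side of the wrap point differ by almost 2*pi
    # numerically, yet the anti-wrapped error is small.
    near_plus_pi = math.pi - 0.1
    near_minus_pi = -math.pi + 0.1
    print("raw difference:", near_plus_pi - near_minus_pi)
    print("anti-wrapped error:", anti_wrap(near_plus_pi - near_minus_pi))
```

The example at the bottom illustrates the error expansion issue the abstract mentions: a plain absolute or squared error would treat two nearly identical phases on opposite sides of the wrap point as far apart, whereas the anti-wrapped error stays close to the true angular distance.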
