Improve GAN-based Neural Vocoder using Truncated Pointwise Relativistic Least Square GAN

Yanli Li, Congyi Wang
Published in: Proceedings of the 4th International Conference on Advanced Information Science and System, 2022-11-25
DOI: 10.1145/3573834.3574506 (https://doi.org/10.1145/3573834.3574506)
Citations: 2

Abstract

Neural vocoders are widely used in modern text-to-speech (TTS) and voice conversion (VC) systems because of their high generation quality and fast inference speed. Recently, GAN-based neural vocoders have attracted great interest because their lightweight, parallel structures let them generate high-fidelity waveforms in real time. Most existing GAN-based vocoders adopt the Least Square GAN (LSGAN) training framework. In this paper, we analyze the weaknesses of the LSGAN waveform synthesis framework and, inspired by Relativistic GAN, propose a simple yet effective variant of the LSGAN framework, named Truncated Pointwise Relativistic LSGAN (T-PRLSGAN). In this method, we consider the pointwise truism score distribution of real and fake wave segments and combine the mean squared error (MSE) loss with the proposed truncated pointwise relative discrepancy loss, making it harder for the generator to fool the discriminator and thereby improving audio generation quality and stability. To demonstrate the effectiveness and generalization ability of our method, we conducted subjective and objective experiments on the Avocodo, UnivNet, HiFiGAN, ParallelWaveGAN, and MelGAN vocoders; the results show a consistent performance boost over these typical LSGAN-based vocoders. Moreover, T-PRLSGAN supports multiple types of discriminators, i.e., the multi-scale wave discriminator (MSD), multi-period discriminator (MPD), and multi-resolution spectrogram discriminator (MRD), without modifying their architecture or inference speed.
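To make the idea of combining the MSE loss with a truncated pointwise relative term concrete, the following is a minimal sketch, not the formulation from the paper. The LSGAN parts (real scores pushed toward 1, fake scores toward 0) are standard. The form of the relative term is an assumption made for illustration: the abstract does not give the equation, so the hinge-style truncation, the `margin` and `weight` parameters, and all function names here are hypothetical.

```python
from typing import Sequence


def mse_to_target(scores: Sequence[float], target: float) -> float:
    """Mean squared error between pointwise scores and a constant target."""
    return sum((s - target) ** 2 for s in scores) / len(scores)


def truncated_relative_term(
    higher: Sequence[float], lower: Sequence[float], margin: float
) -> float:
    """Assumed truncation (illustrative only): penalize just those points
    where `higher - lower` falls short of `margin`; points that already
    exceed the margin contribute zero."""
    gaps = [max(0.0, margin - (h - l)) for h, l in zip(higher, lower)]
    return sum(g ** 2 for g in gaps) / len(gaps)


def discriminator_loss(
    d_real: Sequence[float],
    d_fake: Sequence[float],
    margin: float = 1.0,  # hypothetical hyperparameter
    weight: float = 1.0,  # hypothetical hyperparameter
) -> float:
    """Standard LSGAN discriminator loss plus the assumed relative term."""
    lsgan = mse_to_target(d_real, 1.0) + mse_to_target(d_fake, 0.0)
    return lsgan + weight * truncated_relative_term(d_real, d_fake, margin)


def generator_loss(
    d_real: Sequence[float],
    d_fake: Sequence[float],
    margin: float = 1.0,  # hypothetical hyperparameter
    weight: float = 1.0,  # hypothetical hyperparameter
) -> float:
    """Standard LSGAN generator loss plus the assumed relative term,
    with the roles of real and fake scores swapped."""
    lsgan = mse_to_target(d_fake, 1.0)
    return lsgan + weight * truncated_relative_term(d_fake, d_real, margin)
```

In this sketch, the inputs are the discriminator's pointwise output scores for a real and a generated wave segment, paired position by position. Because the relative term compares fake scores against real scores rather than against a fixed target alone, the generator is penalized whenever its scores trail the real ones, which is one way to read the abstract's claim that the generator finds it harder to fool the discriminator.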