Improve GAN-based Neural Vocoder using Truncated Pointwise Relativistic Least Square GAN
Yanli Li, Congyi Wang
Proceedings of the 4th International Conference on Advanced Information Science and System, published 2022-11-25
https://doi.org/10.1145/3573834.3574506
Neural vocoders are widely used in modern text-to-speech (TTS) and voice conversion (VC) systems because of their high generation quality and fast inference speed. Recently, GAN-based neural vocoders have attracted great interest for their lightweight, parallel structures, which enable them to generate high-fidelity waveforms in real time. Most existing GAN-based vocoders adopt the Least Squares GAN (LSGAN) training framework. In this paper, we analyze the weaknesses of the LSGAN framework for waveform synthesis and, inspired by Relativistic GAN, propose a simple yet effective variant named Truncated Pointwise Relativistic LSGAN (T-PRLSGAN). In this method, we consider the pointwise truism score distribution of real and fake wave segments and combine the mean squared error (MSE) loss with the proposed truncated pointwise relative discrepancy loss, which makes it harder for the generator to fool the discriminator and leads to improved audio generation quality and stability. To demonstrate the effectiveness and generalization ability of our method, we conduct subjective and objective experiments on the Avocodo, UnivNet, HiFiGAN, ParallelWaveGAN, and MelGAN vocoders, which show a consistent performance boost over the corresponding LSGAN-based vocoders. Moreover, T-PRLSGAN supports multiple types of discriminators, namely the multi-scale wave discriminator (MSD), multi-period discriminator (MPD), and multi-resolution spectrogram discriminator (MRD), without modifying their architecture or inference speed.
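The abstract does not state the loss formulas, so the following is only an illustrative sketch of the general idea: the standard LSGAN objectives (MSE against targets 1 and 0) plus an extra term computed pointwise on aligned real and fake discriminator scores and truncated so that pairs already separated by a margin contribute nothing. The function names, the hinge-style truncation, and the `margin` and `weight` parameters are assumptions made for this example and are not taken from the paper.

```python
from statistics import fmean


def lsgan_d_loss(real_scores, fake_scores):
    """Standard LSGAN discriminator loss: push real scores to 1, fake scores to 0."""
    real_term = fmean((r - 1.0) ** 2 for r in real_scores)
    fake_term = fmean(f ** 2 for f in fake_scores)
    return real_term + fake_term


def lsgan_g_loss(fake_scores):
    """Standard LSGAN generator loss: push fake scores to 1."""
    return fmean((f - 1.0) ** 2 for f in fake_scores)


def truncated_relative_term(higher, lower, margin):
    """Hypothetical truncated pointwise relative discrepancy (an assumption).

    For each aligned pair of scores, penalize the squared shortfall of
    (higher - lower) below `margin`. Pairs that already meet the margin
    are truncated to zero.
    """
    return fmean(max(0.0, margin - (h - l)) ** 2 for h, l in zip(higher, lower))


def combined_d_loss(real_scores, fake_scores, weight=1.0, margin=1.0):
    """MSE loss plus the relative term: real scores should exceed fake ones."""
    relative = truncated_relative_term(real_scores, fake_scores, margin)
    return lsgan_d_loss(real_scores, fake_scores) + weight * relative


def combined_g_loss(real_scores, fake_scores, weight=1.0, margin=0.0):
    """MSE loss plus the relative term: fake scores should catch up with real ones."""
    relative = truncated_relative_term(fake_scores, real_scores, margin)
    return lsgan_g_loss(fake_scores) + weight * relative


if __name__ == "__main__":
    # Toy pointwise scores for one real and one generated wave segment.
    real = [1.0, 1.0]
    fake = [0.0, 0.5]
    print("D loss:", combined_d_loss(real, fake))
    print("G loss:", combined_g_loss(real, fake))
```

In this sketch the generator is penalized not only for scores far from 1 but also for every point where its score trails the real segment's score at the same position, which is one plausible reading of how a relative term could make the discriminator harder to fool. Because the term only consumes discriminator outputs, it would not require changes to the discriminator architecture.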