Huaifeng Zhang, Pengfei Wu, Guigeng Li, Yuan An, Hao Zhang
{"title":"A streaming variable neural speech codec","authors":"Huaifeng Zhang, Pengfei Wu, Guigeng Li, Yuan An, Hao Zhang","doi":"10.1016/j.engappai.2025.112418","DOIUrl":null,"url":null,"abstract":"<div><div>This paper presents a variable bit rate streaming neural speech codec designed for ultra-low bit rate scenarios, based on the SoundStream network framework. The codec employs the vector quantized variational auto-encoder (VQ-VAE) algorithm to capture the temporal structure and spectral characteristics of the speech signal, and constructs a latent space codebook to facilitate the effective mapping of feature vectors to discrete vectors. Based on the harmonic characteristics of speech signals and the inherent defects of single-scale discriminators, we introduce multi-period discriminators and multi-scale discriminators. The training process uses a balanced training strategy to ensure the balance between codebook utilization and training weights, and utilizes the Short-Time Fourier Transform (STFT) spectrum that can provide more accurate time–frequency resolution to compute the reconstruction loss. We introduce codebook loss to improve the utilization rate of the codebook and accelerate the convergence of the model. In the inference process, we use a quantizer selection strategy to achieve adaptive adjustment of variable bitrate. Objective and subjective experiments demonstrate that our proposed new neural speech codec outperforms traditional classical speech codecs and existing neural speech codecs in terms of reconstructed speech naturalness and quality while maintaining the low latency characteristic of neural speech codecs. With a multi-stimulus test with hidden reference and anchor (MUSHRA) score of 87, it is highly suitable for ultra-low bit rate speech compression applications such as satellite speech communication and narrowband instant messaging. 
The demo has been publicly released at <span><span>https://svcodec.github.io/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50523,"journal":{"name":"Engineering Applications of Artificial Intelligence","volume":"162 ","pages":"Article 112418"},"PeriodicalIF":8.0000,"publicationDate":"2025-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering Applications of Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0952197625024492","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
This paper presents a variable-bit-rate streaming neural speech codec for ultra-low-bit-rate scenarios, built on the SoundStream network framework. The codec employs the vector-quantized variational autoencoder (VQ-VAE) algorithm to capture the temporal structure and spectral characteristics of the speech signal, and constructs a latent-space codebook so that feature vectors can be mapped effectively to discrete vectors. Motivated by the harmonic structure of speech signals and the inherent shortcomings of single-scale discriminators, we introduce multi-period and multi-scale discriminators. Training uses a balanced strategy to maintain the balance between codebook utilization and training weights, and computes the reconstruction loss on the Short-Time Fourier Transform (STFT) spectrum, which provides more accurate time-frequency resolution. We further introduce a codebook loss to improve codebook utilization and accelerate model convergence. At inference time, a quantizer selection strategy adaptively adjusts the variable bitrate. Objective and subjective experiments demonstrate that the proposed neural speech codec outperforms classical speech codecs and existing neural speech codecs in the naturalness and quality of reconstructed speech while retaining the low-latency characteristic of neural speech codecs. With a multi-stimulus test with hidden reference and anchor (MUSHRA) score of 87, it is well suited to ultra-low-bit-rate speech compression applications such as satellite speech communication and narrowband instant messaging. The demo has been publicly released at https://svcodec.github.io/.
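The VQ-VAE step described above maps each encoder output to its nearest codeword in a learned latent-space codebook, and the codebook/commitment loss terms keep the codebook and encoder aligned. The following is a minimal numpy sketch of that quantization step, not the authors' implementation; the codebook size (64), latent dimension (8), and loss weight `beta` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))  # 64 codewords, 8-dim latents (illustrative)

def quantize(z):
    """Map each latent vector to its nearest codeword by L2 distance."""
    # Pairwise squared distances between latents (N, D) and codewords (K, D)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

def vq_loss(z, zq, beta=0.25):
    """Codebook loss pulls codewords toward encoder outputs; the
    commitment term (weighted by beta) pulls the encoder toward the
    codebook. In a trained model these act through stop-gradients."""
    codebook_loss = ((zq - z) ** 2).mean()
    commitment_loss = beta * ((z - zq) ** 2).mean()
    return codebook_loss + commitment_loss

z = rng.normal(size=(16, 8))   # stand-in for encoder outputs
zq, idx = quantize(z)
loss = vq_loss(z, zq)
```

Only the integer indices `idx` need to be transmitted; the decoder looks the codewords up in the same codebook, which is what makes the discrete latent space compressible.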
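The quantizer selection strategy for variable bitrate can be illustrated with a residual vector quantizer: each additional quantizer stage refines the residual of the previous one, so using fewer stages at inference lowers the bitrate. The sketch below is a hedged illustration of this idea, not the paper's codec; the number of stages, codebook size, and frame rate are assumed values for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_stages, K, D = 8, 64, 8                    # assumed configuration
books = rng.normal(size=(n_stages, K, D))    # one codebook per stage

def rvq_encode(z, n_active):
    """Quantize residually with the first n_active stages.
    Fewer active stages -> fewer codes -> lower bitrate."""
    residual = z.copy()
    recon = np.zeros_like(z)
    codes = []
    for s in range(n_active):
        d = ((residual[:, None, :] - books[s][None]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        q = books[s][idx]
        recon += q
        residual -= q
        codes.append(idx)
    return recon, codes

frame_rate = 50                    # frames per second, assumed
bits_per_stage = int(np.log2(K))   # 6 bits per code for K = 64

def bitrate(n_active):
    """Bits per second when n_active quantizer stages are transmitted."""
    return frame_rate * n_active * bits_per_stage

z = rng.normal(size=(16, D))
recon_low, codes_low = rvq_encode(z, 2)    # coarse, low bitrate
recon_high, codes_high = rvq_encode(z, 8)  # finer, higher bitrate
```

Because every prefix of the code sequence is itself a valid (coarser) encoding, the encoder can drop trailing stages on the fly, which is how this family of codecs adapts bitrate without retraining.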
About the journal
Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.