A streaming variable neural speech codec

IF 8 | Zone 2, Computer Science | JCR Q1, Automation & Control Systems
Huaifeng Zhang, Pengfei Wu, Guigeng Li, Yuan An, Hao Zhang
Journal: Engineering Applications of Artificial Intelligence, vol. 162, Article 112418
Published: 2025-09-27 (Journal Article)
DOI: 10.1016/j.engappai.2025.112418
URL: https://www.sciencedirect.com/science/article/pii/S0952197625024492
Open access: no
Citations: 0

Abstract

This paper presents a variable-bit-rate streaming neural speech codec for ultra-low-bit-rate scenarios, built on the SoundStream network framework. The codec uses the vector quantized variational auto-encoder (VQ-VAE) algorithm to capture the temporal structure and spectral characteristics of the speech signal, and constructs a latent-space codebook that maps feature vectors to discrete vectors. Motivated by the harmonic characteristics of speech signals and the inherent limitations of single-scale discriminators, we introduce multi-period and multi-scale discriminators. Training follows a balanced strategy that keeps codebook utilization and training weights in balance, and computes the reconstruction loss on the Short-Time Fourier Transform (STFT) spectrum, which provides more accurate time–frequency resolution. We introduce a codebook loss to improve codebook utilization and accelerate model convergence. At inference, a quantizer selection strategy adaptively adjusts the bit rate. Objective and subjective experiments show that the proposed codec outperforms classical speech codecs and existing neural speech codecs in the naturalness and quality of reconstructed speech, while maintaining the low latency characteristic of neural speech codecs. With a MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) score of 87, it is well suited to ultra-low-bit-rate speech compression applications such as satellite speech communication and narrowband instant messaging. The demo is publicly available at https://svcodec.github.io/.
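The abstract does not spell out the quantizer selection strategy. As a rough illustration only, the sketch below assumes a SoundStream-style residual vector quantizer, in which each stage quantizes the residual left by the previous one, so that using fewer stages at inference lowers the bit rate. The function names, the toy codebooks, and the codebook size and frame rate in the bit-rate formula are hypothetical and are not taken from the paper.

```python
import math


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks, n_quantizers):
    """Encode vec with the first n_quantizers stages (assumed residual VQ)."""
    residual = list(vec)
    indices = []
    for codebook in codebooks[:n_quantizers]:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # The next stage only sees what this stage failed to represent.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries of however many stages were transmitted."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def bitrate_bps(n_quantizers, codebook_size, frames_per_second):
    """Bits per second when each stage sends one index per frame."""
    return n_quantizers * math.log2(codebook_size) * frames_per_second


# Toy two-stage codebooks (hypothetical values, for illustration only).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5]],
]
frame = [1.4, 0.6]
coarse = rvq_encode(frame, codebooks, n_quantizers=1)
fine = rvq_encode(frame, codebooks, n_quantizers=2)
print(rvq_decode(coarse, codebooks))  # [1.0, 1.0]
print(rvq_decode(fine, codebooks))    # [1.5, 0.5]
```

Under these assumptions, the bit rate scales linearly with the number of active stages: for example, `bitrate_bps(4, 1024, 50)` gives 2000.0 bits per second, and dropping to two stages halves it. The actual codebook sizes, frame rate, and selection rule used by the authors are not given in the abstract.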
Source journal
Engineering Applications of Artificial Intelligence (Engineering & Technology: Engineering, Electrical & Electronic)
CiteScore: 9.60
Self-citation rate: 10.00%
Articles published: 505
Review time: 68 days
Journal description: Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.