Neural Speech Coding for Real-Time Communications Using Constant Bitrate Scalar Quantization

IF 8.7 1区工程技术 Q1 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal of Selected Topics in Signal Processing Pub Date : 2024-11-04 DOI:10.1109/JSTSP.2024.3491575

Andreas Brendel;Nicola Pia;Kishan Gupta;Lyonel Behringer;Guillaume Fuchs;Markus Multrus

{"title":"Neural Speech Coding for Real-Time Communications Using Constant Bitrate Scalar Quantization","authors":"Andreas Brendel;Nicola Pia;Kishan Gupta;Lyonel Behringer;Guillaume Fuchs;Markus Multrus","doi":"10.1109/JSTSP.2024.3491575","DOIUrl":null,"url":null,"abstract":"Neural audio coding has emerged as a vivid research direction by promising good audio quality at very low bitrates unachievable by classical coding techniques. Here, end-to-end trainable autoencoder-like models represent the state of the art, where a discrete representation in the bottleneck of the autoencoder is learned. This allows for efficient transmission of the input audio signal. The learned discrete representation of neural codecs is typically generated by applying a quantizer to the output of the neural encoder. In almost all state-of-the-art neural audio coding approaches, this quantizer is realized as a Vector Quantizer (VQ) and a lot of effort has been spent to alleviate drawbacks of this quantization technique when used together with a neural audio coder. In this paper, we propose and analyze simple alternatives to VQ, which are based on projected Scalar Quantization (SQ). These quantization techniques do not need any additional losses, scheduling parameters or codebook storage thereby simplifying the training of neural audio codecs. For real-time speech communication applications, these neural codecs are required to operate at low complexity, low latency and at low bitrates. We address those challenges by proposing a new causal network architecture that is based on SQ and a Short-Time Fourier Transform (STFT) representation. The proposed method performs particularly well in the very low complexity and low bitrate regime.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 8","pages":"1462-1476"},"PeriodicalIF":8.7000,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Selected Topics in Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10742547/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}

引用次数: 0

Abstract

Neural audio coding has emerged as a vivid research direction by promising good audio quality at very low bitrates unachievable by classical coding techniques. Here, end-to-end trainable autoencoder-like models represent the state of the art, where a discrete representation in the bottleneck of the autoencoder is learned. This allows for efficient transmission of the input audio signal. The learned discrete representation of neural codecs is typically generated by applying a quantizer to the output of the neural encoder. In almost all state-of-the-art neural audio coding approaches, this quantizer is realized as a Vector Quantizer (VQ) and a lot of effort has been spent to alleviate drawbacks of this quantization technique when used together with a neural audio coder. In this paper, we propose and analyze simple alternatives to VQ, which are based on projected Scalar Quantization (SQ). These quantization techniques do not need any additional losses, scheduling parameters or codebook storage thereby simplifying the training of neural audio codecs. For real-time speech communication applications, these neural codecs are required to operate at low complexity, low latency and at low bitrates. We address those challenges by proposing a new causal network architecture that is based on SQ and a Short-Time Fourier Transform (STFT) representation. The proposed method performs particularly well in the very low complexity and low bitrate regime.

查看原文本刊更多论文

神经音频编码已成为一个生动的研究方向，它有望在传统编码技术无法实现的极低比特率下实现良好的音频质量。在这里，端到端可训练的类似自动编码器的模型代表了目前的技术水平，在自动编码器的瓶颈处学习离散表示。这样就能高效传输输入音频信号。神经编解码器的离散表示通常是通过对神经编码器的输出应用量化器来生成的。在几乎所有最先进的神经音频编码方法中，这种量化器都是以矢量量化器（VQ）的形式实现的，人们花费了大量精力来减轻这种量化技术与神经音频编码器结合使用时的缺点。在本文中，我们提出并分析了基于投影标量量化（SQ）的 VQ 简单替代方案。这些量化技术不需要任何额外的损耗、调度参数或编码本存储，从而简化了神经音频编解码器的训练。对于实时语音通信应用，这些神经编解码器需要在低复杂度、低延迟和低比特率的条件下运行。为了应对这些挑战，我们提出了一种基于 SQ 和短时傅立叶变换 (STFT) 表示法的新因果网络架构。所提出的方法在极低复杂度和低比特率条件下表现尤为出色。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Journal of Selected Topics in Signal Processing 工程技术-工程：电子与电气

CiteScore

19.00

自引率

1.30%

发文量

135

审稿时长

3 months

期刊介绍： The IEEE Journal of Selected Topics in Signal Processing (JSTSP) focuses on the Field of Interest of the IEEE Signal Processing Society, which encompasses the theory and application of various signal processing techniques. These techniques include filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals using digital or analog devices. The term "signal" covers a wide range of data types, including audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and others. The journal format allows for in-depth exploration of signal processing topics, enabling the Society to cover both established and emerging areas. This includes interdisciplinary fields such as biomedical engineering and language processing, as well as areas not traditionally associated with engineering.