{"title":"DARIO: Differentiable Vision Transformer Pruning With Low-Cost Proxies","authors":"Haozhe Sun;Alexandre Heuillet;Felix Mohr;Hedi Tabia","doi":"10.1109/JSTSP.2024.3501685","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3501685","url":null,"abstract":"Transformer models have gained popularity for their exceptional performance. However, these models still face the challenge of high inference latency. To improve the computational efficiency of such models, we propose a novel differentiable pruning method called DARIO (<bold>D</b>ifferenti<bold>A</b>ble vision transformer p<bold>R</b>un<bold>I</b>ng with low-cost pr<bold>O</b>xies). Our approach involves optimizing a set of gating parameters using differentiable, data-agnostic, scale-invariant, and low-cost performance proxies. DARIO is a data-agnostic pruning method, it does not need any classification heads during pruning. We evaluated DARIO on two popular state-of-the-art pre-trained ViT models, including both large (MAE-ViT) and small (MobileViT) sizes. Extensive experiments conducted across 40 diverse datasets demonstrated the effectiveness and efficiency of our DARIO method. DARIO not only significantly improves inference efficiency on modern hardware but also excels in preserving accuracy. Notably, DARIO has even achieved an increase in accuracy on MobileViT, despite only fine-tuning the last block and the classification head.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 6","pages":"997-1009"},"PeriodicalIF":8.7,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved Alias-and-Separate Speech Coding Framework With Minimal Algorithmic Delay","authors":"Eunkyun Lee;Seungkwon Beack;Jong Won Shin","doi":"10.1109/JSTSP.2024.3501681","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3501681","url":null,"abstract":"Alias-and-Separate (AaS) speech coding framework has shown the possibility to encode wideband (WB) speech with a narrowband (NB) speech codec and reconstruct it using speech separation. WB speech is first decimated incurring aliasing and then coded, transmitted, and decoded with a NB codec. The decoded signal is then separated into lower band and spectrally-flipped high band using a speech separation module, which are expanded, lowpass/highpass filtered, and added together to reconstruct the WB speech. The original AaS system, however, has algorithmic delay originated from the overlap-add operation for consecutive segments. This algorithmic delay can be reduced by omitting the overlap-add procedure, but the quality of the reconstructed speech is also degraded due to artifacts on the segment boundaries. In this work, we propose an improved AaS framework with minimum algorithmic delay. The decoded signal is first expanded by inserting zeros in-between samples before being processed by source separation module. As the expanded signal can be viewed as a summation of the frequency-shifted versions of the original signal, the decoded-and-expanded signal is then separated into the frequency-shifted signals, which are multiplied by complex exponentials and summed up to reconstruct the original signal. With carefully designed transposed convolution operation in the separation module, the proposed system requires minimal algorithmic delay while preventing discontinuity at the segment boundaries. Additionally, we propose to employ a generative vocoder to further improve the perceived quality and a modified multi-resolution short-time Fourier transform (MR-STFT) loss. Experimental results on the WB speech coding with a NB codec demonstrated that the proposed system outperformed the original AaS system and the existing WB speech codec in the subjective listening test. We have also shown that the proposed method can be applied when the decimation factor is not 2 in the experiment on the fullband speech coding with a WB codec.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 8","pages":"1414-1426"},"PeriodicalIF":8.7,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143184471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perceptual Neural Audio Coding With Modified Discrete Cosine Transform","authors":"Hyungseob Lim;Jihyun Lee;Byeong Hyeon Kim;Inseon Jang;Hong-Goo Kang","doi":"10.1109/JSTSP.2024.3491576","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3491576","url":null,"abstract":"Despite efforts to leverage the modeling power of deep neural networks (DNNs) in audio coding, effectively deploying them in real-world applications is still problematic due to their high computational cost and the restricted range of target signals or achievable bit-rates. In this paper, we propose an alternative approach for integrating DNNs into a perceptual audio coder that allows for the optimization of the whole system in a data-driven, end-to-end manner. The key idea of the proposed method is to make DNNs control the quantization noise in the classic transform coding framework, specifically based on the modified discrete cosine transform (MDCT). The proposal includes a new DNN-based mechanism for adaptively adjusting the quantization step sizes of frequency bands targeting an arbitrary bit-rate, eventually acting as a data-driven differentiable psychoacoustic model. The side information regarding the adaptive quantization is also encoded and decoded by DNNs via learned representation. During training, the perceptual distortion is evaluated by a perceptual quality estimation model trained on actual human ratings so that the proposed audio codec can effectively allocate bits considering their effect on the perceptual quality. Through comparisons with legacy audio codecs (MP3 and AAC) and a neural audio codec (EnCodec), we show that our method can achieve further coding gains over the legacy codecs with a substantially lower computational load on the decoder compared to other neural audio codecs.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 8","pages":"1490-1505"},"PeriodicalIF":8.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143184460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Signal Processing Society Information","authors":"","doi":"10.1109/JSTSP.2024.3459324","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3459324","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 4","pages":"C2-C2"},"PeriodicalIF":8.7,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10744618","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Signal Processing Society Information","authors":"","doi":"10.1109/JSTSP.2024.3459322","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3459322","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 4","pages":"C3-C3"},"PeriodicalIF":8.7,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10744789","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Issue Near-Field Signal Processing: Algorithms, Implementations and Applications","authors":"Ahmet M. Elbir;Kumar Vijay Mishra;Özlem Tuğfe Demir;Emil Björnson;Angel Lozano","doi":"10.1109/JSTSP.2024.3465108","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3465108","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 4","pages":"541-545"},"PeriodicalIF":8.7,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10744777","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multipath Component Power Delay Profile Based Ranging","authors":"Fangqing Xiao;Zilu Zhao;Dirk T. M. Slock","doi":"10.1109/JSTSP.2024.3491580","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3491580","url":null,"abstract":"Precision ranging technology has become indispensable for ensuring efficient, reliable, and low-latency fifth-generation (5G) networks. In this paper, we propose a novel ranging method which is multipath component (MPC) power delay profile (PDP) based ranging. Whereas the Received Signal Strength (RSS) only summarizes the PDP into a single characteristic, we aim to furthermore exploit the range dependent curvature of the PDP envelope over its delay spread. However, the multipath propagation only allows to sample the PDP envelope at the path delays and suffers from (slow) fading. Hence our approach involves constructing a statistical fading model of the PDP and establishing a relationship between the distribution parameters and the propagation distance. To theoretically validate the feasibility of our proposed method, we adopt the widely accepted Nakagami-m fading model, which renders traditional estimation methods difficult to apply. Therefore we introduce the Expectation Maximization (EM)-Revisited Vector Approximate Message Passing (ReVAMP) algorithm. This algorithm is specifically designed to handle difficulties in parameter estimation for Gaussian linear models (GLMs) with hidden random variables and intractable posterior distributions. Extensive numerical simulation results have been conducted which exhibit preliminary evidence of the effectiveness of our MPCPDP-based ranging method compared to the received signal strength (RSS)-based method. Moreover, the versatility of the EM-ReVAMP algorithm allows for its extension to other statistical fading models beyond the Nakagami-m model with minor modifications, which opens the door to potential improvements based on more accurate statistical fading models. Nevertheless, the applicability of our MPCPDP-based ranging method should be validated in real-world scenarios in future studies.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 5","pages":"950-963"},"PeriodicalIF":8.7,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142938141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural Speech Coding for Real-Time Communications Using Constant Bitrate Scalar Quantization","authors":"Andreas Brendel;Nicola Pia;Kishan Gupta;Lyonel Behringer;Guillaume Fuchs;Markus Multrus","doi":"10.1109/JSTSP.2024.3491575","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3491575","url":null,"abstract":"Neural audio coding has emerged as a vivid research direction by promising good audio quality at very low bitrates unachievable by classical coding techniques. Here, end-to-end trainable autoencoder-like models represent the state of the art, where a discrete representation in the bottleneck of the autoencoder is learned. This allows for efficient transmission of the input audio signal. The learned discrete representation of neural codecs is typically generated by applying a quantizer to the output of the neural encoder. In almost all state-of-the-art neural audio coding approaches, this quantizer is realized as a Vector Quantizer (VQ) and a lot of effort has been spent to alleviate drawbacks of this quantization technique when used together with a neural audio coder. In this paper, we propose and analyze simple alternatives to VQ, which are based on projected Scalar Quantization (SQ). These quantization techniques do not need any additional losses, scheduling parameters or codebook storage thereby simplifying the training of neural audio codecs. For real-time speech communication applications, these neural codecs are required to operate at low complexity, low latency and at low bitrates. We address those challenges by proposing a new causal network architecture that is based on SQ and a Short-Time Fourier Transform (STFT) representation. The proposed method performs particularly well in the very low complexity and low bitrate regime.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 8","pages":"1462-1476"},"PeriodicalIF":8.7,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143184458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations","authors":"Xue Jiang;Xiulian Peng;Yuan Zhang;Yan Lu","doi":"10.1109/JSTSP.2024.3488557","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3488557","url":null,"abstract":"Current large speech language models are mainly based on semantic tokens from discretization of self-supervised learned representations and acoustic tokens from a neural codec, following a semantic-modeling and acoustic-synthesis paradigm. However, semantic tokens discard paralinguistic attributes of speakers that is important for natural spoken communication, while prompt-based acoustic synthesis from semantic tokens has limits in recovering paralinguistic details and suffers from robustness issues, especially when there are domain gaps between the prompt and the target. This paper unifies two types of tokens and proposes the UniCodec, a universal speech token learning that encapsulates all semantics of speech, including linguistic and paralinguistic information, into a compact and semantically-disentangled unified token. Such a unified token can not only benefit speech language models in understanding with paralinguistic hints but also help speech generation with high-quality output. A low-bitrate neural codec is leveraged to learn such disentangled discrete representations at global and local scales, with knowledge distilled from self-supervised learned features. Extensive evaluations on multilingual datasets demonstrate its effectiveness in generating natural, expressive and long-term consistent output quality with paralinguistic attributes well preserved in several speech processing tasks.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 8","pages":"1477-1489"},"PeriodicalIF":8.7,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143184459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Unified Activity Detection Framework for Massive Access: Beyond the Block-Fading Paradigm","authors":"Jianan Bai;Erik G. Larsson","doi":"10.1109/JSTSP.2024.3486200","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3486200","url":null,"abstract":"The wireless channel changes continuously with time and frequency and the block-fading assumption, which is popular in many theoretical analyses, never holds true in practical scenarios. This discrepancy is critical for user activity detection in grant-free random access, where joint processing across multiple coherence blocks is undesirable, especially when the environment becomes more dynamic. In this paper, we develop a framework for low-dimensional approximation of the channel to capture its variations over time and frequency, and use this framework to implement robust activity detection algorithms. Furthermore, we investigate how to efficiently estimate the principal subspace that defines the low-dimensional approximation. We also examine pilot hopping as a way of exploiting time and frequency diversity in scenarios with limited channel coherence, and extend our algorithms to this case. Through numerical examples, we demonstrate a substantial performance improvement achieved by our proposed framework.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 7","pages":"1366-1380"},"PeriodicalIF":8.7,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}