{"title":"KFA: Keyword Feature Augmentation for Open Set Keyword Spotting","authors":"Kyungdeuk Ko;Bokyeung Lee;Jonghwan Hong;Hanseok Ko","doi":"10.1109/LSP.2024.3484932","DOIUrl":"https://doi.org/10.1109/LSP.2024.3484932","url":null,"abstract":"In recent years, with the advancement of deep learning technology and the emergence of smart devices, there has been a growing interest in keyword spotting (KWS), which is used to activate AI systems with automatic speech recognition and text-to-speech. However, smart devices with KWS often encounter false alarm errors when inputting unexpected words. To address this issue, existing KWS methods typically train non-target words as an unknown class. Despite these efforts, there is still a possibility that unseen words not trained as part of the unknown class could be misclassified as one of the target words. To overcome this limitation, we propose a new method named Keyword Feature Augmentation (KFA) for open-set KWS. KFA performs feature augmentation through adversarial learning to increase the loss. The augmented features are constrained within a limited space using label smoothing. Unlike other generative model-based open set recognition (OSR) methods, KFA does not require any additional training parameters or repeated operation for inference.
As a result, KFA has achieved a 0.955 AUROC score and 97.34% target class accuracy for Google Speech Commands V1, and a 0.959 AUROC score and 98.17% target class accuracy for Google Speech Commands V2, which is the highest performance when compared to various OSR methods.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2024-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio Mamba: Bidirectional State Space Model for Audio Representation Learning","authors":"Mehmet Hamza Erol;Arda Senocak;Jiu Feng;Joon Son Chung","doi":"10.1109/LSP.2024.3483009","DOIUrl":"https://doi.org/10.1109/LSP.2024.3483009","url":null,"abstract":"Transformers have rapidly become the preferred choice for audio classification, surpassing methods based on CNNs. However, Audio Spectrogram Transformers (ASTs) exhibit quadratic scaling due to self-attention. The removal of this quadratic self-attention cost presents an appealing direction. Recently, state space models (SSMs), such as Mamba, have demonstrated potential in language and vision tasks in this regard. In this study, we explore whether reliance on self-attention is necessary for audio classification tasks. By introducing Audio Mamba (AuM), the first self-attention-free, purely SSM-based model for audio classification, we aim to address this question. We evaluate AuM on various audio datasets - comprising six different benchmarks - where it achieves performance comparable to or better than the well-established AST model.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"System-Informed Neural Network for Frequency Detection","authors":"Sunyoung Ko;Myoungin Shin;Geunhwan Kim;Youngmin Choo","doi":"10.1109/LSP.2024.3483036","DOIUrl":"https://doi.org/10.1109/LSP.2024.3483036","url":null,"abstract":"We contrive a deep learning-based frequency analysis scheme called system-informed neural network (SINN) by considering the corresponding linear system model. SINN adopts the adaptive learned iterative soft shrinkage algorithm as the NN architecture and includes the system model in the loss function. It has good generalization with fast processing time and finds a solution that satisfies the system model as a physics-informed neural network. To further improve SINN, multiple measurements are exploited by assuming the existence of common frequency components over the measurements. SINN is examined using simulated acoustic data, and the performance is compared to Fourier transform and sparse Bayesian learning (SBL) in terms of the detection/false alarm rate and mean squared error. SINN exhibits clear frequency components in in-situ data tests, as in SBL, by reducing noise effectively. Finally, SINN is applied to noisy passive sonar signals, which include 43 frequency components, and many are recovered.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RFI-Aware and Low-Cost Maximum Likelihood Imaging for High-Sensitivity Radio Telescopes","authors":"J. Wang;M. N. El Korso;L. Bacharach;P. Larzabal","doi":"10.1109/LSP.2024.3483011","DOIUrl":"https://doi.org/10.1109/LSP.2024.3483011","url":null,"abstract":"This paper addresses the challenge of interference mitigation and reduction of computational cost in the context of radio interferometric imaging. We propose a novel maximum-likelihood-based methodology based on the antenna sub-array switching technique, which strikes a refined balance between imaging accuracy and computational efficiency. In addition, we tackle robustness regarding radio interference by modeling the additive noise as t-distributed. Through simulation results, we demonstrate the superiority of the t-distributed noise model over the conventional Gaussian noise model in scenarios involving interferences. We evidence that our proposed switching approach yields similar imaging performances with far fewer visibilities compared to the full array configuration, thus, diminishing the computational complexity.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Order Estimation of Linear-Phase FIR Filters for DAC Equalization in Multiple Nyquist Bands","authors":"Deijany Rodriguez Linares;Håkan Johansson;Yinan Wang","doi":"10.1109/LSP.2024.3483008","DOIUrl":"https://doi.org/10.1109/LSP.2024.3483008","url":null,"abstract":"This letter considers the design and properties of linear-phase finite-length impulse response (FIR) filters for equalization of the frequency responses of digital-to-analog converters (DACs). The letter derives estimates for the filter orders required, as functions of the bandwidth and equalization accuracy, for four DAC pulses that are used in DACs in multiple Nyquist bands. The estimates are derived through a large set of minimax-optimal equalizers and the use of symbolic regression followed by minimax-optimal curve fitting for further enhancement. Design examples demonstrate the accuracy of the proposed estimates. In addition, the letter discusses the appropriateness of the four types of linear-phase FIR filters, for the different equalizer cases, as well as the corresponding properties of the equalized systems.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142525780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Noise Adapters for Incremental Speech Enhancement","authors":"Ziye Yang;Xiang Song;Jie Chen;Cédric Richard;Israel Cohen","doi":"10.1109/LSP.2024.3482171","DOIUrl":"https://doi.org/10.1109/LSP.2024.3482171","url":null,"abstract":"Incremental speech enhancement (ISE), with the ability to incrementally adapt to new noise domains, represents a critical yet comparatively under-investigated topic. While the regularization-based method has been proposed to solve the ISE task, it usually suffers from the dilemma wherein the gain of one domain directly entails the loss of another. To solve this issue, we propose an effective paradigm, termed Learning Noise Adapters (LNA), which significantly mitigates the catastrophic domain forgetting phenomenon in the ISE task. In our methodology, we employ a frozen pre-trained model to train and retain a domain-specific adapter for each newly encountered domain, enabling the capture of variations in feature distributions within these domains. Subsequently, our approach involves the development of an unsupervised, training-free noise selector for the inference stage, which is responsible for identifying the domains of test speech samples. A comprehensive experimental validation has substantiated the effectiveness of our approach.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maximum Entropy and Quantized Metric Models for Absolute Category Ratings","authors":"Dietmar Saupe;Krzysztof Rusek;David Hägele;Daniel Weiskopf;Lucjan Janowski","doi":"10.1109/LSP.2024.3480832","DOIUrl":"https://doi.org/10.1109/LSP.2024.3480832","url":null,"abstract":"The datasets of most image quality assessment studies contain ratings on a categorical scale with five levels, from bad (1) to excellent (5). For each stimulus, the number of ratings from 1 to 5 is summarized and given in the form of the mean opinion score. In this study, we investigate families of multinomial probability distributions parameterized by mean and variance that are used to fit the empirical rating distributions. To this end, we consider quantized metric models based on continuous distributions that model perceived stimulus quality on a latent scale. The probabilities for the rating categories are determined by quantizing the corresponding random variables using threshold values. Furthermore, we introduce a novel discrete maximum entropy distribution for a given mean and variance. We compare the performance of these models and the state of the art given by the generalized score distribution for two large data sets, KonIQ-10k and VQEG HDTV. Given an input distribution of ratings, our fitted two-parameter models predict unseen ratings better than the empirical distribution. 
In contrast to empirical distributions of absolute category ratings and their discrete models, our continuous models can provide fine-grained estimates of quantiles of quality of experience that are relevant to service providers to satisfy a certain fraction of the user population.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pose-Promote: Progressive Visual Perception for Activities of Daily Living","authors":"Qilang Ye;Zitong Yu","doi":"10.1109/LSP.2024.3480046","DOIUrl":"https://doi.org/10.1109/LSP.2024.3480046","url":null,"abstract":"Poses are effective in interpreting fine-grained human activities, especially when encountering complex visual information. Unimodal methods for action recognition perform unsatisfactorily on daily activities due to the lack of a more comprehensive perspective. Multimodal methods that combine pose and visual information are still not exhaustive enough in mining complementary information. Therefore, we propose a Pose-promote (Ppromo) framework that utilizes a priori knowledge of pose joints to perceive visual information progressively. We first introduce a temporal promote module to activate each video segment using temporally synchronized joint weights. Then a spatial promote module is proposed to capture the key regions in visuals using the learned pose attentions. To further refine the bimodal associations, the global inter-promote module is proposed to align global pose-visual semantics at the feature granularity. Finally, a learnable late fusion strategy between visual and pose is applied for accurate inference. Ppromo achieves state-of-the-art performance on three publicly available datasets.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Multidimensional Spatial Attention for Robust Nighttime Visual Tracking","authors":"Qi Gao;Mingfeng Yin;Yuanzhi Ni;Yuming Bo;Shaoyi Bei","doi":"10.1109/LSP.2024.3480831","DOIUrl":"https://doi.org/10.1109/LSP.2024.3480831","url":null,"abstract":"The recent development of advanced trackers, which use nighttime image enhancement technology, has led to marked advances in the performance of visual tracking at night. However, the images recovered by currently available enhancement methods still have some weaknesses, such as blurred target details and obvious image noise. To this end, we propose a novel method for learning multidimensional spatial attention for robust nighttime visual tracking, which is developed over a spatial channel transformer based low light enhancer (SCT), named MSA-SCT. First, a novel multidimensional spatial attention (MSA) is designed. Additional reliable feature responses are generated by aggregating channel and multi-scale spatial information, thus making the model more adaptable to illumination conditions and noise levels in different regions of the image. Second, with optimized skip connections, the effects of redundant information and noise can be limited, which is more useful for the propagation of fine detail features in nighttime images from low to high level features and improves the enhancement effect. 
Finally, the tracker with enhancers was tested on multiple tracking benchmarks to fully demonstrate the effectiveness and superiority of MSA-SCT.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Recurrent Spatio-Temporal Graph Neural Network Based on Latent Time Graph for Multi-Channel Time Series Forecasting","authors":"Linzhi Li;Xiaofeng Zhou;Guoliang Hu;Shuai Li;Dongni Jia","doi":"10.1109/LSP.2024.3479917","DOIUrl":"https://doi.org/10.1109/LSP.2024.3479917","url":null,"abstract":"With the advancement of technology, the field of multi-channel time series forecasting has emerged as a focal point of research. In this context, spatio-temporal graph neural networks have attracted significant interest due to their outstanding performance. An established approach involves integrating graph convolutional networks into recurrent neural networks. However, this approach faces difficulties in capturing dynamic spatial correlations and discerning the correlation of multi-channel time series signals. Another major problem is that the discrete time interval of recurrent neural networks limits the accuracy of spatio-temporal prediction. To address these challenges, we propose a continuous spatio-temporal framework, termed Recurrent Spatio-Temporal Graph Neural Network based on Latent Time Graph (RST-LTG). RST-LTG incorporates adaptive graph convolution networks with a time embedding generator to construct a latent time graph, which subtly captures evolving spatial characteristics by aggregating spatial information across multiple time steps. Additionally, to improve the accuracy of continuous time modeling, we introduce a gate enhanced neural ordinary differential equation that effectively integrates information across multiple scales. 
Empirical results on four publicly available datasets demonstrate that the RST-LTG model outperforms 19 competing methods in terms of accuracy.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":null,"pages":null},"PeriodicalIF":3.2,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142452677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}