{"title":"IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization","authors":"Yabo Wang;Bing Yang;Xiaofei Li","doi":"10.1109/TASLP.2024.3507560","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507560","url":null,"abstract":"Extracting direct-path spatial feature is crucial for sound source localization in adverse acoustic environments. This paper proposes IPDnet, a neural network that estimates direct-path inter-channel phase difference (DP-IPD) of sound sources from microphone array signals. The estimated DP-IPD can be easily translated to source location based on the known microphone array geometry. First, a full-band and narrow-band fusion network is adopted for DP-IPD estimation, in which combined narrow-band and full-band layers are responsible for estimating the raw DP-IPD information in one frequency band and capturing the frequency correlations of DP-IPD, respectively. Second, a new multi-track DP-IPD learning target is proposed for the localization of a flexible number of sound sources. Third, the network is extended to handle variable microphone arrays. This version of IPDnet is trained with a large set of different microphone arrays, and then it is able to infer the source locations using new microphone arrays not seen at training time. Experiments with multiple number of moving speakers are conducted on both simulated and real-world data, which show that the full-band and narrow-band fusion network and the proposed multi-track DP-IPD learning target together achieve excellent sound source localization performance. Moreover, the proposed variable-array model generalizes well to unseen microphone arrays.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5051-5064"},"PeriodicalIF":4.1,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach","authors":"Eloi Moliner;Filip Elvander;Vesa Välimäki","doi":"10.1109/TASLP.2024.3507566","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507566","url":null,"abstract":"Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. This paper introduces a novel method called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem in a zero-shot setting, leveraging the generative priors of a pre-trained unconditional diffusion model. During the inference process, BABE utilizes a generalized version of diffusion posterior sampling, where the degradation operator is unknown but parametrized and inferred iteratively. The performance of the proposed method is evaluated using objective and subjective metrics, and the results show that BABE surpasses state-of-the-art blind bandwidth extension baselines and achieves competitive performance compared to informed methods when tested with synthetic data. Moreover, BABE exhibits robust generalization capabilities when enhancing real historical recordings, effectively reconstructing the missing high-frequency content while maintaining coherence with the original recording. Subjective preference tests confirm that BABE significantly improves the audio quality of historical music recordings. Examples of historical recordings restored with the proposed method are available on the companion webpage: \u0000<uri>http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/</uri>","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5092-5105"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10768977","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142798022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation","authors":"Sufeng Duan;Hai Zhao","doi":"10.1109/TASLP.2024.3507556","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507556","url":null,"abstract":"In this paper, we propose an explanation of representation for self-attention network (SAN) based neural sequence encoders, which regards the information captured by the model and the encoding of the model as graph structure and the generation of these graph structures respectively. The proposed explanation applies to existing works on SAN-based models and can explain the relationship among the ability to capture the structural or linguistic information, depth of model, and length of sentence, and can also be extended to other models such as recurrent neural network based models. We also propose a revisited multigraph called Multi-order-Graph (MoG) based on our explanation to model the graph structures in the SAN-based model as subgraphs in MoG and convert the encoding of the SAN-based model to the generation of MoG. Based on our explanation, we further introduce an MO-Transformer by enhancing the ability to capture multiple subgraphs of different orders and focusing on subgraphs of high orders. Experimental results on multiple neural machine translation tasks show that the MO-Transformer can yield effective performance improvement.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5065-5077"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Neural Speaker Diarization With Target Speaker Tracking","authors":"Weiqing Wang;Ming Li","doi":"10.1109/TASLP.2024.3507559","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507559","url":null,"abstract":"This paper proposes an online target speaker voice activity detection (TS-VAD) system for speaker diarization tasks that does not rely on prior knowledge from clustering-based diarization systems to obtain target speaker embeddings. By adapting conventional TS-VAD for real-time operation, our framework identifies speaker activities using self-generated embeddings, ensuring consistent performance and avoiding permutation inconsistencies during inference. In the inference phase, we employ a front-end model to extract frame-level speaker embeddings for each incoming signal block. Subsequently, we predict each speaker's detection state based on these frame-level embeddings and the previously estimated target speaker embeddings. The target speaker embeddings are then updated by aggregating the frame-level embeddings according to the current block's predictions. Our model predicts results block-by-block and iteratively updates target speaker embeddings until reaching the end of the signal. Experimental results demonstrate that the proposed method outperforms offline clustering-based diarization systems on the DIHARD III and AliMeeting datasets. Additionally, this approach is extended to multi-channel data, achieving comparable performance to state-of-the-art offline diarization systems.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5078-5091"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Interpretable Deep Mutual Information Curriculum Metric for a Robust and Generalized Speech Emotion Recognition System","authors":"Wei-Cheng Lin;Kusha Sridhar;Carlos Busso","doi":"10.1109/TASLP.2024.3507562","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507562","url":null,"abstract":"It is difficult to achieve robust and well-generalized models for tasks involving subjective concepts such as emotion. It is inevitable to deal with noisy labels, given the ambiguous nature of human perception. Methodologies relying on \u0000<italic>semi-supervised learning</i>\u0000 (SSL) and curriculum learning have been proposed to enhance the generalization of the models. This study proposes a novel \u0000<italic>deep mutual information</i>\u0000 (DeepMI) metric, built with the SSL pre-trained DeepEmoCluster framework to establish the difficulty of samples. The DeepMI metric quantifies the relationship between the acoustic patterns and emotional attributes (e.g., arousal, valence, and dominance). The DeepMI metric provides a better curriculum, achieving state-of-the-art performance that is higher than results obtained with existing curriculum metrics for \u0000<italic>speech emotion recognition</i>\u0000 (SER). We evaluate the proposed method with three emotional datasets in matched and mismatched testing conditions. The experimental evaluations systematically show that a model trained with the DeepMI metric not only obtains competitive generalization performances, but also maintains convergence stability. Furthermore, the extracted DeepMI values are highly interpretable, reflecting information ranks of the training samples.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5117-5130"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10768985","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models","authors":"Taegyun Kwon;Dasaem Jeong;Juhan Nam","doi":"10.1109/TASLP.2024.3507568","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507568","url":null,"abstract":"In recent years, advancements in neural network designs and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work focused on high-performance offline transcription, neglecting deliberate consideration of model size. The goal of this work is to implement real-time piano transcription with a focus on achieving both high performance and a lightweight model. To this end, we propose novel architectures for convolutional recurrent neural networks, redesigning an existing autoregressive piano transcription model. First, we extend the acoustic module by adding a frequency-conditioned FiLM layer to the CNN module to adapt the convolutional filters on the frequency axis. Second, we improve note-state sequence modeling by using a pitchwise LSTM that focuses on note-state transitions within a note. In addition, we augment the autoregressive connection with an enhanced recursive context. Using these components, we propose two types of models; one for high performance and the other for high compactness. Through extensive experiments, we demonstrate that the proposed components are necessary for achieving high performance in an autoregressive model. Additionally, we provide experiments on real-time latency.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5106-5116"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning","authors":"Shuai Wang;Zhengyang Chen;Kong Aik Lee;Yanmin Qian;Haizhou Li","doi":"10.1109/TASLP.2024.3492793","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492793","url":null,"abstract":"Speaker individuality information is among the most critical elements within speech signals. By thoroughly and accurately modeling this information, it can be utilized in various intelligent speech applications, such as speaker recognition, speaker diarization, speech synthesis, and target speaker extraction. In this overview, we present a comprehensive review of neural approaches to speaker representation learning from both theoretical and practical perspectives. Theoretically, we discuss speaker encoders ranging from supervised to self-supervised learning algorithms, standalone models to large pretrained models, pure speaker embedding learning to joint optimization with downstream tasks, and efforts toward interpretability. Practically, we systematically examine approaches for robustness and effectiveness, introduce and compare various open-source toolkits in the field. Through the systematic and comprehensive review of the relevant literature, research activities, and resources, we provide a clear reference for researchers in the speaker characterization and modeling field, as well as for those who wish to apply speaker modeling techniques to specific downstream tasks.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4971-4998"},"PeriodicalIF":4.1,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLAPSep: Leveraging Contrastive Pre-Trained Model for Multi-Modal Query-Conditioned Target Sound Extraction","authors":"Hao Ma;Zhiyuan Peng;Xu Li;Mingjie Shao;Xixin Wu;Ju Liu","doi":"10.1109/TASLP.2024.3497586","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3497586","url":null,"abstract":"Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to make the randomly initialized model comprehend sound events and perform separation accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. To be specific, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep can not only enhance the extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin. Full codes and some audio examples are released for reproduction and evaluation.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4945-4960"},"PeriodicalIF":4.1,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"$mathcal {P}$owMix: A Versatile Regularizer for Multimodal Sentiment Analysis","authors":"Efthymios Georgiou;Yannis Avrithis;Alexandros Potamianos","doi":"10.1109/TASLP.2024.3496316","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3496316","url":null,"abstract":"Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to interpret the complex nature of human sentiments. Despite significant progress in multimodal architecture design, the field lacks comprehensive regularization methods. This paper introduces \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches and introduces novel algorithmic components that are specifically tailored to multimodal tasks. \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer. \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix consists of five components: 1) a varying number of generated mixed examples, 2) mixing factor reweighting, 3) anisotropic mixing, 4) dynamic mixing, and 5) cross-modal label mixing. Extensive experimentation across benchmark MSA datasets and a broad spectrum of diverse architectural designs demonstrate the efficacy of \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix, as evidenced by consistent performance improvements over baselines and existing mixing methods. An in-depth ablation study highlights the critical contribution of each \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix component and how they synergistically enhance performance. Furthermore, algorithmic analysis demonstrates how \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix behaves in different scenarios, particularly comparing early versus late fusion architectures. Notably, \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix enhances overall performance without sacrificing model robustness or magnifying text dominance. It also retains its strong performance in situations of limited data. Our findings position \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix as a promising versatile regularization strategy for MSA.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5010-5023"},"PeriodicalIF":4.1,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable-Complexity Steered Response Power Based on Low-Rank and Sparse Interpolation","authors":"Thomas Dietzen;Enzo De Sena;Toon van Waterschoot","doi":"10.1109/TASLP.2024.3496317","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3496317","url":null,"abstract":"The steered response power (SRP) is a popular approach to compute a map of the acoustic scene, typically used for acoustic source localization. The SRP map is obtained as the frequency-weighted output power of a beamformer steered towards a grid of candidate locations. Due to the exhaustive search over a fine grid at all frequency bins, conventional frequency domain-based SRP (conv. FD-SRP) results in a high computational complexity. Time domain-based SRP (conv. TD-SRP) implementations reduce computational complexity at the cost of accuracy using the inverse fast Fourier transform (iFFT). In this paper, to enable a more favourable complexity-performance trade-off as compared to conv. FD-SRP and conv. TD-SRP, we consider the problem of constructing a fine SRP map over the entire search space at scalable computational cost. We propose two approaches to this problem. Expressing the conv. FD-SRP map as a matrix transform of frequency-domain GCCs, we decompose the SRP matrix into a sampling matrix and an interpolation matrix. While sampling can be implemented by the iFFT, we propose to use optimal low-rank or sparse approximations of the interpolation matrix for complexity reduction. The proposed approaches, refered to as sampling + low-rank interpolation-based SRP (SLRI-SRP) and sampling + sparse interpolation-based SRP (SSPI-SRP), are evaluated in various localization scenarios with speech as source signals and compared to the state-of-the-art. The results indicate that SSPI-SRP performs better if large array apertures are used, while SLRI-SRP performs better at small array apertures or a large number of microphones. In comparison to conv. FD-SRP, two to three orders of magnitude of complexity reduction can achieved, often times enabling a more favourable complexity-performance trade-off as compared to conv. TD-SRP. A MATLAB implementation is available online.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5024-5039"},"PeriodicalIF":4.1,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}