{"title":"IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization","authors":"Yabo Wang;Bing Yang;Xiaofei Li","doi":"10.1109/TASLP.2024.3507560","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507560","url":null,"abstract":"Extracting direct-path spatial feature is crucial for sound source localization in adverse acoustic environments. This paper proposes IPDnet, a neural network that estimates direct-path inter-channel phase difference (DP-IPD) of sound sources from microphone array signals. The estimated DP-IPD can be easily translated to source location based on the known microphone array geometry. First, a full-band and narrow-band fusion network is adopted for DP-IPD estimation, in which combined narrow-band and full-band layers are responsible for estimating the raw DP-IPD information in one frequency band and capturing the frequency correlations of DP-IPD, respectively. Second, a new multi-track DP-IPD learning target is proposed for the localization of a flexible number of sound sources. Third, the network is extended to handle variable microphone arrays. This version of IPDnet is trained with a large set of different microphone arrays, and then it is able to infer the source locations using new microphone arrays not seen at training time. Experiments with multiple number of moving speakers are conducted on both simulated and real-world data, which show that the full-band and narrow-band fusion network and the proposed multi-track DP-IPD learning target together achieve excellent sound source localization performance. Moreover, the proposed variable-array model generalizes well to unseen microphone arrays.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5051-5064"},"PeriodicalIF":4.1,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach","authors":"Eloi Moliner;Filip Elvander;Vesa Välimäki","doi":"10.1109/TASLP.2024.3507566","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507566","url":null,"abstract":"Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. This paper introduces a novel method called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem in a zero-shot setting, leveraging the generative priors of a pre-trained unconditional diffusion model. During the inference process, BABE utilizes a generalized version of diffusion posterior sampling, where the degradation operator is unknown but parametrized and inferred iteratively. The performance of the proposed method is evaluated using objective and subjective metrics, and the results show that BABE surpasses state-of-the-art blind bandwidth extension baselines and achieves competitive performance compared to informed methods when tested with synthetic data. Moreover, BABE exhibits robust generalization capabilities when enhancing real historical recordings, effectively reconstructing the missing high-frequency content while maintaining coherence with the original recording. Subjective preference tests confirm that BABE significantly improves the audio quality of historical music recordings. Examples of historical recordings restored with the proposed method are available on the companion webpage: \u0000<uri>http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/</uri>","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5092-5105"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10768977","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142798022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation","authors":"Sufeng Duan;Hai Zhao","doi":"10.1109/TASLP.2024.3507556","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507556","url":null,"abstract":"In this paper, we propose an explanation of representation for self-attention network (SAN) based neural sequence encoders, which regards the information captured by the model and the encoding of the model as graph structure and the generation of these graph structures respectively. The proposed explanation applies to existing works on SAN-based models and can explain the relationship among the ability to capture the structural or linguistic information, depth of model, and length of sentence, and can also be extended to other models such as recurrent neural network based models. We also propose a revisited multigraph called Multi-order-Graph (MoG) based on our explanation to model the graph structures in the SAN-based model as subgraphs in MoG and convert the encoding of the SAN-based model to the generation of MoG. Based on our explanation, we further introduce an MO-Transformer by enhancing the ability to capture multiple subgraphs of different orders and focusing on subgraphs of high orders. Experimental results on multiple neural machine translation tasks show that the MO-Transformer can yield effective performance improvement.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5065-5077"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Neural Speaker Diarization With Target Speaker Tracking","authors":"Weiqing Wang;Ming Li","doi":"10.1109/TASLP.2024.3507559","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507559","url":null,"abstract":"This paper proposes an online target speaker voice activity detection (TS-VAD) system for speaker diarization tasks that does not rely on prior knowledge from clustering-based diarization systems to obtain target speaker embeddings. By adapting conventional TS-VAD for real-time operation, our framework identifies speaker activities using self-generated embeddings, ensuring consistent performance and avoiding permutation inconsistencies during inference. In the inference phase, we employ a front-end model to extract frame-level speaker embeddings for each incoming signal block. Subsequently, we predict each speaker's detection state based on these frame-level embeddings and the previously estimated target speaker embeddings. The target speaker embeddings are then updated by aggregating the frame-level embeddings according to the current block's predictions. Our model predicts results block-by-block and iteratively updates target speaker embeddings until reaching the end of the signal. Experimental results demonstrate that the proposed method outperforms offline clustering-based diarization systems on the DIHARD III and AliMeeting datasets. Additionally, this approach is extended to multi-channel data, achieving comparable performance to state-of-the-art offline diarization systems.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5078-5091"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Interpretable Deep Mutual Information Curriculum Metric for a Robust and Generalized Speech Emotion Recognition System","authors":"Wei-Cheng Lin;Kusha Sridhar;Carlos Busso","doi":"10.1109/TASLP.2024.3507562","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507562","url":null,"abstract":"It is difficult to achieve robust and well-generalized models for tasks involving subjective concepts such as emotion. It is inevitable to deal with noisy labels, given the ambiguous nature of human perception. Methodologies relying on \u0000<italic>semi-supervised learning</i>\u0000 (SSL) and curriculum learning have been proposed to enhance the generalization of the models. This study proposes a novel \u0000<italic>deep mutual information</i>\u0000 (DeepMI) metric, built with the SSL pre-trained DeepEmoCluster framework to establish the difficulty of samples. The DeepMI metric quantifies the relationship between the acoustic patterns and emotional attributes (e.g., arousal, valence, and dominance). The DeepMI metric provides a better curriculum, achieving state-of-the-art performance that is higher than results obtained with existing curriculum metrics for \u0000<italic>speech emotion recognition</i>\u0000 (SER). We evaluate the proposed method with three emotional datasets in matched and mismatched testing conditions. The experimental evaluations systematically show that a model trained with the DeepMI metric not only obtains competitive generalization performances, but also maintains convergence stability. Furthermore, the extracted DeepMI values are highly interpretable, reflecting information ranks of the training samples.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5117-5130"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10768985","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models","authors":"Taegyun Kwon;Dasaem Jeong;Juhan Nam","doi":"10.1109/TASLP.2024.3507568","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3507568","url":null,"abstract":"In recent years, advancements in neural network designs and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work focused on high-performance offline transcription, neglecting deliberate consideration of model size. The goal of this work is to implement real-time piano transcription with a focus on achieving both high performance and a lightweight model. To this end, we propose novel architectures for convolutional recurrent neural networks, redesigning an existing autoregressive piano transcription model. First, we extend the acoustic module by adding a frequency-conditioned FiLM layer to the CNN module to adapt the convolutional filters on the frequency axis. Second, we improve note-state sequence modeling by using a pitchwise LSTM that focuses on note-state transitions within a note. In addition, we augment the autoregressive connection with an enhanced recursive context. Using these components, we propose two types of models; one for high performance and the other for high compactness. Through extensive experiments, we demonstrate that the proposed components are necessary for achieving high performance in an autoregressive model. Additionally, we provide experiments on real-time latency.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5106-5116"},"PeriodicalIF":4.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning","authors":"Shuai Wang;Zhengyang Chen;Kong Aik Lee;Yanmin Qian;Haizhou Li","doi":"10.1109/TASLP.2024.3492793","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492793","url":null,"abstract":"Speaker individuality information is among the most critical elements within speech signals. By thoroughly and accurately modeling this information, it can be utilized in various intelligent speech applications, such as speaker recognition, speaker diarization, speech synthesis, and target speaker extraction. In this overview, we present a comprehensive review of neural approaches to speaker representation learning from both theoretical and practical perspectives. Theoretically, we discuss speaker encoders ranging from supervised to self-supervised learning algorithms, standalone models to large pretrained models, pure speaker embedding learning to joint optimization with downstream tasks, and efforts toward interpretability. Practically, we systematically examine approaches for robustness and effectiveness, introduce and compare various open-source toolkits in the field. Through the systematic and comprehensive review of the relevant literature, research activities, and resources, we provide a clear reference for researchers in the speaker characterization and modeling field, as well as for those who wish to apply speaker modeling techniques to specific downstream tasks.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4971-4998"},"PeriodicalIF":4.1,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLAPSep: Leveraging Contrastive Pre-Trained Model for Multi-Modal Query-Conditioned Target Sound Extraction","authors":"Hao Ma;Zhiyuan Peng;Xu Li;Mingjie Shao;Xixin Wu;Ju Liu","doi":"10.1109/TASLP.2024.3497586","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3497586","url":null,"abstract":"Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to make the randomly initialized model comprehend sound events and perform separation accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. To be specific, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep can not only enhance the extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin. Full codes and some audio examples are released for reproduction and evaluation.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4945-4960"},"PeriodicalIF":4.1,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"$mathcal {P}$owMix: A Versatile Regularizer for Multimodal Sentiment Analysis","authors":"Efthymios Georgiou;Yannis Avrithis;Alexandros Potamianos","doi":"10.1109/TASLP.2024.3496316","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3496316","url":null,"abstract":"Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to interpret the complex nature of human sentiments. Despite significant progress in multimodal architecture design, the field lacks comprehensive regularization methods. This paper introduces \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches and introduces novel algorithmic components that are specifically tailored to multimodal tasks. \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer. \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix consists of five components: 1) a varying number of generated mixed examples, 2) mixing factor reweighting, 3) anisotropic mixing, 4) dynamic mixing, and 5) cross-modal label mixing. Extensive experimentation across benchmark MSA datasets and a broad spectrum of diverse architectural designs demonstrate the efficacy of \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix, as evidenced by consistent performance improvements over baselines and existing mixing methods. An in-depth ablation study highlights the critical contribution of each \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix component and how they synergistically enhance performance. Furthermore, algorithmic analysis demonstrates how \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix behaves in different scenarios, particularly comparing early versus late fusion architectures. Notably, \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix enhances overall performance without sacrificing model robustness or magnifying text dominance. It also retains its strong performance in situations of limited data. Our findings position \u0000<inline-formula><tex-math>$mathcal {P}$</tex-math></inline-formula>\u0000owMix as a promising versatile regularization strategy for MSA.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5010-5023"},"PeriodicalIF":4.1,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable-Complexity Steered Response Power Based on Low-Rank and Sparse Interpolation","authors":"Thomas Dietzen;Enzo De Sena;Toon van Waterschoot","doi":"10.1109/TASLP.2024.3496317","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3496317","url":null,"abstract":"The steered response power (SRP) is a popular approach to compute a map of the acoustic scene, typically used for acoustic source localization. The SRP map is obtained as the frequency-weighted output power of a beamformer steered towards a grid of candidate locations. Due to the exhaustive search over a fine grid at all frequency bins, conventional frequency domain-based SRP (conv. FD-SRP) results in a high computational complexity. Time domain-based SRP (conv. TD-SRP) implementations reduce computational complexity at the cost of accuracy using the inverse fast Fourier transform (iFFT). In this paper, to enable a more favourable complexity-performance trade-off as compared to conv. FD-SRP and conv. TD-SRP, we consider the problem of constructing a fine SRP map over the entire search space at scalable computational cost. We propose two approaches to this problem. Expressing the conv. FD-SRP map as a matrix transform of frequency-domain GCCs, we decompose the SRP matrix into a sampling matrix and an interpolation matrix. While sampling can be implemented by the iFFT, we propose to use optimal low-rank or sparse approximations of the interpolation matrix for complexity reduction. The proposed approaches, refered to as sampling + low-rank interpolation-based SRP (SLRI-SRP) and sampling + sparse interpolation-based SRP (SSPI-SRP), are evaluated in various localization scenarios with speech as source signals and compared to the state-of-the-art. The results indicate that SSPI-SRP performs better if large array apertures are used, while SLRI-SRP performs better at small array apertures or a large number of microphones. In comparison to conv. FD-SRP, two to three orders of magnitude of complexity reduction can achieved, often times enabling a more favourable complexity-performance trade-off as compared to conv. TD-SRP. A MATLAB implementation is available online.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5024-5039"},"PeriodicalIF":4.1,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}