{"title":"Towards Cross-Corpora Generalization for Low-Resource Spoken Language Identification","authors":"Spandan Dey;Md Sahidullah;Goutam Saha","doi":"10.1109/TASLP.2024.3492807","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492807","url":null,"abstract":"Low-resource spoken language identification (LID) systems are prone to poor generalization across unknown domains. In this study, using multiple widely used low-resourced South Asian LID corpora, we conduct an in-depth analysis for understanding the key non-lingual bias factors that create corpora mismatch and degrade LID generalization. To quantify the biases, we extract different data-driven and rule-based summary vectors that capture non-lingual aspects, such as speaker characteristics, spoken context, accents or dialects, recording channels, background noise, and environments. We then conduct a statistical analysis to identify the most crucial non-lingual bias factors and corpora mismatch components that impact LID performance. Following these analyses, we then propose effective bias compensation approaches for the most relevant summary vectors. We generate pseudo-labels using hierarchical clustering over language-domain-gender constrained summary vectors and use them to train adversarial networks with conditioned metric loss. The compensations learn invariance for the corpora mismatches due to the non-lingual biases and help to improve the generalization. With the proposed compensation method, we improve equal error rate up to 5.22% and 8.14% for the same-corpora and cross-corpora evaluations, respectively.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5040-5050"},"PeriodicalIF":4.1,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Robustness of Speech Watermarking Using a Transformer-Based Framework Exploiting Acoustic Features","authors":"Chuxuan Tong;Iynkaran Natgunanathan;Yong Xiang;Jianhua Li;Tianrui Zong;Xi Zheng;Longxiang Gao","doi":"10.1109/TASLP.2024.3486206","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3486206","url":null,"abstract":"Digital watermarking serves as an effective approach for safeguarding speech signal copyrights, achieved by the incorporation of ownership information into the original signal and its subsequent extraction from the watermarked signal. While traditional watermarking methods can embed and extract watermarks successfully when the watermarked signals are not exposed to severe alterations, these methods cannot withstand attacks such as de-synchronization. In this work, we introduce a novel transformer-based framework designed to enhance the imperceptibility and robustness of speech watermarking. This framework incorporates encoders and decoders built on multi-scale transformer blocks to effectively capture local and long-range features from inputs, such as acoustic features extracted by Short-Time Fourier Transformation (STFT). Further, a deep neural networks (DNNs) based generator, notably the Transformer architecture, is employed to adaptively embed imperceptible watermarks. These perturbations serve as a step for simulating noise, thereby bolstering the watermark robustness during the training phase. Experimental results show the superiority of our proposed framework in terms of watermark imperceptibility and robustness against various watermark attacks. When compared to the currently available related techniques, the framework exhibits an eightfold increase in embedding rate. Further, it also presents superior practicality with scalability and reduced inference time of DNN models.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4822-4837"},"PeriodicalIF":4.1,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142645535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection","authors":"Bo Wang;Yeling Tang;Fei Wei;Zhongjie Ba;Kui Ren","doi":"10.1109/TASLP.2024.3492796","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492796","url":null,"abstract":"In recent years, the field of audio deepfake detection has witnessed significant advancements. Nonetheless, the majority of solutions have concentrated on high-quality audio, largely overlooking the challenge of low-quality compressed audio in real-world scenarios. Low-quality compressed audio typically suffers from a loss of high-frequency details and time-domain information, which significantly undermines the performance of advanced deepfake detection systems when confronted with such data. In this paper, we introduce a deepfake detection model that employs knowledge distillation across the frequency and time domains. Our approach aims to train a teacher model with high-quality data and a student model with low-quality compressed data. Subsequently, we implement frequency-domain and time-domain distillation to facilitate the student model's learning of high-frequency information and time-domain details from the teacher model. Experimental evaluations on the ASVspoof 2019 LA and ASVspoof 2021 DF datasets illustrate the effectiveness of our methodology. On the ASVspoof 2021 DF dataset, which consists of low-quality compressed audio, we achieved an Equal Error Rate (EER) of 2.82%. To our knowledge, this performance is the best among all deepfake voice detection systems tested on the ASVspoof 2021 DF dataset. Additionally, our method proves to be versatile, showing notable performance on high-quality data with an EER of 0.30% on the ASVspoof 2019 LA dataset, closely approaching state-of-the-art results.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4905-4918"},"PeriodicalIF":4.1,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ELSF: Entity-Level Slot Filling Framework for Joint Multiple Intent Detection and Slot Filling","authors":"Zhanbiao Zhu;Peijie Huang;Haojing Huang;Yuhong Xu;Piyuan Lin;Leyi Lao;Shaoshen Chen;Haojie Xie;Shangjian Yin","doi":"10.1109/TASLP.2024.3492800","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492800","url":null,"abstract":"Multi-intent spoken language understanding (SLU) that can handle multiple intents in an utterance has attracted increasing attention. Previous studies treat the slot filling task as a token-level sequence labeling task, which results in a lack of entity-related information. In our paper, we propose an \u0000<bold>E</b>\u0000ntity-\u0000<bold>L</b>\u0000evel \u0000<bold>S</b>\u0000lot \u0000<bold>F</b>\u0000illing (ELSF) framework for joint multiple intent detection and slot filling. In our framework, two entity-oriented auxiliary tasks, entity boundary detection and entity type assignment, are introduced as the regularization to capture the entity boundary and the context of type, respectively. Besides, to better utilize the entity interaction, we design an effective entity-level coordination mechanism for modeling the interaction in both entity-entity and intent-entity relationships. Experiments on five datasets demonstrate the effectiveness and generalizability of our ELSF.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4880-4893"},"PeriodicalIF":4.1,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proper Error Estimation and Calibration for Attention-Based Encoder-Decoder Models","authors":"Mun-Hak Lee;Joon-Hyuk Chang","doi":"10.1109/TASLP.2024.3492799","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492799","url":null,"abstract":"An attention-based automatic speech recognition (ASR) model generates a probability distribution of the tokens set at each time step. Recent studies have shown that calibration errors exist in the output probability distributions of attention-based ASR models trained to minimize the negative log likelihood. This study analyzes the causes of calibration errors in ASR model outputs and their impact on model performance. Based on the analysis, we argue that conventional methods for estimating calibration errors at the token level are unsuitable for ASR tasks. Accordingly, we propose a new calibration measure that estimates the calibration error at the sequence level. Moreover, we present a new post-hoc calibration function and training objective to mitigate the calibration error of the ASR model at the sequence level. Through experiments using the ASR benchmark, we show that the proposed methods effectively alleviate the calibration error of the ASR model and improve the generalization performance.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4919-4930"},"PeriodicalIF":4.1,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation","authors":"Vahid Ahmadi Kalkhorani;DeLiang Wang","doi":"10.1109/TASLP.2024.3492803","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3492803","url":null,"abstract":"We introduce TF-CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. TF-CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of TF-CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, TF-CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, TF-CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4999-5009"},"PeriodicalIF":4.1,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlowHash: Accelerating Audio Search With Balanced Hashing via Normalizing Flow","authors":"Anup Singh;Kris Demuynck;Vipul Arora","doi":"10.1109/TASLP.2024.3486227","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3486227","url":null,"abstract":"Nearest neighbor search on context representation vectors is a formidable task due to challenges posed by high dimensionality, scalability issues, and potential noise within query vectors. Our novel approach leverages normalizing flow within a self-supervised learning framework to effectively tackle these challenges, specifically in the context of audio fingerprinting tasks. Audio fingerprinting systems incorporate two key components: audio encoding and indexing. The existing systems consider these components independently, resulting in suboptimal performance. Our approach optimizes the interplay between these components, facilitating the adaptation of vectors to the indexing structure. Additionally, we distribute vectors in the latent \u0000<inline-formula><tex-math>$mathbb {R}^{K}$</tex-math></inline-formula>\u0000 space using normalizing flow, resulting in balanced \u0000<inline-formula><tex-math>$K$</tex-math></inline-formula>\u0000-bit hash codes. This allows indexing vectors using a balanced hash table, where vectors are uniformly distributed across all possible \u0000<inline-formula><tex-math>$2^{K}$</tex-math></inline-formula>\u0000 hash buckets. This significantly accelerates retrieval, achieving speedups of up to 2× and 1.4× compared to the Locality-Sensitive Hashing (LSH) and Product Quantization (PQ), respectively. We empirically demonstrate that our system is scalable, highly effective, and efficient in identifying short audio queries (\u0000<inline-formula><tex-math>$leq$</tex-math></inline-formula>\u00002 s), particularly at high noise and reverberation levels.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4961-4970"},"PeriodicalIF":4.1,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding and Mitigating the Uncertainty in Zero-Shot Translation","authors":"Wenxuan Wang;Wenxiang Jiao;Shuo Wang;Zhaopeng Tu;Michael R. Lyu","doi":"10.1109/TASLP.2024.3485555","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485555","url":null,"abstract":"Zero-shottranslation is a promising direction for building a comprehensive multilingual neural machine translation (MNMT) system. However, its quality is still not satisfactory due to off-target issues. In this paper, we aim to understand and alleviate the off-target issues from the perspective of uncertainty in zero-shot translation. By carefully examining the translation output and model confidence, we identify two uncertainties that are responsible for the off-target issues, namely, extrinsic data uncertainty and intrinsic model uncertainty. Based on the observations, we propose two lightweight and complementary approaches to denoise the training data for model training and explicitly penalize the off-target translations by unlikelihood training during model training. Extensive experiments on both balanced and imbalanced datasets show that our approaches significantly improve the performance of zero-shot translation over strong MNMT baselines.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4894-4904"},"PeriodicalIF":4.1,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MRC-PASCL: A Few-Shot Machine Reading Comprehension Approach via Post-Training and Answer Span-Oriented Contrastive Learning","authors":"Ren Li;Qiao Xiao;Jianxi Yang;Luyi Zhang;Yu Chen","doi":"10.1109/TASLP.2024.3490373","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3490373","url":null,"abstract":"The rapid development of pre-trained language models (PLMs) has significantly enhanced the performance of machine reading comprehension (MRC). Nevertheless, the traditional fine-tuning approaches necessitate extensive labeled data. MRC remains a challenging task in the few-shot settings or low-resource scenarios. This study proposes a novel few-shot MRC approach via post-training and answer span-oriented contrastive learning, termed MRC-PASCL. Specifically, in the post-training module, a novel noun-entity-aware data selection and generation strategy is proposed according to characteristics of MRC task and data, focusing on masking nouns and named entities in the context. In terms of fine-tuning, the proposed answer span-oriented contrastive learning manner selects spans around the golden answers as negative examples, and performs multi-task learning together with the standard MRC answer prediction task. Experimental results show that MRC-PASCL outperforms the PLMs-based baseline models and the 7B and 13B large language models (LLMs) cross most MRQA 2019 datasets. Further analyses show that our approach achieves better inference efficiency with lower computational resource requirement. The analysis results also indicate that the proposed method can better adapt to the domain-specific scenarios.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4838-4849"},"PeriodicalIF":4.1,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142645505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FxLMS/F Based Tap Decomposed Adaptive Filter for Decentralized Active Noise Control System","authors":"Munukutla L. N. Srinivas Karthik;Joel S.;Nithin V. George","doi":"10.1109/TASLP.2024.3473294","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473294","url":null,"abstract":"Decentralized systems are appealing due to their reduced complexity and flexibility. A class of decentralized multi-channel active noise control (MCANC) systems has been developed in this paper. In the first part of the study, a modified filtered-x least mean square/fourth (FxLMS/F) algorithm, which offers improved noise reduction performance over the conventional FxLMS/F algorithm, was developed for MCANC. Further, to reduce the computational complexity of the proposed MCANC system, a nearest Kronecker product (NKP) decomposition strategy has been incorporated to develop decentralized versions of FxLMS/F algorithms. The proposed algorithms have been shown to offer enhanced noise reduction at reduced computational complexity when applied for noise control for narrowband noise, bandlimited white noise, traffic noise and wind noise.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4691-4699"},"PeriodicalIF":4.1,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}