IEEE/ACM Transactions on Audio, Speech, and Language Processing: Latest Articles, Page 2

Scalable-Complexity Steered Response Power Based on Low-Rank and Sparse Interpolation

IF 4.1, Zone 2 (Computer Science)

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-11-11 DOI: 10.1109/TASLP.2024.3496317

Thomas Dietzen;Enzo De Sena;Toon van Waterschoot

Abstract: The steered response power (SRP) is a popular approach to compute a map of the acoustic scene, typically used for acoustic source localization. The SRP map is obtained as the frequency-weighted output power of a beamformer steered towards a grid of candidate locations. Due to the exhaustive search over a fine grid at all frequency bins, conventional frequency domain-based SRP (conv. FD-SRP) results in a high computational complexity. Time domain-based SRP (conv. TD-SRP) implementations reduce computational complexity at the cost of accuracy using the inverse fast Fourier transform (iFFT). In this paper, to enable a more favourable complexity-performance trade-off as compared to conv. FD-SRP and conv. TD-SRP, we consider the problem of constructing a fine SRP map over the entire search space at scalable computational cost. We propose two approaches to this problem. Expressing the conv. FD-SRP map as a matrix transform of frequency-domain GCCs, we decompose the SRP matrix into a sampling matrix and an interpolation matrix. While sampling can be implemented by the iFFT, we propose to use optimal low-rank or sparse approximations of the interpolation matrix for complexity reduction. The proposed approaches, referred to as sampling + low-rank interpolation-based SRP (SLRI-SRP) and sampling + sparse interpolation-based SRP (SSPI-SRP), are evaluated in various localization scenarios with speech as source signals and compared to the state-of-the-art. The results indicate that SSPI-SRP performs better if large array apertures are used, while SLRI-SRP performs better at small array apertures or a large number of microphones. In comparison to conv. FD-SRP, two to three orders of magnitude of complexity reduction can be achieved, often enabling a more favourable complexity-performance trade-off as compared to conv. TD-SRP. A MATLAB implementation is available online.

Volume 32, pages 5024-5039. Publication type: Journal Article.

Citations: 0
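The abstract above describes replacing a dense interpolation matrix with a sparse approximation to cut the cost of building the SRP map. The following is only a minimal sketch of that general idea, not the authors' method: it assumes a simple per-row magnitude truncation (the paper's "optimal" sparse approximation is not specified in the abstract), and the function names `sparsify_rows`, `sparse_matvec`, and `dense_matvec`, as well as the toy matrix, are hypothetical.

```python
def sparsify_rows(matrix, keep):
    """Keep the `keep` largest-magnitude entries of each row as (column, value) pairs.

    Assumption: plain magnitude truncation, used here only for illustration.
    """
    sparse = []
    for row in matrix:
        top = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)[:keep]
        sparse.append([(j, row[j]) for j in sorted(top)])
    return sparse


def sparse_matvec(sparse, x):
    """Multiply the sparse row representation by vector x (keep multiplications per row)."""
    return [sum(v * x[j] for j, v in row) for row in sparse]


def dense_matvec(matrix, x):
    """Reference dense product (len(x) multiplications per row)."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


# Toy interpolation matrix (invented numbers): each fine-grid point is a
# weighted combination of a few coarse samples.
interp = [
    [0.90, 0.10, 0.00, 0.00],
    [0.05, 0.80, 0.15, 0.00],
    [0.00, 0.20, 0.75, 0.05],
]
samples = [1.0, 2.0, 3.0, 4.0]

exact = dense_matvec(interp, samples)
approx = sparse_matvec(sparsify_rows(interp, keep=2), samples)
```

With `keep=2`, each row costs two multiplications instead of four; the price is the contribution of the dropped small weights, which is the complexity-accuracy trade-off the abstract refers to.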

Towards Cross-Corpora Generalization for Low-Resource Spoken Language Identification

IF 4.1, Zone 2 (Computer Science)

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-11-08 DOI: 10.1109/TASLP.2024.3492807

Spandan Dey;Md Sahidullah;Goutam Saha

Abstract: Low-resource spoken language identification (LID) systems are prone to poor generalization across unknown domains. In this study, using multiple widely used low-resourced South Asian LID corpora, we conduct an in-depth analysis for understanding the key non-lingual bias factors that create corpora mismatch and degrade LID generalization. To quantify the biases, we extract different data-driven and rule-based summary vectors that capture non-lingual aspects, such as speaker characteristics, spoken context, accents or dialects, recording channels, background noise, and environments. We then conduct a statistical analysis to identify the most crucial non-lingual bias factors and corpora mismatch components that impact LID performance. Following these analyses, we propose effective bias compensation approaches for the most relevant summary vectors. We generate pseudo-labels using hierarchical clustering over language-domain-gender constrained summary vectors and use them to train adversarial networks with conditioned metric loss. The compensations learn invariance for the corpora mismatches due to the non-lingual biases and help to improve the generalization. With the proposed compensation method, we improve the equal error rate by up to 5.22% and 8.14% for the same-corpora and cross-corpora evaluations, respectively.

Volume 32, pages 5040-5050. Publication type: Journal Article.

Citations: 0
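The abstract above reports its gains in terms of equal error rate (EER), the operating point at which the false-acceptance and false-rejection rates coincide. As a rough illustration of how that metric can be computed from trial scores, here is a minimal threshold-sweep sketch. It is an assumption-laden simplification: evaluation toolkits typically interpolate the error curves, whereas this version only picks the observed threshold where the two rates are closest. The function name `equal_error_rate` and the scores are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) at the observed threshold where FAR and FRR are closest.

    A trial is accepted when its score is >= threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Invented scores: higher means "more likely the claimed language".
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

On these toy scores one target trial falls below the chosen threshold and one non-target trial falls at or above it, so both error rates are one in four.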

Enhancing Robustness of Speech Watermarking Using a Transformer-Based Framework Exploiting Acoustic Features

IF 4.1, Zone 2 (Computer Science)

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-11-08 DOI: 10.1109/TASLP.2024.3486206

Chuxuan Tong;Iynkaran Natgunanathan;Yong Xiang;Jianhua Li;Tianrui Zong;Xi Zheng;Longxiang Gao

Abstract: Digital watermarking serves as an effective approach for safeguarding speech signal copyrights, achieved by the incorporation of ownership information into the original signal and its subsequent extraction from the watermarked signal. While traditional watermarking methods can embed and extract watermarks successfully when the watermarked signals are not exposed to severe alterations, these methods cannot withstand attacks such as de-synchronization. In this work, we introduce a novel transformer-based framework designed to enhance the imperceptibility and robustness of speech watermarking. This framework incorporates encoders and decoders built on multi-scale transformer blocks to effectively capture local and long-range features from inputs, such as acoustic features extracted by Short-Time Fourier Transformation (STFT). Further, a deep neural network (DNN)-based generator, notably the Transformer architecture, is employed to adaptively embed imperceptible watermarks. These perturbations serve as a step for simulating noise, thereby bolstering the watermark robustness during the training phase. Experimental results show the superiority of our proposed framework in terms of watermark imperceptibility and robustness against various watermark attacks. When compared to the currently available related techniques, the framework exhibits an eightfold increase in embedding rate. Further, it also presents superior practicality with scalability and reduced inference time of DNN models.

Volume 32, pages 4822-4837. Publication type: Journal Article.

Citations: 0

FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection

IF 4.1, Zone 2 (Computer Science)

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-11-07 DOI: 10.1109/TASLP.2024.3492796

Bo Wang;Yeling Tang;Fei Wei;Zhongjie Ba;Kui Ren

Abstract: In recent years, the field of audio deepfake detection has witnessed significant advancements. Nonetheless, the majority of solutions have concentrated on high-quality audio, largely overlooking the challenge of low-quality compressed audio in real-world scenarios. Low-quality compressed audio typically suffers from a loss of high-frequency details and time-domain information, which significantly undermines the performance of advanced deepfake detection systems when confronted with such data. In this paper, we introduce a deepfake detection model that employs knowledge distillation across the frequency and time domains. Our approach aims to train a teacher model with high-quality data and a student model with low-quality compressed data. Subsequently, we implement frequency-domain and time-domain distillation to facilitate the student model's learning of high-frequency information and time-domain details from the teacher model. Experimental evaluations on the ASVspoof 2019 LA and ASVspoof 2021 DF datasets illustrate the effectiveness of our methodology. On the ASVspoof 2021 DF dataset, which consists of low-quality compressed audio, we achieved an Equal Error Rate (EER) of 2.82%. To our knowledge, this performance is the best among all deepfake voice detection systems tested on the ASVspoof 2021 DF dataset. Additionally, our method proves to be versatile, showing notable performance on high-quality data with an EER of 0.30% on the ASVspoof 2019 LA dataset, closely approaching state-of-the-art results.

Volume 32, pages 4905-4918. Publication type: Journal Article.

Citations: 0
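The FTDKD abstract above relies on a teacher-student setup in which a student trained on degraded audio learns from a teacher trained on clean audio. The abstract does not give the frequency-domain or time-domain loss terms, so the sketch below does not attempt to reproduce them. It shows the classic logit-based distillation loss instead (temperature-softened KL divergence), purely to make the teacher-student idea concrete. The function names and the logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared.

    Assumption: this is the generic soft-label formulation, not the paper's
    frequency-domain or time-domain distillation terms.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Invented two-class logits (bona fide vs. spoof).
teacher = [3.0, -1.0]
matched = distillation_loss([3.0, -1.0], teacher)
mismatched = distillation_loss([0.5, 0.2], teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal used to pull the student towards the teacher.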

ELSF: Entity-Level Slot Filling Framework for Joint Multiple Intent Detection and Slot Filling

IF 4.1, Zone 2 (Computer Science)

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-11-07 DOI: 10.1109/TASLP.2024.3492800

Zhanbiao Zhu;Peijie Huang;Haojing Huang;Yuhong Xu;Piyuan Lin;Leyi Lao;Shaoshen Chen;Haojie Xie;Shangjian Yin

Citations: 0

Proper Error Estimation and Calibration for Attention-Based Encoder-Decoder Models

IF 4.1, Zone 2 (Computer Science)

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-11-06 DOI: 10.1109/TASLP.2024.3492799

Mun-Hak Lee;Joon-Hyuk Chang

Citations: 0

TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

IF 4.1, Zone 2 (Computer Science)

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-11-06 DOI: 10.1109/TASLP.2024.3492803

Vahid Ahmadi Kalkhorani;DeLiang Wang

Citations: 0

FlowHash: Accelerating Audio Search With Balanced Hashing via Normalizing Flow

IF 4.1, Zone 2 (Computer Science)

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-11-04 DOI: 10.1109/TASLP.2024.3486227

Anup Singh;Kris Demuynck;Vipul Arora

Abstract: Nearest neighbor search on context representation vectors is a formidable task due to challenges posed by high dimensionality, scalability issues, and potential noise within query vectors. Our novel approach leverages normalizing flow within a self-supervised learning framework to effectively tackle these challenges, specifically in the context of audio fingerprinting tasks. Audio fingerprinting systems incorporate two key components: audio encoding and indexing. The existing systems consider these components independently, resulting in suboptimal performance. Our approach optimizes the interplay between these components, facilitating the adaptation of vectors to the indexing structure. Additionally, we distribute vectors in the latent $\mathbb{R}^{K}$ space using normalizing flow, resulting in balanced $K$-bit hash codes. This allows indexing vectors using a balanced hash table, where vectors are uniformly distributed across all possible $2^{K}$ hash buckets. This significantly accelerates retrieval, achieving speedups of up to 2× and 1.4× compared to the Locality-Sensitive Hashing (LSH) and Product Quantization (PQ), respectively. We empirically demonstrate that our system is scalable, highly effective, and efficient in identifying short audio queries ($\leq$ 2 s), particularly at high noise and reverberation levels.

Volume 32, pages 4961-4970. Publication type: Journal Article.

Citations: 0
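The FlowHash abstract above describes mapping latent vectors in $\mathbb{R}^{K}$ to $K$-bit hash codes and indexing them in a table with $2^{K}$ buckets. The sketch below illustrates only the indexing side. It assumes each bit is obtained by thresholding a coordinate at zero, which the abstract does not state, and it leaves out the normalizing flow that is responsible for balancing the buckets. The names `hash_code`, `build_table`, and `lookup` and the vectors are hypothetical.

```python
from collections import defaultdict


def hash_code(vector):
    """Map a K-dimensional real vector to an integer in [0, 2**K).

    Assumption: one bit per coordinate, set when the coordinate is >= 0.
    """
    code = 0
    for x in vector:
        code = (code << 1) | (1 if x >= 0 else 0)
    return code


def build_table(vectors):
    """Group vector indices by their hash bucket."""
    table = defaultdict(list)
    for idx, vec in enumerate(vectors):
        table[hash_code(vec)].append(idx)
    return table


def lookup(table, query):
    """Return candidate indices sharing the query's bucket (empty list if none)."""
    return table.get(hash_code(query), [])


# Invented K = 3 latent vectors standing in for audio fingerprints.
database = [
    [0.5, -1.2, 0.3],
    [-0.1, 0.4, 0.9],
    [0.7, -0.3, 0.2],
]
table = build_table(database)
candidates = lookup(table, [0.9, -0.1, 0.1])
```

A lookup touches a single bucket, so retrieval time depends on bucket occupancy. That is why an even spread of vectors over all $2^{K}$ buckets, as the abstract emphasizes, matters for search speed.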

Understanding and Mitigating the Uncertainty in Zero-Shot Translation

IF 4.1, Zone 2 (Computer Science)

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-31 DOI: 10.1109/TASLP.2024.3485555

Wenxuan Wang;Wenxiang Jiao;Shuo Wang;Zhaopeng Tu;Michael R. Lyu

Citations: 0

MRC-PASCL: A Few-Shot Machine Reading Comprehension Approach via Post-Training and Answer Span-Oriented Contrastive Learning

IF 4.1, Zone 2 (Computer Science)

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-31 DOI: 10.1109/TASLP.2024.3490373

Ren Li;Qiao Xiao;Jianxi Yang;Luyi Zhang;Yu Chen

Abstract: The rapid development of pre-trained language models (PLMs) has significantly enhanced the performance of machine reading comprehension (MRC). Nevertheless, the traditional fine-tuning approaches necessitate extensive labeled data. MRC remains a challenging task in the few-shot settings or low-resource scenarios. This study proposes a novel few-shot MRC approach via post-training and answer span-oriented contrastive learning, termed MRC-PASCL. Specifically, in the post-training module, a novel noun-entity-aware data selection and generation strategy is proposed according to characteristics of MRC task and data, focusing on masking nouns and named entities in the context. In terms of fine-tuning, the proposed answer span-oriented contrastive learning method selects spans around the gold answers as negative examples, and performs multi-task learning together with the standard MRC answer prediction task. Experimental results show that MRC-PASCL outperforms the PLM-based baseline models and the 7B and 13B large language models (LLMs) across most MRQA 2019 datasets. Further analyses show that our approach achieves better inference efficiency with lower computational resource requirement. The analysis results also indicate that the proposed method can better adapt to the domain-specific scenarios.

Volume 32, pages 4838-4849. Publication type: Journal Article.

Citations: 0
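The MRC-PASCL abstract above states that spans around the gold answer are used as negative examples for contrastive learning. The abstract does not say how those spans are chosen, so the following is only one plausible sketch: it enumerates spans whose start and end are shifted by a few tokens from the gold boundaries. The function name `negative_spans` and the `max_shift` parameter are hypothetical.

```python
def negative_spans(gold_start, gold_end, context_len, max_shift=2):
    """Enumerate spans near the gold span, excluding the gold span itself.

    Spans are (start, end) pairs of inclusive token indices.
    Assumption: negatives come from shifting each boundary by up to
    max_shift tokens; the selection rule in the paper may differ.
    """
    spans = []
    for start_shift in range(-max_shift, max_shift + 1):
        for end_shift in range(-max_shift, max_shift + 1):
            if start_shift == 0 and end_shift == 0:
                continue  # this is the gold answer, i.e. the positive example
            start = gold_start + start_shift
            end = gold_end + end_shift
            # Keep only well-formed spans that stay inside the context.
            if 0 <= start <= end < context_len:
                spans.append((start, end))
    return spans


# Gold answer covers tokens 3..4 in a 10-token context.
negatives = negative_spans(3, 4, context_len=10, max_shift=1)
```

Negatives of this kind overlap heavily with the correct answer, so they are harder to tell apart from it than randomly sampled spans, which is presumably why boundary-adjacent spans are attractive as contrastive negatives.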