TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

IF 5.1, Zone 2 (Computer Science), Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-11-06 DOI:10.1109/TASLP.2024.3492803

Vahid Ahmadi Kalkhorani; DeLiang Wang

Citations: 0

Abstract

We introduce TF-CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. TF-CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of TF-CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, TF-CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, TF-CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.
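The abstract names a random chunk positional encoding but does not describe how it works. The following is a minimal sketch of one plausible reading, not the authors' implementation: a fixed table of sinusoidal encodings is precomputed for a long maximum length, a randomly positioned contiguous chunk of it is used during training, and the first chunk is used at inference. The function names, the sinusoidal form, the table size, and the train/inference switch are all assumptions made for illustration.

```python
import math
import random


def sinusoidal_table(max_len, dim):
    """Standard sinusoidal positional encodings as a max_len x dim list of lists."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(dim):
            # Each sin/cos channel pair shares one frequency.
            freq = 1.0 / (10000 ** (2 * (i // 2) / dim))
            angle = pos * freq
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def random_chunk_pe(table, seq_len, training, rng=random):
    """Return (start, rows) for seq_len consecutive rows of the table.

    Assumption: the start offset is random in training and 0 at inference.
    """
    max_len = len(table)
    if seq_len > max_len:
        raise ValueError("sequence is longer than the positional table")
    start = rng.randint(0, max_len - seq_len) if training else 0
    return start, table[start:start + seq_len]


# Hypothetical sizes: a 2000-frame table, 8 channels, 100-frame training clips.
table = sinusoidal_table(max_len=2000, dim=8)
start, chunk = random_chunk_pe(table, seq_len=100, training=True,
                               rng=random.Random(0))
```

Under this reading, short training clips still expose the model to encodings from late positions in the table, which is one way such a scheme could help with long utterances at test time.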

Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (categories: ACOUSTICS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 11.30

Self-citation rate: 11.10%

Publication volume: 217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.