TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

IF 4.1 · Tier 2 (Computer Science) · JCR Q1 (ACOUSTICS)
Vahid Ahmadi Kalkhorani;DeLiang Wang
{"title":"TF-CrossNet:利用全局、跨带、窄带和位置编码实现单声道和多声道扬声器分离","authors":"Vahid Ahmadi Kalkhorani;DeLiang Wang","doi":"10.1109/TASLP.2024.3492803","DOIUrl":null,"url":null,"abstract":"We introduce TF-CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. TF-CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of TF-CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, TF-CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, TF-CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4999-5009"},"PeriodicalIF":4.1000,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation\",\"authors\":\"Vahid Ahmadi Kalkhorani;DeLiang Wang\",\"doi\":\"10.1109/TASLP.2024.3492803\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We introduce TF-CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. TF-CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of TF-CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, TF-CrossNet exhibits faster and more stable training in comparison to recent baselines. 
Additionally, TF-CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.\",\"PeriodicalId\":13332,\"journal\":{\"name\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"volume\":\"32 \",\"pages\":\"4999-5009\"},\"PeriodicalIF\":4.1000,\"publicationDate\":\"2024-11-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10745728/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10745728/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
Citations: 0

Abstract

We introduce TF-CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. TF-CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of TF-CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, TF-CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, TF-CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.
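
The abstract names the processing stages but not how they are wired together. Below is a minimal PyTorch sketch of how such a stack could be composed, included only for orientation: the layer types, feature sizes, and the RandomChunkPositionalEncoding, GlobalAttention, BandRNN, and TFCrossNetSketch classes are illustrative assumptions that mirror the module names in the abstract (encoder, global multi-head self-attention, cross-band, narrow-band, output) and the idea of randomizing positional offsets during training; this is not the authors' implementation.

```python
# Minimal sketch of a TF-domain separation stack following the module names in
# the abstract. NOT the authors' implementation: all layer types, sizes, and the
# RandomChunkPositionalEncoding class are illustrative assumptions.
import math

import torch
import torch.nn as nn


class RandomChunkPositionalEncoding(nn.Module):
    """Sinusoidal time-axis encoding whose start index is drawn at random during
    training, so positions seen in training cover those needed for long utterances
    (an assumed reading of the paper's 'random chunk positional encoding')."""

    def __init__(self, dim: int, max_len: int = 4000):
        super().__init__()
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: (batch, time, freq, channels)
        t = x.shape[1]
        start = torch.randint(0, max(1, self.pe.shape[0] - t), (1,)).item() if self.training else 0
        return x + self.pe[start:start + t].view(1, t, 1, -1)


class GlobalAttention(nn.Module):
    """Multi-head self-attention over all time-frequency units (flattened T*F sequence)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, time, freq, channels)
        b, t, f, c = x.shape
        seq = x.reshape(b, t * f, c)
        out, _ = self.attn(seq, seq, seq)
        return self.norm(seq + out).reshape(b, t, f, c)


class BandRNN(nn.Module):
    """Bidirectional LSTM run along one axis: frequency (cross-band) or time (narrow-band)."""

    def __init__(self, dim: int, along: str):
        super().__init__()
        assert along in ("freq", "time")
        self.along = along
        self.rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):  # x: (batch, time, freq, channels)
        b, t, f, c = x.shape
        if self.along == "freq":  # cross-band: scan across frequencies within each frame
            seq = x.reshape(b * t, f, c)
        else:  # narrow-band: scan across time within each frequency band
            seq = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        out = self.proj(self.rnn(seq)[0])
        if self.along == "freq":
            return x + out.reshape(b, t, f, c)
        return x + out.reshape(b, f, t, c).permute(0, 2, 1, 3)


class TFCrossNetSketch(nn.Module):
    """Encoder -> positional encoding -> N x [global attention -> cross-band -> narrow-band]
    -> output layer, mapping a mixture spectrogram to per-speaker complex spectrograms."""

    def __init__(self, dim: int = 64, blocks: int = 2, speakers: int = 2):
        super().__init__()
        self.encoder = nn.Linear(2, dim)  # real/imag of the mixture STFT -> feature channels
        self.pos_enc = RandomChunkPositionalEncoding(dim)
        self.blocks = nn.ModuleList(
            [nn.ModuleList([GlobalAttention(dim), BandRNN(dim, "freq"), BandRNN(dim, "time")])
             for _ in range(blocks)]
        )
        self.output = nn.Linear(dim, 2 * speakers)  # complex spectral mapping: real/imag per speaker
        self.speakers = speakers

    def forward(self, spec):  # spec: complex mixture STFT, (batch, time, freq)
        x = self.encoder(torch.stack([spec.real, spec.imag], dim=-1))
        x = self.pos_enc(x)
        for attn, cross_band, narrow_band in self.blocks:
            x = narrow_band(cross_band(attn(x)))
        y = self.output(x).reshape(*x.shape[:3], self.speakers, 2)
        return torch.complex(y[..., 0], y[..., 1])  # (batch, time, freq, speakers)


if __name__ == "__main__":
    mixture = torch.randn(1, 100, 129, dtype=torch.complex64)  # toy stand-in for a mixture STFT
    estimates = TFCrossNetSketch()(mixture)
    print(estimates.shape)  # torch.Size([1, 100, 129, 2])
```

Running the script prints the shape of the estimated per-speaker complex spectrograms; in practice these would be converted back to waveforms with an inverse STFT, and training details such as the loss function are not specified in the abstract.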
Source journal
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subject categories: ACOUSTICS; ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 11.30
Self-citation rate: 11.10%
Articles published per year: 217
Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.