DOA or Speaker Embedding: Which is Better for Multi-Microphone Target Speaker Extraction

IF 3.9; Zone 2 (Engineering & Technology); JCR Q2, ENGINEERING, ELECTRICAL & ELECTRONIC
Shuang Zhang;Jie Zhang;Yichi Wang;Haoyin Yan
IEEE Signal Processing Letters, vol. 32, pp. 3350-3354. Published 2025-08-19. Journal article; not open access.
DOI: 10.1109/LSP.2025.3600168
https://ieeexplore.ieee.org/document/11129612/
Citation count: 0

Abstract

Target speaker extraction (TSE) is a useful front-end for improving speech quality and intelligibility in speech applications. Direction-of-arrival (DOA) and speaker embedding are two of the most commonly used assistive clues for identifying the target speaker in audio-only multi-microphone systems. Both can significantly improve TSE performance compared to blind TSE models, but they have not yet been comprehensively compared in the literature. To show their pros and cons, in this work we build a unified framework for a fair comparison that accepts either DOA or speaker embedding as the assistive clue. The DOA is used to calculate multichannel spatiotemporal speech features, and a speaker encoder is designed to extract the speaker embedding; either of the two is then fused with the noisy speech features for TSE. We then evaluate their respective strengths in diverse acoustic conditions, e.g., varying noise level, number of microphones, and speaker location. Results show that, given true DOA angles, the DOA-based TSE model always outperforms its speaker-embedding-based counterpart regardless of noise, microphone, and location conditions, indicating that DOA is the more discriminative clue for identifying the target speaker. This advantage shrinks as the DOA mismatch increases, and the speaker-embedding-based model can perform better when the DOA mismatch is large.
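
The abstract does not spell out which DOA-derived features the model uses. As a rough illustration only, the sketch below computes one feature that is common in DOA-guided extraction work: the cosine of the gap between the observed inter-microphone phase difference and the phase difference implied by the assumed DOA. The two-microphone far-field geometry, the speed of sound, the spacing and frequency defaults, and all function names are assumptions made for this example, not details taken from the paper.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value, not from the paper


def target_phase_difference(doa_deg, mic_spacing_m, freq_hz):
    """Phase difference (radians) between two microphones for a far-field
    plane wave arriving at doa_deg, measured from the array axis."""
    delay_s = mic_spacing_m * math.cos(math.radians(doa_deg)) / SPEED_OF_SOUND
    return 2.0 * math.pi * freq_hz * delay_s


def angle_feature(observed_ipd, doa_deg, mic_spacing_m, freq_hz):
    """Cosine of the gap between the observed phase difference and the one
    implied by the assumed DOA. A value of 1.0 means a perfect match."""
    expected = target_phase_difference(doa_deg, mic_spacing_m, freq_hz)
    return math.cos(observed_ipd - expected)


def feature_under_mismatch(true_doa_deg, assumed_doa_deg,
                           mic_spacing_m=0.05, freq_hz=1000.0):
    """Feature value when the clue given to the model (assumed_doa_deg)
    differs from where the target actually is (true_doa_deg)."""
    # Hypothetical clean observation: the target alone, at its true direction.
    observed = target_phase_difference(true_doa_deg, mic_spacing_m, freq_hz)
    return angle_feature(observed, assumed_doa_deg, mic_spacing_m, freq_hz)


if __name__ == "__main__":
    # Print the feature for growing DOA errors around a target at 60 degrees.
    for error_deg in (0, 10, 30, 60):
        value = feature_under_mismatch(60.0, 60.0 + error_deg)
        print(f"DOA error {error_deg:2d} deg -> feature {value:.3f}")
```

In this toy setting the feature is strongest when the assumed DOA equals the true one and weakens as the error grows, which is consistent in spirit with the reported loss of the DOA advantage under mismatch. It says nothing about the size of that effect in the actual model.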
Source journal
IEEE Signal Processing Letters (Engineering & Technology: Electrical and Electronic Engineering)
CiteScore: 7.40
Self-citation rate: 12.80%
Articles published: 339
Review time: 2.8 months
About the journal: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.