DOA or Speaker Embedding: Which is Better for Multi-Microphone Target Speaker Extraction

IF 3.9 · Region 2 (Engineering & Technology) · Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Signal Processing Letters Pub Date : 2025-08-19 DOI:10.1109/LSP.2025.3600168

Shuang Zhang;Jie Zhang;Yichi Wang;Haoyin Yan

IEEE Signal Processing Letters, vol. 32, pp. 3350-3354. Journal article, not open access. https://ieeexplore.ieee.org/document/11129612/

Citations: 0

Abstract

Target speaker extraction (TSE) is a useful front-end for improving speech quality and intelligibility in speech applications. Direction-of-arrival (DOA) and speaker embedding are two of the most commonly used assistive clues for identifying the target speaker in audio-only multi-microphone systems. Both can significantly improve TSE performance compared to blind TSE models, but they have not yet been comprehensively compared in the literature. To show their pros and cons, in this work we build a unified framework that accepts either DOA or speaker embedding as the assistive clue, allowing a fair comparison. The DOA is used to calculate multichannel spatiotemporal speech features, and a speaker encoder is designed to extract the speaker embedding; either of these is then fused with the noisy speech features for TSE. We then evaluate their respective strengths in diverse acoustic conditions, e.g., varying noise level, number of microphones, and speaker location. Results show that, given the true DOA angles, the DOA-based TSE model always outperforms its speaker-embedding-based counterpart regardless of noise, microphone, or location conditions, indicating that DOA is the more discriminative clue for identifying the target speaker. This advantage shrinks as the DOA mismatch increases, and the speaker-embedding-based model can perform better when the DOA mismatch is large.
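The abstract does not spell out which spatial features are computed from the DOA. As a rough illustration only, the sketch below uses one feature that is common in the multi-channel extraction literature, the so-called angle feature: the cosine of the gap between the observed inter-microphone phase difference and the phase difference expected for the given DOA. The two-microphone far-field model, the 5 cm spacing, the frequency grid, and all function names are assumptions made for this example, not details taken from the paper. It shows why such a clue weakens as the supplied DOA drifts away from the true one.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def target_phase_difference(doa_deg, mic_spacing_m, freq_hz):
    """Far-field phase difference (radians) between two microphones.

    doa_deg is measured from the microphone axis, so 90 degrees is broadside.
    """
    delay_s = mic_spacing_m * math.cos(math.radians(doa_deg)) / SPEED_OF_SOUND
    return 2.0 * math.pi * freq_hz * delay_s


def angle_feature(observed_ipd, doa_deg, mic_spacing_m, freq_hz):
    """Cosine of the gap between the observed and the DOA-predicted phase difference."""
    expected = target_phase_difference(doa_deg, mic_spacing_m, freq_hz)
    return math.cos(observed_ipd - expected)


def mean_angle_feature(true_doa_deg, assumed_doa_deg, mic_spacing_m, freqs_hz):
    """Average angle feature when the target sits at true_doa_deg but the
    model is given assumed_doa_deg (hypothetical helper for this sketch)."""
    total = 0.0
    for freq_hz in freqs_hz:
        # Idealised observation: clean target only, no noise or reverberation.
        observed = target_phase_difference(true_doa_deg, mic_spacing_m, freq_hz)
        total += angle_feature(observed, assumed_doa_deg, mic_spacing_m, freq_hz)
    return total / len(freqs_hz)


freqs = [250.0 * k for k in range(1, 17)]  # 250 Hz to 4 kHz, assumed grid
for mismatch_deg in (0, 10, 30, 60):
    score = mean_angle_feature(90.0, 90.0 + mismatch_deg, 0.05, freqs)
    print(f"DOA mismatch {mismatch_deg:2d} deg -> mean angle feature {score:.3f}")
```

In this idealised setting the feature is at its maximum when the supplied DOA is exact and falls as the mismatch grows, which is in line with the trend the abstract reports. A speaker embedding does not depend on the supplied angle, so it is unaffected by this kind of error.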


Source journal

IEEE Signal Processing Letters (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

7.40

Self-citation rate

12.80%

Articles published

339

Review time

2.8 months

About the journal: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.