{"title":"DOA或扬声器嵌入:哪个更适合多麦克风目标扬声器提取","authors":"Shuang Zhang;Jie Zhang;Yichi Wang;Haoyin Yan","doi":"10.1109/LSP.2025.3600168","DOIUrl":null,"url":null,"abstract":"Target speaker extraction (TSE) is a useful front-end to improve the speech quality and intelligibility for speech applications, whereas direction-of-arrival (DOA) and speaker embedding are two of the most often-used assistive clues to identify the target speaker in audio-only multi-microphone systems. Both can significantly improve the TSE performance compared to blind TSE models, which however have not yet been comprehensively compared in literature. In order to show their pros and cons, in this work we therefore build a unified framework for a fair comparison that allows for both DOA and speaker embedding as the assistive clue. The DOA is used to calculate multichannel spatiotemporal speech features and a speaker encoder is designed to extract the speaker embedding, either of which is then fused with the noisy speech features for TSE. We can then evaluate their respective strengths in diverse acoustic conditions, e.g., varying noise level, microphone number, speaker location. Results show that given true DOA angles, the DOA-based TSE model always outperforms the speaker embedding based counterpart regardless of noise/microphone/location conditions, meaning the stronger discriminativity of DOA in terms of speaker identity. 
This superiority becomes smaller if the DOA mis-match increases, and the latter can do better in the large DOA mismatch case.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"3350-3354"},"PeriodicalIF":3.9000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DOA or Speaker Embedding: Which is Better for Multi-Microphone Target Speaker Extraction\",\"authors\":\"Shuang Zhang;Jie Zhang;Yichi Wang;Haoyin Yan\",\"doi\":\"10.1109/LSP.2025.3600168\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Target speaker extraction (TSE) is a useful front-end to improve the speech quality and intelligibility for speech applications, whereas direction-of-arrival (DOA) and speaker embedding are two of the most often-used assistive clues to identify the target speaker in audio-only multi-microphone systems. Both can significantly improve the TSE performance compared to blind TSE models, which however have not yet been comprehensively compared in literature. In order to show their pros and cons, in this work we therefore build a unified framework for a fair comparison that allows for both DOA and speaker embedding as the assistive clue. The DOA is used to calculate multichannel spatiotemporal speech features and a speaker encoder is designed to extract the speaker embedding, either of which is then fused with the noisy speech features for TSE. We can then evaluate their respective strengths in diverse acoustic conditions, e.g., varying noise level, microphone number, speaker location. Results show that given true DOA angles, the DOA-based TSE model always outperforms the speaker embedding based counterpart regardless of noise/microphone/location conditions, meaning the stronger discriminativity of DOA in terms of speaker identity. 
This superiority becomes smaller if the DOA mis-match increases, and the latter can do better in the large DOA mismatch case.\",\"PeriodicalId\":13154,\"journal\":{\"name\":\"IEEE Signal Processing Letters\",\"volume\":\"32 \",\"pages\":\"3350-3354\"},\"PeriodicalIF\":3.9000,\"publicationDate\":\"2025-08-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Signal Processing Letters\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11129612/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11129612/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
DOA or Speaker Embedding: Which is Better for Multi-Microphone Target Speaker Extraction
Target speaker extraction (TSE) is a useful front-end for improving speech quality and intelligibility in speech applications, and direction-of-arrival (DOA) and speaker embeddings are two of the most widely used auxiliary clues for identifying the target speaker in audio-only multi-microphone systems. Compared to blind TSE models, both can significantly improve TSE performance, yet they have not been comprehensively compared in the literature. To expose their respective pros and cons, in this work we build a unified framework that accepts either DOA or speaker embedding as the auxiliary clue, enabling a fair comparison. The DOA is used to compute multichannel spatiotemporal speech features, and a speaker encoder is designed to extract the speaker embedding; either cue is then fused with the noisy speech features for TSE. This lets us evaluate their respective strengths under diverse acoustic conditions, e.g., varying noise level, number of microphones, and speaker location. Results show that, given true DOA angles, the DOA-based TSE model always outperforms the speaker-embedding-based counterpart regardless of the noise/microphone/location conditions, indicating that DOA is more discriminative with respect to speaker identity. This superiority shrinks as the DOA mismatch increases, and the speaker-embedding-based model performs better when the DOA mismatch is large.
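The abstract mentions computing multichannel spatiotemporal features from the DOA. A common instance of such a spatial cue is the "angle feature": for each time-frequency bin, compare the observed inter-channel phase differences against the phase differences a far-field source at the given DOA would produce. The sketch below is a minimal illustration of this idea under a linear-array, far-field assumption; it is not the paper's exact feature, and all names are hypothetical.

```python
import numpy as np

def angle_feature(stft, mic_pos, doa_deg, fs=16000, n_fft=512, c=343.0):
    """Directional 'angle feature' for a given DOA (a simplified sketch).

    stft:    complex array (M, T, F) -- multichannel STFT
    mic_pos: (M,) mic x-coordinates in metres (linear array assumed)
    doa_deg: target direction of arrival in degrees (0 = endfire)
    Returns (T, F) scores in [-1, 1]; near 1 where the dominant energy
    arrives from doa_deg.
    """
    M, T, F = stft.shape
    freqs = np.arange(F) * fs / n_fft                  # frequency of each bin (Hz)
    # Far-field time delay of each mic relative to mic 0 for this DOA
    tau = (mic_pos - mic_pos[0]) * np.cos(np.deg2rad(doa_deg)) / c  # (M,)
    af = np.zeros((T, F))
    for m in range(1, M):
        # Observed inter-channel phase difference between mic m and mic 0
        ipd = np.angle(stft[m] * np.conj(stft[0]))     # (T, F)
        # Phase difference the target DOA predicts at each frequency
        tpd = 2 * np.pi * freqs * tau[m]               # (F,)
        af += np.cos(ipd - tpd)                        # alignment score
    return af / (M - 1)
```

Such a feature map can be concatenated with the noisy spectral features at the network input, which is the kind of fusion the unified framework described above would perform for the DOA branch.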
Journal Introduction:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance at signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also at several workshops organized by the Signal Processing Society.