{"title":"基于位置引导语音特征映射网络的多模态多通道语音分离","authors":"Yulin Wu , Xiaochen Wang , Dengshi Li , Ruimin Hu","doi":"10.1016/j.neucom.2025.131051","DOIUrl":null,"url":null,"abstract":"<div><div>In reality, the audio and visual signals of sound sources are closely aligned, working collaboratively to isolate the desired speech signal from overlapping voices of simultaneous talkers. To leverage the complementarity and utilize all available information from both auditory and visual sources in speech separation, we propose a novel robust multimodal and multichannel speech separation method, without requiring known camera parameters. The proposed method exploits the complementarity of audio and visual modalities to estimate the speaker’s location and adopts a location-guided speech feature mapping strategy, wherein the attention mechanism fusion method combines the high-level semantic information of auditory and visual sources, aiding in the separation of target speech with its corresponding directional features. Experimental results suggest that the proposed multimodal and multichannel speech separation system outperforms the baselines, demonstrating improvements of 0.64 <em>dB</em> in SI-SDR and 0.17 in PESQ, respectively. The proposed system consistently outperformed the baselines by achieving a 10.14 % absolute (26.05 % relative) word error rate (WER) reduction.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"652 ","pages":"Article 131051"},"PeriodicalIF":5.5000,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal and multichannel speech separation using location-guided speech feature mapping network\",\"authors\":\"Yulin Wu , Xiaochen Wang , Dengshi Li , Ruimin Hu\",\"doi\":\"10.1016/j.neucom.2025.131051\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In reality, the audio and visual signals of sound sources are closely aligned, working collaboratively to isolate the desired speech signal from overlapping voices of simultaneous talkers. To leverage the complementarity and utilize all available information from both auditory and visual sources in speech separation, we propose a novel robust multimodal and multichannel speech separation method, without requiring known camera parameters. The proposed method exploits the complementarity of audio and visual modalities to estimate the speaker’s location and adopts a location-guided speech feature mapping strategy, wherein the attention mechanism fusion method combines the high-level semantic information of auditory and visual sources, aiding in the separation of target speech with its corresponding directional features. Experimental results suggest that the proposed multimodal and multichannel speech separation system outperforms the baselines, demonstrating improvements of 0.64 <em>dB</em> in SI-SDR and 0.17 in PESQ, respectively. 
The proposed system consistently outperformed the baselines by achieving a 10.14 % absolute (26.05 % relative) word error rate (WER) reduction.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"652 \",\"pages\":\"Article 131051\"},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2025-07-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231225017230\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225017230","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
In real-world scenes, the audio and visual signals of a sound source are closely aligned and work collaboratively to isolate the desired speech signal from the overlapping voices of simultaneous talkers. To leverage this complementarity and utilize all available information from both auditory and visual sources in speech separation, we propose a novel, robust multimodal and multichannel speech separation method that does not require known camera parameters. The proposed method exploits the complementarity of the audio and visual modalities to estimate the speaker's location and adopts a location-guided speech feature mapping strategy, in which an attention-based fusion mechanism combines high-level semantic information from the auditory and visual sources, aiding the separation of the target speech with its corresponding directional features. Experimental results show that the proposed multimodal and multichannel speech separation system outperforms the baselines, with improvements of 0.64 dB in SI-SDR and 0.17 in PESQ. It also achieves a 10.14% absolute (26.05% relative) reduction in word error rate (WER) over the baselines.
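To make the two core ideas in the abstract concrete, the sketch below illustrates, under hypothetical assumptions, (a) a directional "angle" feature computed by comparing observed inter-channel phase differences with the phase pattern implied by an estimated speaker direction, and (b) a cross-attention module that fuses high-level audio and visual embeddings. This is a minimal illustrative sketch, not the authors' implementation: the two-microphone far-field geometry, microphone spacing, embedding dimensions, and module design are all placeholders.

# Minimal sketch in PyTorch; all dimensions and geometry are hypothetical.

import math
import torch
import torch.nn as nn


def angle_feature(ipd, mic_spacing, doa_rad, freqs, c=343.0):
    """Directional feature for an estimated speaker direction.

    Compares the observed inter-channel phase difference (IPD) with the phase
    difference predicted by the estimated direction of arrival (DOA); values
    near 1 indicate time-frequency bins dominated by the target direction.

    ipd:     (batch, freq, time) observed IPD in radians
    doa_rad: (batch,) estimated speaker azimuth in radians
    freqs:   (freq,) center frequency of each bin in Hz
    """
    tau = mic_spacing * torch.cos(doa_rad) / c                            # far-field delay (s)
    target_ipd = 2 * math.pi * freqs[None, :, None] * tau[:, None, None]  # (batch, freq, 1)
    return torch.cos(ipd - target_ipd)                                    # (batch, freq, time)


class AttentionFusion(nn.Module):
    """Cross-attention fusion: audio frames attend to visual (lip) embeddings."""

    def __init__(self, audio_dim=256, visual_dim=512, fused_dim=256, heads=4):
        super().__init__()
        self.a_proj = nn.Linear(audio_dim, fused_dim)
        self.v_proj = nn.Linear(visual_dim, fused_dim)
        self.attn = nn.MultiheadAttention(fused_dim, heads, batch_first=True)
        self.out = nn.Linear(2 * fused_dim, fused_dim)

    def forward(self, audio_emb, visual_emb):
        # audio_emb: (batch, T_audio, audio_dim); visual_emb: (batch, T_video, visual_dim)
        q = self.a_proj(audio_emb)
        kv = self.v_proj(visual_emb)
        attended, _ = self.attn(q, kv, kv)                  # audio queries visual context
        return self.out(torch.cat([q, attended], dim=-1))   # fused audio-visual features


if __name__ == "__main__":
    batch, freq_bins, frames = 2, 257, 100
    ipd = torch.rand(batch, freq_bins, frames) * 2 * math.pi - math.pi
    freqs = torch.linspace(0.0, 8000.0, freq_bins)
    doa = torch.tensor([0.5, 1.2])                          # estimated azimuths (radians)
    af = angle_feature(ipd, mic_spacing=0.05, doa_rad=doa, freqs=freqs)
    fusion = AttentionFusion()
    fused = fusion(torch.randn(batch, frames, 256), torch.randn(batch, 25, 512))
    print(af.shape, fused.shape)   # torch.Size([2, 257, 100]) torch.Size([2, 100, 256])

Using the audio stream as the attention query and the visual stream as keys and values is one common design choice for such fusion; the paper's actual fusion and feature-mapping network may differ.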
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Its essential topics cover neurocomputing theory, practice, and applications.