{"title":"基于dptnet的语音分离波束形成","authors":"Tongtong Zhao, C. Bao, Xue Yang, Xu Zhang","doi":"10.1109/ICSPCC55723.2022.9984356","DOIUrl":null,"url":null,"abstract":"Filter-and-sum beamforming framework could separate speech effectively from the complicated acoustic scenarios by using dual-path recurrent neural network (DPRNN) to estimate the beamforming filters. Since the concerned context information was modeled by recurrent layers of the intermediate states, only the suboptimal separation performance can be achieved. To increase the performance, the dual-path transformer network (DPTNet) is employed to estimate beamforming filters instead of DPRNN in this paper because the DPTNet takes advantage of self-attention mechanism and makes high dimension feature sequences interacted directly. Specifically, to provide the spatial and context information of multi-channel speech signals, the cosine similarities between different channels are first concatenated with the transformed speech signals to serve as the input. Then, the DPTNet and transform-averaged-concatenation operation are used to extract context information for estimating beamforming filter of each channel. Finally, the observed signal of each channel is filtered and added to obtain the desired speech. Compared with the existing FaSNet, the proposed method can achieve better separation performance.","PeriodicalId":346917,"journal":{"name":"2022 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"DPTNet-based Beamforming for Speech Separation\",\"authors\":\"Tongtong Zhao, C. 
Bao, Xue Yang, Xu Zhang\",\"doi\":\"10.1109/ICSPCC55723.2022.9984356\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Filter-and-sum beamforming framework could separate speech effectively from the complicated acoustic scenarios by using dual-path recurrent neural network (DPRNN) to estimate the beamforming filters. Since the concerned context information was modeled by recurrent layers of the intermediate states, only the suboptimal separation performance can be achieved. To increase the performance, the dual-path transformer network (DPTNet) is employed to estimate beamforming filters instead of DPRNN in this paper because the DPTNet takes advantage of self-attention mechanism and makes high dimension feature sequences interacted directly. Specifically, to provide the spatial and context information of multi-channel speech signals, the cosine similarities between different channels are first concatenated with the transformed speech signals to serve as the input. Then, the DPTNet and transform-averaged-concatenation operation are used to extract context information for estimating beamforming filter of each channel. Finally, the observed signal of each channel is filtered and added to obtain the desired speech. 
Compared with the existing FaSNet, the proposed method can achieve better separation performance.\",\"PeriodicalId\":346917,\"journal\":{\"name\":\"2022 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC)\",\"volume\":\"68 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICSPCC55723.2022.9984356\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSPCC55723.2022.9984356","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The filter-and-sum beamforming framework can effectively separate speech in complicated acoustic scenarios by using a dual-path recurrent neural network (DPRNN) to estimate the beamforming filters. However, since the context information is modeled through the intermediate states of recurrent layers, only suboptimal separation performance can be achieved. To improve performance, this paper employs the dual-path transformer network (DPTNet) instead of DPRNN to estimate the beamforming filters, because DPTNet exploits the self-attention mechanism and lets high-dimensional feature sequences interact with each other directly. Specifically, to provide the spatial and context information of the multi-channel speech signals, the cosine similarities between different channels are first concatenated with the transformed speech signals to serve as the network input. Then, the DPTNet and a transform-average-concatenate (TAC) operation are used to extract context information for estimating the beamforming filter of each channel. Finally, the observed signal of each channel is filtered and summed to obtain the desired speech. Compared with the existing FaSNet, the proposed method achieves better separation performance.
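The two signal-level operations the abstract relies on, frame-wise cosine-similarity features between channels and time-domain filter-and-sum, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`frame_signal`, `cosine_similarity_features`, `filter_and_sum`), the frame/hop lengths, and the use of a per-frame FIR filter applied with `np.convolve` are illustrative assumptions; in the actual system the filters are produced by the DPTNet rather than given.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    # Split a 1-D signal into overlapping frames -> (n_frames, frame_len).
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def cosine_similarity_features(frames_ref, frames_other):
    # Frame-wise cosine similarity between a reference channel and another
    # channel; this is the kind of cross-channel spatial feature the abstract
    # says is concatenated with the transformed speech signals.
    num = np.sum(frames_ref * frames_other, axis=-1)
    den = (np.linalg.norm(frames_ref, axis=-1)
           * np.linalg.norm(frames_other, axis=-1) + 1e-8)
    return num / den

def filter_and_sum(frames, filters):
    # frames:  (n_ch, n_frames, frame_len) observed multi-channel frames
    # filters: (n_ch, n_frames, filt_len)  estimated per-channel filters
    # Filter each channel's frame with its own filter, then sum over channels.
    n_ch, n_frames, frame_len = frames.shape
    out = np.zeros((n_frames, frame_len))
    for c in range(n_ch):
        for t in range(n_frames):
            out[t] += np.convolve(frames[c, t], filters[c, t], mode="same")
    return out
```

With an identity (delta) filter on every channel, `filter_and_sum` reduces to plain channel summation, which is a convenient sanity check before plugging in learned filters.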