Dongpeng Ma, Yiwen Wang, Liqiang He, Mingjie Jin, Dan Su, Dong Yu
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Published 2022-05-23. DOI: https://doi.org/10.1109/icassp43922.2022.9746328
DP-DWA: Dual-Path Dynamic Weight Attention Network With Streaming DFSMN-SAN For Automatic Speech Recognition
In multi-channel far-field automatic speech recognition (ASR), front-end processing of the speech signal introduces distortion, which degrades recognition performance. In this paper, we propose a dual-path network for the far-field acoustic model that takes both the voice processing (VP) signal and the acoustic echo cancellation (AEC) signal as input. Specifically, we design a dynamic weight attention (DWA) module to combine the two signals. In addition, we streamline our best deep feed-forward sequential memory network with self-attention (DFSMN-SAN) acoustic model to meet real-time requirements. A joint-training strategy is adopted to optimize the proposed approach. With the dual-path network, we achieve a 54.5% relative improvement in character error rate (CER) on a 10,000-hour online conference task. The proposed method is also not affected by the arrangement of the microphone array: on a vehicle task with a two-microphone array, we achieve a 23.56% relative improvement.
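The abstract does not specify how the DWA module computes its weights, so the following is only a minimal sketch of one generic way to combine two feature streams with input-dependent attention weights. Every name here (`fuse_frame`, `fuse_sequence`, `score_weights`) is hypothetical, and the single-vector scoring function is an assumption for illustration, not the authors' design.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_frame(vp: Vector, aec: Vector,
               score_weights: Vector) -> Tuple[Vector, List[float]]:
    """Fuse one VP frame and one AEC frame with per-frame weights.

    Assumption: each path is scored by a single (hypothetical) learned
    vector, and the two scores are normalized with a softmax.
    """
    weights = softmax([dot(score_weights, vp), dot(score_weights, aec)])
    fused = [weights[0] * v + weights[1] * a for v, a in zip(vp, aec)]
    return fused, weights


def fuse_sequence(vp_frames: Sequence[Vector], aec_frames: Sequence[Vector],
                  score_weights: Vector) -> List[Vector]:
    """Apply the per-frame fusion along the time axis."""
    return [fuse_frame(v, a, score_weights)[0]
            for v, a in zip(vp_frames, aec_frames)]


# Toy features: two frames, two dimensions per frame.
vp_frames = [[1.0, 3.0], [0.5, 0.5]]
aec_frames = [[3.0, 5.0], [1.5, 2.5]]

# Zero scoring weights give equal path weights, i.e. a plain average.
print(fuse_sequence(vp_frames, aec_frames, score_weights=[0.0, 0.0]))
# prints [[2.0, 4.0], [1.0, 1.5]]
```

Because the weights are recomputed for every frame, the fusion can lean on whichever path scores higher at that moment. In a trained system the scoring parameters would presumably be learned together with the acoustic model, which fits the joint-training strategy the abstract mentions.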