缩小真实和仿真条件下时域多通道语音增强的差距

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-10-17 DOI:10.1109/WASPAA52581.2021.9632720

Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Y. Qian

{"title":"缩小真实和仿真条件下时域多通道语音增强的差距","authors":"Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Y. Qian","doi":"10.1109/WASPAA52581.2021.9632720","DOIUrl":null,"url":null,"abstract":"The deep learning based time-domain models, e.g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement. However, many experiments on the time-domain speech enhancement model are done in simulated conditions, and it is not well studied whether the good performance can generalize to real-world scenarios. In this paper, we aim to provide an insightful investigation of applying multi-channel Conv-TasNet based speech enhancement to both simulation and real data. Our preliminary experiments show a large performance gap between the two conditions in terms of the ASR performance. Several approaches are applied to close this gap, including the integration of multi-channel Conv-TasNet into the beamforming model with various strategies, and the joint training of speech enhancement and speech recognition models. Our experiments on the CHiME-4 corpus show that our proposed approaches can greatly reduce the speech recognition performance discrepancy between simulation and real data, while preserving the strong speech enhancement capability in the frontend.","PeriodicalId":429900,"journal":{"name":"2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions\",\"authors\":\"Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Y. Qian\",\"doi\":\"10.1109/WASPAA52581.2021.9632720\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The deep learning based time-domain models, e.g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement. However, many experiments on the time-domain speech enhancement model are done in simulated conditions, and it is not well studied whether the good performance can generalize to real-world scenarios. In this paper, we aim to provide an insightful investigation of applying multi-channel Conv-TasNet based speech enhancement to both simulation and real data. Our preliminary experiments show a large performance gap between the two conditions in terms of the ASR performance. Several approaches are applied to close this gap, including the integration of multi-channel Conv-TasNet into the beamforming model with various strategies, and the joint training of speech enhancement and speech recognition models. Our experiments on the CHiME-4 corpus show that our proposed approaches can greatly reduce the speech recognition performance discrepancy between simulation and real data, while preserving the strong speech enhancement capability in the frontend.\",\"PeriodicalId\":429900,\"journal\":{\"name\":\"2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)\",\"volume\":\"53 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/WASPAA52581.2021.9632720\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WASPAA52581.2021.9632720","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 13

摘要

基于深度学习的时域模型，如convt - tasnet，在单通道和多通道语音增强中都显示出巨大的潜力。然而，许多关于时域语音增强模型的实验都是在模拟条件下进行的，并没有很好地研究这种良好的性能是否可以推广到现实场景。在本文中，我们的目的是提供一个有洞察力的研究，应用多通道基于卷积- tasnet的语音增强模拟和真实数据。我们的初步实验表明，在ASR性能方面，两种条件之间存在较大的性能差距。为了弥补这一缺陷，我们采用了几种方法，包括采用多种策略将多通道卷积tasnet集成到波束形成模型中，以及语音增强和语音识别模型的联合训练。我们在CHiME-4语料库上的实验表明，我们提出的方法可以大大降低模拟和真实数据之间的语音识别性能差异，同时在前端保持强大的语音增强能力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

The deep learning based time-domain models, e.g. Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement. However, many experiments on the time-domain speech enhancement model are done in simulated conditions, and it is not well studied whether the good performance can generalize to real-world scenarios. In this paper, we aim to provide an insightful investigation of applying multi-channel Conv-TasNet based speech enhancement to both simulation and real data. Our preliminary experiments show a large performance gap between the two conditions in terms of the ASR performance. Several approaches are applied to close this gap, including the integration of multi-channel Conv-TasNet into the beamforming model with various strategies, and the joint training of speech enhancement and speech recognition models. Our experiments on the CHiME-4 corpus show that our proposed approaches can greatly reduce the speech recognition performance discrepancy between simulation and real data, while preserving the strong speech enhancement capability in the frontend.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

自引率

0.00%

发文量