Online Audio-Visual Speech Separation with Generative Adversarial Training

Peng Zhang, Jiaming Xu, Yunzhe Hao, Bo Xu
{"title":"基于生成对抗训练的在线视听语音分离","authors":"Peng Zhang, Jiaming Xu, Yunzhe Hao, Bo Xu","doi":"10.1145/3467707.3467764","DOIUrl":null,"url":null,"abstract":"Audio-visual speech separation has been demonstrated to be effective in solving the cocktail party problem. However, most of the models cannot meet online processing, which limits their application in video communication and human-robot interaction. Besides, SI-SNR, the most popular training loss function in speech separation, results in some artifacts in the separated audio, which would harm downstream applications, such as automatic speech recognition (ASR). In this paper, we propose an online audio-visual speech separation model with generative adversarial training to solve the two problems mentioned above. We build our generator (i.e., audio-visual speech separator) with causal temporal convolutional network block and propose a streaming inference strategy, which allows our model to do speech separation in an online manner. The discriminator is involved in optimizing the generator, which can reduce the negative effects of SI-SNR. Experiments on simulated 2-speaker mixtures based on challenging audio-visual dataset LRS2 show that our model outperforms the state-of-the-art audio-only model Conv-TasNet and audio-visual model advr-AVSS under the same model size. We test the running time of our model on GPU and CPU, and results show that our model meets online processing. The demo and code can be found at https://github.com/aispeech-lab/oavss.","PeriodicalId":145582,"journal":{"name":"2021 7th International Conference on Computing and Artificial Intelligence","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Online Audio-Visual Speech Separation with Generative Adversarial Training\",\"authors\":\"Peng Zhang, Jiaming Xu, Yunzhe Hao, Bo Xu\",\"doi\":\"10.1145/3467707.3467764\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Audio-visual speech separation has been demonstrated to be effective in solving the cocktail party problem. However, most of the models cannot meet online processing, which limits their application in video communication and human-robot interaction. Besides, SI-SNR, the most popular training loss function in speech separation, results in some artifacts in the separated audio, which would harm downstream applications, such as automatic speech recognition (ASR). In this paper, we propose an online audio-visual speech separation model with generative adversarial training to solve the two problems mentioned above. We build our generator (i.e., audio-visual speech separator) with causal temporal convolutional network block and propose a streaming inference strategy, which allows our model to do speech separation in an online manner. The discriminator is involved in optimizing the generator, which can reduce the negative effects of SI-SNR. Experiments on simulated 2-speaker mixtures based on challenging audio-visual dataset LRS2 show that our model outperforms the state-of-the-art audio-only model Conv-TasNet and audio-visual model advr-AVSS under the same model size. We test the running time of our model on GPU and CPU, and results show that our model meets online processing. 
The demo and code can be found at https://github.com/aispeech-lab/oavss.\",\"PeriodicalId\":145582,\"journal\":{\"name\":\"2021 7th International Conference on Computing and Artificial Intelligence\",\"volume\":\"42 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-04-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 7th International Conference on Computing and Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3467707.3467764\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 7th International Conference on Computing and Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3467707.3467764","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Audio-visual speech separation has been demonstrated to be effective in solving the cocktail party problem. However, most models cannot perform online processing, which limits their application in video communication and human-robot interaction. Moreover, SI-SNR, the most popular training loss function in speech separation, introduces artifacts into the separated audio, which can harm downstream applications such as automatic speech recognition (ASR). In this paper, we propose an online audio-visual speech separation model with generative adversarial training to address these two problems. We build our generator (i.e., the audio-visual speech separator) from causal temporal convolutional network blocks and propose a streaming inference strategy that allows the model to separate speech online. The discriminator participates in optimizing the generator, which reduces the negative effects of SI-SNR. Experiments on simulated two-speaker mixtures from the challenging audio-visual dataset LRS2 show that our model outperforms the state-of-the-art audio-only model Conv-TasNet and the audio-visual model advr-AVSS at the same model size. We measure the running time of our model on GPU and CPU, and the results show that it meets online processing requirements. The demo and code can be found at https://github.com/aispeech-lab/oavss.
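The abstract says the generator is built from causal temporal convolutional network blocks. Below is a minimal sketch of one such block in the style of a causal Conv-TasNet block, where left-only padding on a dilated depthwise convolution guarantees no future frames are used; the paper's exact layer layout, normalization, and hyperparameters are not given in the abstract, so the structure and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTCNBlock(nn.Module):
    """Depthwise-separable 1-D conv block that only attends to past frames.
    Normalization layers are omitted for brevity."""

    def __init__(self, channels: int, hidden: int, kernel_size: int, dilation: int):
        super().__init__()
        # Left-only padding of (k - 1) * d frames makes the dilated conv causal.
        self.pad = (kernel_size - 1) * dilation
        self.pointwise_in = nn.Conv1d(channels, hidden, kernel_size=1)
        self.depthwise = nn.Conv1d(hidden, hidden, kernel_size,
                                   dilation=dilation, groups=hidden)
        self.pointwise_out = nn.Conv1d(hidden, channels, kernel_size=1)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        y = self.act(self.pointwise_in(x))
        y = F.pad(y, (self.pad, 0))        # pad the past only, never the future
        y = self.act(self.depthwise(y))
        return x + self.pointwise_out(y)   # residual connection
```

Stacking such blocks with increasing dilations gives a large receptive field over past context while keeping the model streamable.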
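The streaming inference strategy itself is not detailed in the abstract. One plausible chunk-wise scheme for a causal separator, assuming frame-level lip features already upsampled to the audio rate, is sketched below; `chunk_len`, `context_len`, and the `separator` call signature are hypothetical.

```python
import torch

@torch.no_grad()
def stream_separate(separator, mixture, visual, chunk_len: int, context_len: int):
    """Hypothetical chunk-wise inference for a causal audio-visual separator.
    mixture: (1, T) waveform; visual: (1, T, D) lip features aligned to audio."""
    outputs = []
    for start in range(0, mixture.shape[-1], chunk_len):
        left = max(0, start - context_len)   # past context the causal stack needs
        end = start + chunk_len
        seg = separator(mixture[:, left:end], visual[:, left:end])
        outputs.append(seg[..., start - left:])  # keep only the new chunk
    return torch.cat(outputs, dim=-1)
```

Because every block is causal, each chunk's output depends only on past samples, so latency is bounded by `chunk_len` plus the model's compute time.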
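SI-SNR, the loss the paper argues causes artifacts, has a standard definition: zero-mean both signals, project the estimate onto the target, and take the energy ratio of the projection to the residual in dB. A PyTorch implementation of that standard formula:

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant signal-to-noise ratio in dB for (batch, time) waveforms."""
    # Remove DC offsets so the metric is invariant to constant shifts.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target: s_target = <e, t> t / ||t||^2.
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    s_target = dot / energy * target
    e_noise = estimate - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

# Separation models are typically trained to minimize the negative SI-SNR:
# loss = -si_snr(separated, clean).mean()
```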
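How the discriminator enters the objective is not specified in the abstract. A common setup, shown here purely as an assumption, trains the discriminator to distinguish clean from separated waveforms and adds a weighted adversarial term to the negative SI-SNR generator loss (reusing the `si_snr` helper above); all module names and the weight `LAMBDA` are hypothetical.

```python
import torch
import torch.nn.functional as F

LAMBDA = 0.1  # assumed weight for the adversarial term (not given in the abstract)

def train_step(separator, discriminator, g_opt, d_opt, mixture, lips, clean):
    # --- discriminator: score clean speech as real, separated speech as fake ---
    separated = separator(mixture, lips)
    real_logits = discriminator(clean)
    fake_logits = discriminator(separated.detach())  # no gradient to the generator
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- generator: maximize SI-SNR while fooling the discriminator ---
    fake_logits = discriminator(separated)
    adv_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    g_loss = -si_snr(separated, clean).mean() + LAMBDA * adv_loss
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return g_loss.item(), d_loss.item()
```

The intuition matches the abstract's claim: the adversarial term pushes separated audio toward the distribution of clean speech, counteracting artifacts that pure SI-SNR optimization can leave behind.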