Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

2022 5th International Conference on Information Communication and Signal Processing (ICICSP)
Pub Date: 2022-11-26
DOI: 10.1109/ICICSP55539.2022.10050611

B. Yu, Zhan Zhang, Ding Zhao, Yuehai Wang


Citations: 0

Abstract

In daily interactions, human speech perception is inherently a multi-modality process. Audio-visual speech enhancement (AV-SE) uses visual information to aid speech enhancement. However, the fusion strategy of most AV-SE approaches is too simple, so the audio modality dominates and the visual modality is usually ignored, especially when the signal-to-noise ratio (SNR) is medium or high. This paper proposes an encoder-decoder-based convolutional neural network for AV-SE with deep multi-modality fusion. The deep multi-modality fusion uses temporal attention to selectively align multi-modality features and preserves their temporal correlation by linear interpolation. This fusion strategy takes full advantage of video features, leading to a balanced multi-modality representation. To further improve AV-SE performance, a mixed deep feature loss is introduced. Two neural networks are applied to model the characteristics of speech and noise signals, respectively. Experiments on NTCD-TIMIT demonstrate the effectiveness of the proposed model: compared to the audio-only baseline and simple fusion approaches, it achieves better performance on objective metrics under all SNR conditions.
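The abstract names the two ingredients of the fusion step (temporal attention for alignment, linear interpolation to preserve temporal correlation) but gives no architectural details. The NumPy sketch below is only an illustration of how such a step could look, not the authors' implementation. The scaled dot-product scoring, the residual sum of the attended and interpolated video features, the final concatenation, and all function names and shapes are assumptions made for this example.

```python
import numpy as np

def interpolate_video(video_feats, num_audio_frames):
    """Linearly interpolate video features (T_v, D) in time to (T_a, D)."""
    t_v, dim = video_feats.shape
    src = np.linspace(0.0, 1.0, t_v)               # normalized video time axis
    dst = np.linspace(0.0, 1.0, num_audio_frames)  # normalized audio time axis
    cols = [np.interp(dst, src, video_feats[:, d]) for d in range(dim)]
    return np.stack(cols, axis=1)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_fuse(audio_feats, video_feats):
    """Fuse audio (T_a, D) and video (T_v, D) features into (T_a, 2D).

    Assumption: each audio frame attends over all video frames with
    scaled dot-product scores; the attended video features are added to
    a linearly interpolated copy of the video stream, then concatenated
    with the audio features.
    """
    t_a, dim = audio_feats.shape
    scores = audio_feats @ video_feats.T / np.sqrt(dim)  # (T_a, T_v)
    weights = softmax(scores, axis=1)                    # each row sums to 1
    attended = weights @ video_feats                     # (T_a, D)
    upsampled = interpolate_video(video_feats, t_a)      # (T_a, D)
    visual = attended + upsampled                        # assumed residual sum
    fused = np.concatenate([audio_feats, visual], axis=1)
    return fused, weights

# Hypothetical shapes: 100 audio frames and 25 video frames, 16 channels each.
rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 16))
video = rng.standard_normal((25, 16))
fused, weights = temporal_attention_fuse(audio, video)
print(fused.shape)  # (100, 32)
```

In this sketch the interpolated branch keeps the video stream's frame ordering intact, while the attention branch lets each audio frame pick the video frames it finds most relevant. A trained model would use learned projections for the attention scores rather than raw features.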
