Audio Feature Generation for Missing Modality Problem in Video Action Recognition

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | Pub Date: 2019-05-14 | DOI: 10.1109/ICASSP.2019.8682513

Hu-Cheng Lee, Chih-Yu Lin, P. Hsu, Winston H. Hsu

Citations: 6

Abstract

Despite the recent success of multi-modal action recognition in videos, in practice some data are often unavailable beforehand, especially in multi-modal settings. For example, while both vision and audio data are required for multi-modal action recognition, audio tracks in videos are easily lost because of broken files or device limitations. To cope with this missing-sound problem, we present an approach that simulates deep audio features from spatio-temporal vision data alone. We demonstrate that adding the simulated sound features can significantly assist the multi-modal action recognition task. Evaluating our method on the Moments in Time (MIT) dataset, we show that it performs favorably against the two-stream architecture, enabling a richer understanding of multi-modal action recognition in video.
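The idea of simulating an audio feature from visual data and then fusing the two can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not specify the generator or the fusion scheme, so the sketch assumes a ridge-regression map from visual to audio features, fusion by concatenation, and synthetic random features in place of real deep features. All function names and dimensions here are hypothetical.

```python
import numpy as np


def fit_audio_generator(visual, audio, ridge=1e-3):
    """Fit a ridge-regression map from visual features to audio features.

    Assumption: a linear map stands in for whatever generator the paper uses.
    """
    d = visual.shape[1]
    gram = visual.T @ visual + ridge * np.eye(d)
    return np.linalg.solve(gram, visual.T @ audio)


def simulate_audio(visual, weights):
    """Predict an audio feature for clips whose audio track is missing."""
    return visual @ weights


def fuse(visual, audio):
    """Fuse the two modalities by concatenation (assumed fusion scheme)."""
    return np.concatenate([visual, audio], axis=1)


# Synthetic stand-ins for deep features (hypothetical sizes).
rng = np.random.default_rng(0)
n_train, n_test, d_vis, d_aud = 200, 5, 16, 8
true_map = rng.normal(size=(d_vis, d_aud))
train_visual = rng.normal(size=(n_train, d_vis))
train_audio = train_visual @ true_map + 0.01 * rng.normal(size=(n_train, d_aud))

# Learn the generator on clips that have both modalities.
weights = fit_audio_generator(train_visual, train_audio)

# At test time only visual features exist; simulate the audio and fuse.
test_visual = rng.normal(size=(n_test, d_vis))
simulated_audio = simulate_audio(test_visual, weights)
fused = fuse(test_visual, simulated_audio)
print(fused.shape)  # (5, 24)
```

The fused vector would then be passed to an action classifier in place of a real visual-plus-audio feature; the classifier itself is omitted here.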



