Audio Feature Generation for Missing Modality Problem in Video Action Recognition

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3956-3960. Pub Date: 2019-05-14. DOI: 10.1109/ICASSP.2019.8682513

Hu-Cheng Lee, Chih-Yu Lin, P. Hsu, Winston H. Hsu


Citations: 6

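Abstract

Despite the recent success of multi-modal action recognition in videos, in practice some data are often unavailable beforehand, especially when the data are multi-modal. For example, although multi-modal action recognition requires both vision and audio data, the audio tracks of videos are easily lost because of broken files or device limitations. To cope with this missing-sound problem, we present an approach that simulates deep audio features from spatio-temporal vision data alone. We demonstrate that adding the simulated sound features can significantly assist the multi-modal action recognition task. Evaluating our method on the Moments in Time (MIT) dataset, we show that it performs favorably against the two-stream architecture, enabling a richer understanding of multi-modal action recognition in video.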
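As a rough illustration of the idea in the abstract (predict an audio feature from visual features, then fuse the two), the sketch below may help. It is not the authors' model: the abstract does not describe their generator, so a plain ridge regression stands in for it, and the feature sizes, the synthetic data, and the names `fit_audio_generator` and `fuse` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 200 training clips, 64-d visual and 16-d audio features.
n_clips, d_vis, d_aud = 200, 64, 16

# Synthetic stand-ins for features extracted by pretrained visual and audio networks.
visual_train = rng.normal(size=(n_clips, d_vis))
true_map = rng.normal(size=(d_vis, d_aud))
audio_train = visual_train @ true_map + 0.01 * rng.normal(size=(n_clips, d_aud))


def fit_audio_generator(visual, audio, ridge=1e-3):
    """Fit a ridge-regression map from visual features to audio features.

    Assumption: a linear map replaces whatever deep generator the paper uses.
    """
    gram = visual.T @ visual + ridge * np.eye(visual.shape[1])
    return np.linalg.solve(gram, visual.T @ audio)


def fuse(visual, generator):
    """Simulate the missing audio feature and concatenate it with the visual one."""
    simulated_audio = visual @ generator
    return np.concatenate([visual, simulated_audio], axis=1)


generator = fit_audio_generator(visual_train, audio_train)

# Clips whose audio track is missing: only visual features are available.
visual_test = rng.normal(size=(5, d_vis))
fused = fuse(visual_test, generator)
print(fused.shape)  # (5, 80): 64 visual dimensions plus 16 simulated audio dimensions
```

The fused vector would then be passed to an action classifier in place of the visual-plus-real-audio input; the classifier is omitted here because the abstract gives no details about it.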



