ConCAP: Contrastive Context-Aware Prompt for Resource-hungry Action Recognition

2023 IEEE International Conference on Multimedia and Expo (ICME) Pub Date : 2023-07-01 DOI:10.1109/ICME55011.2023.00137

Hailun Zhang, Ziyun Zeng, Qijun Zhao, Zhen Zhai



Abstract

Existing large-scale image-language pre-trained models, e.g., CLIP [1], have shown strong spatial recognition capability on various vision tasks. However, they perform poorly on action recognition because they lack temporal reasoning ability. Moreover, fully fine-tuning large models requires expensive computational infrastructure, and state-of-the-art video models are slow at inference because of their high frame sampling rates. These drawbacks make existing video action recognition methods impractical in resource-hungry scenarios, which are common in the real world. In this work, we propose Contrastive Context-Aware Prompt (ConCAP) for resource-hungry action recognition. Specifically, we develop a lightweight PromptFormer, stacked on top of frozen frame-wise visual backbones, to learn spatio-temporal representations; learnable prompt tokens are inserted between frame tokens during self-attention. These prompt tokens are expected to auto-complete the contextual spatio-temporal information between frames and thereby enhance the model's representation capability. To achieve this goal, we align the prompt-enhanced representation with both category-level textual representations and video representations from densely sampled frames. Extensive experiments on four video benchmarks show that ConCAP achieves state-of-the-art or competitive performance compared with existing methods, with far fewer trainable parameters and faster inference from a limited number of frames, demonstrating its advantage in resource-hungry scenarios.
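The two mechanisms named in the abstract, prompt tokens placed between frame tokens and contrastive alignment of representations, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how many prompt tokens are used, where exactly they sit, or the form of the loss. The sketch assumes one prompt per gap between adjacent frames and a generic InfoNCE-style objective, and the names `interleave_prompts`, `info_nce`, and the `temperature` default are hypothetical.

```python
import math


def interleave_prompts(frame_tokens, prompt_tokens):
    """Place one prompt token between each pair of adjacent frame tokens.

    Assumption: exactly one prompt per gap; the abstract only says prompts
    are plugged "between frame tokens".
    """
    if len(prompt_tokens) != len(frame_tokens) - 1:
        raise ValueError("need exactly one prompt per gap between frames")
    sequence = []
    for i, frame in enumerate(frame_tokens):
        sequence.append(frame)
        if i < len(prompt_tokens):
            sequence.append(prompt_tokens[i])
    return sequence


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(queries, keys, temperature=0.07):
    """Generic InfoNCE loss: queries[i] should match keys[i] and no other key.

    Assumption: a stand-in for the paper's alignment objective, where the
    keys could be category text embeddings or dense-frame video embeddings.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Three sparse frame tokens with two prompts filling the gaps between them.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompts = [[0.5, 0.5], [0.2, 0.8]]
sequence = interleave_prompts(frames, prompts)  # order: frame, prompt, frame, prompt, frame

# Loss is small when each query lines up with its own key.
loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In a setup like the one the abstract describes, the interleaved sequence would be fed to a small transformer while the frame backbone stays frozen, so only the prompts and the lightweight module receive gradients.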


