{"title":"资源匮乏行动识别的对比上下文感知提示","authors":"Hailun Zhang, Ziyun Zeng, Qijun Zhao, Zhen Zhai","doi":"10.1109/ICME55011.2023.00137","DOIUrl":null,"url":null,"abstract":"Existing large-scale image-language pre-trained models, e.g., CLIP [1], have revealed strong spatial recognition capability on various vision tasks. However, they achieve inferior performance in action recognition due to lack of temporal reasoning ability. Moreover, fully tuning large models require expensive computational infrastructures, and state-of-the-art video models yield slow inference speed due to the high frame sampling rate. The above drawbacks make existing video action recognition works impractical to be applied in resource-hungry scenarios, which is common in the real world. In this work, we propose Contrastive Context-Aware Prompt (ConCAP) for resource-hungry action recognition. Specifically, we develop a lightweight PromptFormer to learn the spatio-temporal representations stacking on top of frozen frame-wise visual backbones, where learnable prompt tokens are plugged between frame tokens during self-attention. These prompt tokens are expected to auto-complete the contextual spatiotemporal information between frames and therefore enhance the model’s representation capability. To achieve this goal, we align the prompt-enhanced representation with both category-level textual representations and video representations from densely sampled frames. Extensive experiments on four video benchmarks show that we achieve state-of-the-art or competitive performance compared to existing methods with far fewer trainable parameters and faster inference speed with limited frames, demonstrating the superiority of ConCAP in resource-hungry scenarios.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ConCAP: Contrastive Context-Aware Prompt for Resource-hungry Action Recognition\",\"authors\":\"Hailun Zhang, Ziyun Zeng, Qijun Zhao, Zhen Zhai\",\"doi\":\"10.1109/ICME55011.2023.00137\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Existing large-scale image-language pre-trained models, e.g., CLIP [1], have revealed strong spatial recognition capability on various vision tasks. However, they achieve inferior performance in action recognition due to lack of temporal reasoning ability. Moreover, fully tuning large models require expensive computational infrastructures, and state-of-the-art video models yield slow inference speed due to the high frame sampling rate. The above drawbacks make existing video action recognition works impractical to be applied in resource-hungry scenarios, which is common in the real world. In this work, we propose Contrastive Context-Aware Prompt (ConCAP) for resource-hungry action recognition. Specifically, we develop a lightweight PromptFormer to learn the spatio-temporal representations stacking on top of frozen frame-wise visual backbones, where learnable prompt tokens are plugged between frame tokens during self-attention. These prompt tokens are expected to auto-complete the contextual spatiotemporal information between frames and therefore enhance the model’s representation capability. 
To achieve this goal, we align the prompt-enhanced representation with both category-level textual representations and video representations from densely sampled frames. Extensive experiments on four video benchmarks show that we achieve state-of-the-art or competitive performance compared to existing methods with far fewer trainable parameters and faster inference speed with limited frames, demonstrating the superiority of ConCAP in resource-hungry scenarios.\",\"PeriodicalId\":321830,\"journal\":{\"name\":\"2023 IEEE International Conference on Multimedia and Expo (ICME)\",\"volume\":\"106 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE International Conference on Multimedia and Expo (ICME)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICME55011.2023.00137\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Conference on Multimedia and Expo (ICME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME55011.2023.00137","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ConCAP: Contrastive Context-Aware Prompt for Resource-hungry Action Recognition
Existing large-scale image-language pre-trained models, e.g., CLIP [1], have demonstrated strong spatial recognition capability on various vision tasks. However, they achieve inferior performance in action recognition due to their lack of temporal reasoning ability. Moreover, fully fine-tuning large models requires expensive computational infrastructure, and state-of-the-art video models suffer from slow inference due to high frame sampling rates. These drawbacks make existing video action recognition methods impractical for resource-hungry scenarios, which are common in the real world. In this work, we propose Contrastive Context-Aware Prompt (ConCAP) for resource-hungry action recognition. Specifically, we develop a lightweight PromptFormer, stacked on top of a frozen frame-wise visual backbone, to learn spatio-temporal representations; learnable prompt tokens are inserted between frame tokens during self-attention. These prompt tokens are expected to auto-complete the contextual spatio-temporal information between frames, thereby enhancing the model's representation capability. To achieve this goal, we align the prompt-enhanced representation with both category-level textual representations and video representations from densely sampled frames. Extensive experiments on four video benchmarks show that ConCAP achieves state-of-the-art or competitive performance compared to existing methods, with far fewer trainable parameters and faster inference under limited frames, demonstrating its superiority in resource-hungry scenarios.
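To make the prompt-token idea concrete, below is a minimal PyTorch sketch of a PromptFormer-style module: learnable prompt tokens are interleaved between frozen frame features, the joint sequence passes through a small transformer encoder, and the pooled video embedding is aligned with category-level text embeddings via a symmetric contrastive loss. The class name, dimensions, number of prompts per gap, and pooling scheme are assumptions for illustration and do not reproduce the authors' implementation; the full method also aligns with video representations from densely sampled frames, which is omitted here for brevity.

```python
# Illustrative sketch only; names and hyperparameters are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptFormer(nn.Module):
    """Lightweight transformer that interleaves learnable prompt tokens
    between frozen frame-level features and pools a video embedding."""

    def __init__(self, embed_dim=512, num_frames=8, prompts_per_gap=1,
                 num_layers=2, num_heads=8):
        super().__init__()
        num_gaps = num_frames - 1
        # One group of learnable prompt tokens per gap between adjacent frames.
        self.prompts = nn.Parameter(
            torch.randn(num_gaps, prompts_per_gap, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) from a frozen frame-wise backbone (e.g. CLIP).
        B, T, _ = frame_feats.shape
        tokens = []
        for t in range(T):
            tokens.append(frame_feats[:, t:t + 1, :])            # frame token
            if t < T - 1:                                         # prompts in the gap
                tokens.append(self.prompts[t].unsqueeze(0).expand(B, -1, -1))
        x = torch.cat(tokens, dim=1)                              # (B, T + (T-1)*P, D)
        x = self.encoder(x)                                       # joint self-attention
        return x.mean(dim=1)                                      # pooled video embedding


def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning video embeddings with
    category-level text embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Usage sketch with random tensors standing in for frozen backbone outputs.
frames = torch.randn(4, 8, 512)   # (batch, frames, dim) from a frozen visual encoder
texts = torch.randn(4, 512)       # category-level embeddings from a frozen text encoder
model = PromptFormer()
loss = contrastive_alignment_loss(model(frames), texts)
```

Because only the prompt tokens and the small encoder are trainable while the backbone stays frozen, the number of trainable parameters remains small, which is the property the abstract emphasizes for resource-hungry deployment.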