LoCATe-GAT: Modeling Multi-Scale Local Context and Action Relationships for Zero-Shot Action Recognition

IF 5.3 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Sandipan Sarma;Divyam Singal;Arijit Sur
{"title":"LoCATe-GAT:零射击动作识别的多尺度局部环境和动作关系建模","authors":"Sandipan Sarma;Divyam Singal;Arijit Sur","doi":"10.1109/TETCI.2024.3499995","DOIUrl":null,"url":null,"abstract":"The increasing number of actions in the real world makes it difficult for traditional deep-learning models to recognize unseen actions. Recently, pretrained contrastive image-based visual-language (I-VL) models have been adapted for efficient “zero-shot” scene understanding. Pairing such models with transformers to implement temporal modeling has been rewarding for zero-shot action recognition (ZSAR). However, the significance of modeling the local spatial context of objects and action environments remains unexplored. In this work, we propose a ZSAR framework called <italic>LoCATe-GAT</i>, comprising a novel Local Context-Aggregating Temporal transformer (LoCATe) and a Graph Attention Network (GAT). Specifically, image and text encodings extracted from a pretrained I-VL model are used as inputs for LoCATe-GAT. Motivated by the observation that object-centric and environmental contexts drive both distinguishability and functional similarity between actions, LoCATe captures multi-scale local context using dilated convolutional layers during temporal modeling. Furthermore, the proposed GAT models semantic relationships between classes and achieves a strong synergy with the video embeddings produced by LoCATe. Extensive experiments on four widely-used benchmarks – UCF101, HMDB51, ActivityNet, and Kinetics – show we achieve state-of-the-art results. Specifically, we obtain relative gains of 3.8% and 4.8% on these datasets in conventional and 16.6% on UCF101in generalized ZSAR settings. For large-scale datasets like ActivityNet and Kinetics, our method achieves a relative gain of 31.8% and 27.9%, respectively, over the previous methods. Additionally, we gain 25.3% and 18.4% on UCF101 and HMDB51 as per the recent “TruZe” evaluation protocol.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"9 4","pages":"2793-2805"},"PeriodicalIF":5.3000,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LoCATe-GAT: Modeling Multi-Scale Local Context and Action Relationships for Zero-Shot Action Recognition\",\"authors\":\"Sandipan Sarma;Divyam Singal;Arijit Sur\",\"doi\":\"10.1109/TETCI.2024.3499995\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The increasing number of actions in the real world makes it difficult for traditional deep-learning models to recognize unseen actions. Recently, pretrained contrastive image-based visual-language (I-VL) models have been adapted for efficient “zero-shot” scene understanding. Pairing such models with transformers to implement temporal modeling has been rewarding for zero-shot action recognition (ZSAR). However, the significance of modeling the local spatial context of objects and action environments remains unexplored. In this work, we propose a ZSAR framework called <italic>LoCATe-GAT</i>, comprising a novel Local Context-Aggregating Temporal transformer (LoCATe) and a Graph Attention Network (GAT). Specifically, image and text encodings extracted from a pretrained I-VL model are used as inputs for LoCATe-GAT. 
Motivated by the observation that object-centric and environmental contexts drive both distinguishability and functional similarity between actions, LoCATe captures multi-scale local context using dilated convolutional layers during temporal modeling. Furthermore, the proposed GAT models semantic relationships between classes and achieves a strong synergy with the video embeddings produced by LoCATe. Extensive experiments on four widely-used benchmarks – UCF101, HMDB51, ActivityNet, and Kinetics – show we achieve state-of-the-art results. Specifically, we obtain relative gains of 3.8% and 4.8% on these datasets in conventional and 16.6% on UCF101in generalized ZSAR settings. For large-scale datasets like ActivityNet and Kinetics, our method achieves a relative gain of 31.8% and 27.9%, respectively, over the previous methods. Additionally, we gain 25.3% and 18.4% on UCF101 and HMDB51 as per the recent “TruZe” evaluation protocol.\",\"PeriodicalId\":13135,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"volume\":\"9 4\",\"pages\":\"2793-2805\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2024-11-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10769605/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10769605/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The increasing number of actions in the real world makes it difficult for traditional deep-learning models to recognize unseen actions. Recently, pretrained contrastive image-based visual-language (I-VL) models have been adapted for efficient “zero-shot” scene understanding. Pairing such models with transformers to implement temporal modeling has been rewarding for zero-shot action recognition (ZSAR). However, the significance of modeling the local spatial context of objects and action environments remains unexplored. In this work, we propose a ZSAR framework called LoCATe-GAT, comprising a novel Local Context-Aggregating Temporal transformer (LoCATe) and a Graph Attention Network (GAT). Specifically, image and text encodings extracted from a pretrained I-VL model are used as inputs for LoCATe-GAT. Motivated by the observation that object-centric and environmental contexts drive both distinguishability and functional similarity between actions, LoCATe captures multi-scale local context using dilated convolutional layers during temporal modeling. Furthermore, the proposed GAT models semantic relationships between classes and achieves a strong synergy with the video embeddings produced by LoCATe. Extensive experiments on four widely used benchmarks – UCF101, HMDB51, ActivityNet, and Kinetics – show that we achieve state-of-the-art results. Specifically, we obtain relative gains of 3.8% and 4.8% on these datasets in the conventional setting and a 16.6% gain on UCF101 in the generalized ZSAR setting. For large-scale datasets like ActivityNet and Kinetics, our method achieves relative gains of 31.8% and 27.9%, respectively, over previous methods. Additionally, we gain 25.3% and 18.4% on UCF101 and HMDB51 under the recent “TruZe” evaluation protocol.
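To make the abstract's two ingredients concrete, the minimal PyTorch sketch below (our illustration, not the authors' released code) shows (1) how per-frame embeddings from a pretrained I-VL image encoder could be aggregated with parallel dilated 1-D convolutions to capture multi-scale local temporal context, in the spirit of LoCATe, and (2) a standard single-head graph-attention layer over class-text embeddings, in the spirit of the GAT component. All dimensions, dilation rates, the fusion/pooling choices, and the class adjacency are illustrative assumptions.

# Illustrative sketch only (not the authors' code); sizes and adjacency are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDilatedTemporalBlock(nn.Module):
    """Aggregate local temporal context from per-frame embeddings at several dilation rates."""

    def __init__(self, dim: int = 512, dilations=(1, 2, 4), kernel_size: int = 3):
        super().__init__()
        # One convolutional branch per dilation rate; the padding keeps the temporal length T.
        self.branches = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size,
                      padding=d * (kernel_size - 1) // 2, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Linear(dim * len(dilations), dim)

    def forward(self, frame_emb: torch.Tensor) -> torch.Tensor:
        # frame_emb: (batch, T, dim) frame encodings from a pretrained image encoder.
        x = frame_emb.transpose(1, 2)                        # (batch, dim, T) for Conv1d
        ctx = torch.cat([b(x) for b in self.branches], 1)    # (batch, dim * n_scales, T)
        ctx = self.fuse(ctx.transpose(1, 2))                 # (batch, T, dim)
        return ctx.mean(dim=1)                               # temporal pooling -> video embedding

class SimpleGATLayer(nn.Module):
    """Single-head graph attention over class-text embeddings (standard GAT-style scoring)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, class_emb: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # class_emb: (C, dim) text encodings of the action classes.
        # adj: (C, C) binary adjacency (with self-loops) linking semantically related classes.
        h = self.proj(class_emb)
        C = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(C, C, -1),
                          h.unsqueeze(0).expand(C, C, -1)], dim=-1)       # (C, C, 2*dim)
        scores = F.leaky_relu(self.attn(pair).squeeze(-1))                # (C, C)
        alpha = scores.masked_fill(adj == 0, float("-inf")).softmax(-1)   # attend to neighbours only
        return alpha @ h                                                  # refined class embeddings

# Toy usage: 8 videos of 16 frames, 101 classes, 512-dim embeddings (CLIP-like sizes).
video_emb = MultiScaleDilatedTemporalBlock()(torch.randn(8, 16, 512))     # (8, 512)
class_emb = SimpleGATLayer()(torch.randn(101, 512), torch.eye(101))       # (101, 512)
logits = video_emb @ class_emb.t()                                        # video-to-class matching scores
print(logits.shape)                                                       # torch.Size([8, 101])

In the paper's framework the video embeddings are matched against the GAT-refined class-text embeddings so that recognition transfers to unseen classes; the sketch above only fixes shapes and omits training, score normalisation, and the transformer-based temporal attention that LoCATe also employs.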
Source Journal
CiteScore: 10.30
Self-citation rate: 7.50%
Articles published: 147
Journal description: The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys. TETCI is an electronics-only publication. TETCI publishes six issues per year. Authors are encouraged to submit manuscripts in any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few such illustrative examples are glial cell networks, computational neuroscience, Brain Computer Interface, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, computational intelligence for the IoT and Smart-X technologies.