Joint Coarse to Fine-Grained Spatio-Temporal Modeling for Video Action Recognition

Chunlei Li, Can Cheng, Miao Yu, Zhoufeng Liu, Di Huang
{"title":"Joint Coarse to Fine-Grained Spatio-Temporal Modeling for Video Action Recognition","authors":"Chunlei Li;Can Cheng;Miao Yu;Zhoufeng Liu;Di Huang","doi":"10.1109/TBIOM.2025.3532416","DOIUrl":null,"url":null,"abstract":"The action recognition task involves analyzing video content and temporal relationships between frames to identify actions. Crucial to this process are action representations that effectively capture varying temporal scales and spatial motion variations. To address these challenges, we propose the Joint Coarse to Fine-Grained Spatio-Temporal Modeling (JCFG-STM) approach, which is designed to capture robust spatio-temporal representations through three key components: the Temporal-enhanced Spatio-Temporal Perception (TSTP) module, the Positional-enhanced Spatio-Temporal Perception (PSTP) module, and the Fine-grained Spatio-Temporal Perception (FSTP) module. Specifically, TSTP is designed to fuse temporal information across both local and global spatial scales, while PSTP emphasizes the integration of spatial coordinate directions, both horizontal and vertical, with temporal dynamics. Meanwhile, FSTP focuses on combining spatial coordinate information with short-term temporal data by differentiating neighboring frames, enabling fine-grained spatio-temporal modeling. JCFG-STM effectively focuses on multi-granularity and complementary motion patterns associated with actions. Extensive experiments conducted on large-scale action recognition datasets, including Kinetics-400, Something-Something V2, Jester, and EgoGesture, demonstrate the effectiveness of our approach and its superiority over state-of-the-art methods.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"7 3","pages":"444-457"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on biometrics, behavior, and identity science","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10848154/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The action recognition task involves analyzing video content and temporal relationships between frames to identify actions. Crucial to this process are action representations that effectively capture varying temporal scales and spatial motion variations. To address these challenges, we propose the Joint Coarse to Fine-Grained Spatio-Temporal Modeling (JCFG-STM) approach, which is designed to capture robust spatio-temporal representations through three key components: the Temporal-enhanced Spatio-Temporal Perception (TSTP) module, the Positional-enhanced Spatio-Temporal Perception (PSTP) module, and the Fine-grained Spatio-Temporal Perception (FSTP) module. Specifically, TSTP is designed to fuse temporal information across both local and global spatial scales, while PSTP emphasizes the integration of spatial coordinate directions, both horizontal and vertical, with temporal dynamics. Meanwhile, FSTP focuses on combining spatial coordinate information with short-term temporal data by differentiating neighboring frames, enabling fine-grained spatio-temporal modeling. JCFG-STM effectively focuses on multi-granularity and complementary motion patterns associated with actions. Extensive experiments conducted on large-scale action recognition datasets, including Kinetics-400, Something-Something V2, Jester, and EgoGesture, demonstrate the effectiveness of our approach and its superiority over state-of-the-art methods.
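The abstract describes the three perception modules only at a high level, but it does name two concrete operations: differencing neighboring frames to extract fine-grained short-term motion (FSTP) and aggregating features along the horizontal and vertical coordinate directions (PSTP). A minimal PyTorch-style sketch of those two operations is given below. Everything here, including module and variable names, is an illustrative assumption based on the abstract's wording, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrameDifference(nn.Module):
    """Sketch of short-term motion cues via neighboring-frame
    differencing, in the spirit of FSTP (hypothetical, not the
    paper's code)."""
    def forward(self, x):
        # x: (batch, time, channels, height, width)
        diff = x[:, 1:] - x[:, :-1]                    # difference of adjacent frames
        diff = torch.cat([diff, diff[:, -1:]], dim=1)  # repeat last step to keep length
        return diff

class DirectionalPooling(nn.Module):
    """Sketch of pooling along the two spatial coordinate
    directions, in the spirit of PSTP (hypothetical)."""
    def forward(self, x):
        # x: (batch, channels, height, width)
        h_feat = x.mean(dim=3)  # pool over width  -> one descriptor per row
        w_feat = x.mean(dim=2)  # pool over height -> one descriptor per column
        return h_feat, w_feat

# Usage on a dummy clip: 2 videos, 8 frames, 3 channels, 56x56 pixels
clip = torch.randn(2, 8, 3, 56, 56)
motion = FrameDifference()(clip)                 # (2, 8, 3, 56, 56)
h, w = DirectionalPooling()(clip.flatten(0, 1))  # (16, 3, 56), (16, 3, 56)
print(motion.shape, h.shape, w.shape)
```

In a full model these descriptors would presumably be recombined with the per-frame features (e.g., as attention weights or residual additions) to emphasize motion-relevant regions, but the abstract does not specify that fusion step.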