{"title":"Joint Coarse to Fine-Grained Spatio-Temporal Modeling for Video Action Recognition","authors":"Chunlei Li;Can Cheng;Miao Yu;Zhoufeng Liu;Di Huang","doi":"10.1109/TBIOM.2025.3532416","DOIUrl":null,"url":null,"abstract":"The action recognition task involves analyzing video content and temporal relationships between frames to identify actions. Crucial to this process are action representations that effectively capture varying temporal scales and spatial motion variations. To address these challenges, we propose the Joint Coarse to Fine-Grained Spatio-Temporal Modeling (JCFG-STM) approach, which is designed to capture robust spatio-temporal representations through three key components: the Temporal-enhanced Spatio-Temporal Perception (TSTP) module, the Positional-enhanced Spatio-Temporal Perception (PSTP) module, and the Fine-grained Spatio-Temporal Perception (FSTP) module. Specifically, TSTP is designed to fuse temporal information across both local and global spatial scales, while PSTP emphasizes the integration of spatial coordinate directions, both horizontal and vertical, with temporal dynamics. Meanwhile, FSTP focuses on combining spatial coordinate information with short-term temporal data by differentiating neighboring frames, enabling fine-grained spatio-temporal modeling. JCFG-STM effectively focuses on multi-granularity and complementary motion patterns associated with actions. Extensive experiments conducted on large-scale action recognition datasets, including Kinetics-400, Something-Something V2, Jester, and EgoGesture, demonstrate the effectiveness of our approach and its superiority over state-of-the-art methods.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"7 3","pages":"444-457"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on biometrics, behavior, and identity science","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10848154/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Action recognition requires analyzing video content and the temporal relationships between frames to identify actions. Crucial to this process are action representations that capture varying temporal scales and spatial motion variations. To address these challenges, we propose the Joint Coarse to Fine-Grained Spatio-Temporal Modeling (JCFG-STM) approach, which captures robust spatio-temporal representations through three key components: the Temporal-enhanced Spatio-Temporal Perception (TSTP) module, the Positional-enhanced Spatio-Temporal Perception (PSTP) module, and the Fine-grained Spatio-Temporal Perception (FSTP) module. Specifically, TSTP fuses temporal information across both local and global spatial scales, while PSTP integrates the horizontal and vertical spatial coordinate directions with temporal dynamics. Meanwhile, FSTP combines spatial coordinate information with short-term temporal cues by differencing neighboring frames, enabling fine-grained spatio-temporal modeling. Together, these modules let JCFG-STM attend to multi-granularity, complementary motion patterns associated with actions. Extensive experiments on large-scale action recognition datasets, including Kinetics-400, Something-Something V2, Jester, and EgoGesture, demonstrate the effectiveness of our approach and its superiority over state-of-the-art methods.
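
The abstract states what each module does but not how it is implemented. The following minimal PyTorch sketch illustrates one plausible reading of the three modules, assuming a standard (N, C, T, H, W) feature layout; all internals (pooling windows, kernel sizes, the sigmoid gating, and the residual fusion) are illustrative assumptions, not the authors' published design.

```python
# Hypothetical sketch of the three JCFG-STM modules as described in the
# abstract. Tensor layout is assumed to be (N, C, T, H, W). All pooling
# windows, kernels, and fusion operators are illustrative guesses, not
# the published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TSTP(nn.Module):
    """Temporal-enhanced Spatio-Temporal Perception (assumed): fuses
    temporal context at a local spatial scale (pooled patches) and a
    global spatial scale (whole-frame descriptor)."""

    def __init__(self, channels: int):
        super().__init__()
        self.local_pool = nn.AvgPool3d(kernel_size=(1, 4, 4), stride=(1, 4, 4))
        # Temporal 1D convolution shared by both spatial scales (assumption).
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # Local scale: temporal conv over pooled spatial patches.
        local = self.temporal(self.local_pool(x))
        local = F.interpolate(local, size=(t, h, w))
        # Global scale: temporal conv over a per-frame global descriptor.
        glob = self.temporal(x.mean(dim=(3, 4), keepdim=True))
        # Gated fusion of the two scales (assumption).
        return x * torch.sigmoid(local + glob)


class PSTP(nn.Module):
    """Positional-enhanced Spatio-Temporal Perception (assumed):
    coordinate-style pooling along the horizontal and vertical axes,
    each coupled with temporal dynamics via a temporal conv."""

    def __init__(self, channels: int):
        super().__init__()
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        horiz = x.mean(dim=3, keepdim=True)  # pool over H -> (N, C, T, 1, W)
        vert = x.mean(dim=4, keepdim=True)   # pool over W -> (N, C, T, H, 1)
        # Broadcasting the two directional maps recovers (N, C, T, H, W).
        return x * torch.sigmoid(self.temporal(horiz) + self.temporal(vert))


class FSTP(nn.Module):
    """Fine-grained Spatio-Temporal Perception (assumed): differences
    neighboring frames to capture short-term motion, then mixes the
    motion map back in with a spatial conv."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frame difference x[t+1] - x[t], zero-padded to keep T frames.
        diff = x[:, :, 1:] - x[:, :, :-1]
        diff = torch.cat([diff, torch.zeros_like(x[:, :, :1])], dim=2)
        return x + self.spatial(diff)  # residual fusion (assumption)


# Toy usage: stack the three modules and check shape preservation.
block = nn.Sequential(TSTP(64), PSTP(64), FSTP(64))
video = torch.randn(2, 64, 8, 56, 56)   # (batch, channels, frames, H, W)
print(block(video).shape)               # torch.Size([2, 64, 8, 56, 56])
```

Since each sketched module preserves the input shape, a stack of the three could in principle be inserted after any backbone stage; whether the actual JCFG-STM modules are residual, gated, or cascaded in this order is not specified by the abstract.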