{"title":"Joint Coarse to Fine-Grained Spatio-Temporal Modeling for Video Action Recognition","authors":"Chunlei Li;Can Cheng;Miao Yu;Zhoufeng Liu;Di Huang","doi":"10.1109/TBIOM.2025.3532416","DOIUrl":null,"url":null,"abstract":"The action recognition task involves analyzing video content and temporal relationships between frames to identify actions. Crucial to this process are action representations that effectively capture varying temporal scales and spatial motion variations. To address these challenges, we propose the Joint Coarse to Fine-Grained Spatio-Temporal Modeling (JCFG-STM) approach, which is designed to capture robust spatio-temporal representations through three key components: the Temporal-enhanced Spatio-Temporal Perception (TSTP) module, the Positional-enhanced Spatio-Temporal Perception (PSTP) module, and the Fine-grained Spatio-Temporal Perception (FSTP) module. Specifically, TSTP is designed to fuse temporal information across both local and global spatial scales, while PSTP emphasizes the integration of spatial coordinate directions, both horizontal and vertical, with temporal dynamics. Meanwhile, FSTP focuses on combining spatial coordinate information with short-term temporal data by differentiating neighboring frames, enabling fine-grained spatio-temporal modeling. JCFG-STM effectively focuses on multi-granularity and complementary motion patterns associated with actions. Extensive experiments conducted on large-scale action recognition datasets, including Kinetics-400, Something-Something V2, Jester, and EgoGesture, demonstrate the effectiveness of our approach and its superiority over state-of-the-art methods.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"7 3","pages":"444-457"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on biometrics, behavior, and identity science","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10848154/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Action recognition requires analyzing video content and the temporal relationships between frames to identify actions. Crucial to this process are action representations that capture varying temporal scales and spatial motion variations. To address these challenges, we propose the Joint Coarse to Fine-Grained Spatio-Temporal Modeling (JCFG-STM) approach, which captures robust spatio-temporal representations through three key components: the Temporal-enhanced Spatio-Temporal Perception (TSTP) module, the Positional-enhanced Spatio-Temporal Perception (PSTP) module, and the Fine-grained Spatio-Temporal Perception (FSTP) module. Specifically, TSTP fuses temporal information across both local and global spatial scales, while PSTP integrates the horizontal and vertical spatial coordinate directions with temporal dynamics. Meanwhile, FSTP combines spatial coordinate information with short-term temporal cues by differencing neighboring frames, enabling fine-grained spatio-temporal modeling. Together, these modules let JCFG-STM attend to multi-granularity, complementary motion patterns associated with actions. Extensive experiments on large-scale action recognition datasets, including Kinetics-400, Something-Something V2, Jester, and EgoGesture, demonstrate the effectiveness of our approach and its superiority over state-of-the-art methods.
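
The abstract states what each module does but not how it is implemented. The following minimal PyTorch sketch illustrates one plausible reading of the three modules, assuming a standard (N, C, T, H, W) feature layout; all internals (pooling windows, kernel sizes, the sigmoid gating, and the residual fusion) are illustrative assumptions, not the authors' published design.

```python
# Hypothetical sketch of the three JCFG-STM modules as described in the
# abstract. Tensor layout is assumed to be (N, C, T, H, W). All pooling
# windows, kernels, and fusion operators are illustrative guesses, not
# the published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TSTP(nn.Module):
    """Temporal-enhanced Spatio-Temporal Perception (assumed): fuses
    temporal context at a local spatial scale (pooled patches) and a
    global spatial scale (whole-frame descriptor)."""

    def __init__(self, channels: int):
        super().__init__()
        self.local_pool = nn.AvgPool3d(kernel_size=(1, 4, 4), stride=(1, 4, 4))
        # Temporal 1D convolution shared by both spatial scales (assumption).
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # Local scale: temporal conv over pooled spatial patches.
        local = self.temporal(self.local_pool(x))
        local = F.interpolate(local, size=(t, h, w))
        # Global scale: temporal conv over a per-frame global descriptor.
        glob = self.temporal(x.mean(dim=(3, 4), keepdim=True))
        # Gated fusion of the two scales (assumption).
        return x * torch.sigmoid(local + glob)


class PSTP(nn.Module):
    """Positional-enhanced Spatio-Temporal Perception (assumed):
    coordinate-style pooling along the horizontal and vertical axes,
    each coupled with temporal dynamics via a temporal conv."""

    def __init__(self, channels: int):
        super().__init__()
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        horiz = x.mean(dim=3, keepdim=True)  # pool over H -> (N, C, T, 1, W)
        vert = x.mean(dim=4, keepdim=True)   # pool over W -> (N, C, T, H, 1)
        # Broadcasting the two directional maps recovers (N, C, T, H, W).
        return x * torch.sigmoid(self.temporal(horiz) + self.temporal(vert))


class FSTP(nn.Module):
    """Fine-grained Spatio-Temporal Perception (assumed): differences
    neighboring frames to capture short-term motion, then mixes the
    motion map back in with a spatial conv."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frame difference x[t+1] - x[t], zero-padded to keep T frames.
        diff = x[:, :, 1:] - x[:, :, :-1]
        diff = torch.cat([diff, torch.zeros_like(x[:, :, :1])], dim=2)
        return x + self.spatial(diff)  # residual fusion (assumption)


# Toy usage: stack the three modules and check shape preservation.
block = nn.Sequential(TSTP(64), PSTP(64), FSTP(64))
video = torch.randn(2, 64, 8, 56, 56)   # (batch, channels, frames, H, W)
print(block(video).shape)               # torch.Size([2, 64, 8, 56, 56])
```

Since each sketched module preserves the input shape, a stack of the three could in principle be inserted after any backbone stage; whether the actual JCFG-STM modules are residual, gated, or cascaded in this order is not specified by the abstract.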