Coarse-to-Fine Hypergraph Network for Spatiotemporal Action Detection

Ping Li; Xingchao Ye; Lingfeng He

DOI: 10.1109/TCSVT.2025.3558939
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8653-8665
Published: 2025-04-08
https://ieeexplore.ieee.org/document/10955692/
Spatiotemporal action detection localizes action instances along both spatial and temporal dimensions by identifying the action start time and end time, the action class, and the object (e.g., actor) bounding boxes. It faces two primary challenges: 1) varying durations of actions and inconsistent tempo of action instances within the same class, and 2) modeling complex object interactions, neither of which is well handled by previous methods. For the former, we develop a coarse-to-fine attention module, which employs an efficient dynamic time warping to make a coarse estimation of action frames by eliminating context-agnostic features, and further adopts an attention mechanism to capture the first-order object relations within those action frames. This yields finer-grained action estimation. For the latter, we design a ternary high-order hypergraph neural network, which models the spatial relations, the motion dynamics, and the high-order relations of different objects across frames. This strengthens the relations among objects within the same action, while suppressing the relations among objects in different actions. Therefore, we present a Coarse-to-Fine Hypergraph Network, abbreviated as CFHN, for spatiotemporal action detection, which jointly considers the object local context, the first-order object relations, and the high-order object relations. It combines the spatiotemporal first-order and high-order features along the channel dimension to obtain satisfactory detection results. Extensive experiments on several benchmarks including AVA, JHMDB-21, and UCF101-24 demonstrate the superiority of the proposed approach.
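The tempo-invariance idea behind the coarse estimation step can be illustrated with classic dynamic time warping: two sequences of the same action performed at different speeds align at low cost, while an unrelated sequence does not. Note this is the textbook O(nm) DTW on toy 1-D signals, not the paper's efficient variant or its actual feature pipeline; the sequences below are illustrative assumptions.

```python
import numpy as np

def dtw_cost(a, b):
    """Classic dynamic time warping cost between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local match cost
            D[i, j] = d + min(D[i - 1, j],        # insertion
                              D[i, j - 1],        # deletion
                              D[i - 1, j - 1])    # match
    return D[n, m]

# toy signals: the same "rise and fall" motion at two tempos
fast = np.array([0., 1., 2., 1., 0.])
slow = np.array([0., 0.5, 1., 1.5, 2., 1.5, 1., 0.5, 0.])

print(dtw_cost(fast, slow))           # 2.0 — tempo change tolerated
print(dtw_cost(fast, np.zeros(9)))    # 4.0 — unrelated flat signal costs more
```

Despite the differing lengths, the tempo-stretched sequence aligns much more cheaply than the flat one, which is the property a coarse action-frame estimator can exploit.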
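The high-order relation modeling can be sketched with a generic hypergraph convolution, where a hyperedge groups all objects belonging to one action instance, so that features propagate among same-action objects. This follows the standard spectral form X' = D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X Θ; the incidence matrix, feature sizes, and grouping here are illustrative assumptions, not the paper's actual CFHN design.

```python
import numpy as np

def hypergraph_conv(X, H, Theta):
    """One generic hypergraph convolution step.

    X:     (N, C) node (object) features
    H:     (N, E) incidence matrix, H[v, e] = 1 if object v is in hyperedge e
    Theta: (C, F) projection matrix
    Returns (N, F) updated node features.
    """
    Dv = H.sum(axis=1)                                   # node degrees
    De = H.sum(axis=0)                                   # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(Dv, 1e-12)))
    De_inv = np.diag(1.0 / np.maximum(De, 1e-12))
    # normalized propagation: Dv^{-1/2} H De^{-1} H^T Dv^{-1/2}
    A = Dv_inv_sqrt @ H @ De_inv @ H.T @ Dv_inv_sqrt
    return A @ X @ Theta

# toy example: 4 objects across frames, 2 hyperedges grouping objects
# that (hypothetically) belong to the same action instance
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
H = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1]], dtype=float)   # object 3 participates in both actions
Theta = rng.standard_normal((8, 8))
out = hypergraph_conv(X, H, Theta)
print(out.shape)   # (4, 8)
```

Because each hyperedge connects all objects of one action at once, a single propagation step mixes information among an entire group, which is what distinguishes high-order relations from pairwise (first-order) attention.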
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.