Coarse-to-Fine Hypergraph Network for Spatiotemporal Action Detection

Ping Li; Xingchao Ye; Lingfeng He

DOI: 10.1109/TCSVT.2025.3558939
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8653-8665
Published: 2025-04-08
https://ieeexplore.ieee.org/document/10955692/
Spatiotemporal action detection localizes action instances along both spatial and temporal dimensions by identifying the action start time and end time, the action class, and the object (e.g., actor) bounding boxes. It faces two primary challenges: 1) varying durations of actions and inconsistent tempo of action instances within the same class, and 2) modeling complex object interactions, neither of which is well handled by previous methods. For the former, we develop a coarse-to-fine attention module, which employs an efficient dynamic time warping to make a coarse estimation of action frames by eliminating context-agnostic features, and further adopts an attention mechanism to capture the first-order object relations within those action frames. This yields finer-grained action estimation. For the latter, we design a ternary high-order hypergraph neural network, which models the spatial relations, the motion dynamics, and the high-order relations of different objects across frames. This strengthens the relations among objects within the same action, while suppressing the relations among objects in different actions. Therefore, we present a Coarse-to-Fine Hypergraph Network, abbreviated as CFHN, for spatiotemporal action detection, which jointly considers the object local context, the first-order object relations, and the high-order object relations. It combines the spatiotemporal first-order and high-order features along the channel dimension to obtain satisfactory detection results. Extensive experiments on several benchmarks including AVA, JHMDB-21, and UCF101-24 demonstrate the superiority of the proposed approach.
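The tempo-invariance idea behind the coarse estimation step can be illustrated with classic dynamic time warping: two sequences of the same action performed at different speeds align at low cost, while an unrelated sequence does not. Note this is the textbook O(nm) DTW on toy 1-D signals, not the paper's efficient variant or its actual feature pipeline; the sequences below are illustrative assumptions.

```python
import numpy as np

def dtw_cost(a, b):
    """Classic dynamic time warping cost between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local match cost
            D[i, j] = d + min(D[i - 1, j],        # insertion
                              D[i, j - 1],        # deletion
                              D[i - 1, j - 1])    # match
    return D[n, m]

# toy signals: the same "rise and fall" motion at two tempos
fast = np.array([0., 1., 2., 1., 0.])
slow = np.array([0., 0.5, 1., 1.5, 2., 1.5, 1., 0.5, 0.])

print(dtw_cost(fast, slow))           # 2.0 — tempo change tolerated
print(dtw_cost(fast, np.zeros(9)))    # 4.0 — unrelated flat signal costs more
```

Despite the differing lengths, the tempo-stretched sequence aligns much more cheaply than the flat one, which is the property a coarse action-frame estimator can exploit.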
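The high-order relation modeling can be sketched with a generic hypergraph convolution, where a hyperedge groups all objects belonging to one action instance, so that features propagate among same-action objects. This follows the standard spectral form X' = D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X Θ; the incidence matrix, feature sizes, and grouping here are illustrative assumptions, not the paper's actual CFHN design.

```python
import numpy as np

def hypergraph_conv(X, H, Theta):
    """One generic hypergraph convolution step.

    X:     (N, C) node (object) features
    H:     (N, E) incidence matrix, H[v, e] = 1 if object v is in hyperedge e
    Theta: (C, F) projection matrix
    Returns (N, F) updated node features.
    """
    Dv = H.sum(axis=1)                                   # node degrees
    De = H.sum(axis=0)                                   # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(Dv, 1e-12)))
    De_inv = np.diag(1.0 / np.maximum(De, 1e-12))
    # normalized propagation: Dv^{-1/2} H De^{-1} H^T Dv^{-1/2}
    A = Dv_inv_sqrt @ H @ De_inv @ H.T @ Dv_inv_sqrt
    return A @ X @ Theta

# toy example: 4 objects across frames, 2 hyperedges grouping objects
# that (hypothetically) belong to the same action instance
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
H = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1]], dtype=float)   # object 3 participates in both actions
Theta = rng.standard_normal((8, 8))
out = hypergraph_conv(X, H, Theta)
print(out.shape)   # (4, 8)
```

Because each hyperedge connects all objects of one action at once, a single propagation step mixes information among an entire group, which is what distinguishes high-order relations from pairwise (first-order) attention.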
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.