CGCN: Context graph convolutional network for few-shot temporal action localization

Shihui Zhang, Houlin Wang, Lei Wang, Xueqiang Han, Qing Tian

Information Processing & Management | DOI: 10.1016/j.ipm.2024.103926 | Published: 2024-10-15
Full text: https://www.sciencedirect.com/science/article/pii/S0306457324002851
Abstract
Localizing human actions in videos has attracted extensive attention from industry and academia. Few-Shot Temporal Action Localization (FS-TAL) aims to detect human actions in untrimmed videos using a limited number of training samples. Existing FS-TAL methods usually ignore the semantic context between video snippets, making it difficult to detect actions during the query process. In this paper, we propose a novel FS-TAL method named Context Graph Convolutional Network (CGCN), which employs multi-scale graph convolution to aggregate semantic context between video snippets in addition to exploiting their temporal context. Specifically, CGCN constructs a graph for each scale of a video, where each video snippet is a node and the relationships between snippets are edges. There are three types of edges: sequence edges, intra-action edges, and inter-action edges. CGCN establishes sequence edges to enhance temporal expression. Intra-action edges utilize hyperbolic space to encapsulate context among video snippets within each action, while inter-action edges leverage Euclidean space to capture similar semantics between different actions. Through graph convolution at each scale, CGCN acquires richer, context-aware video representations. Experiments demonstrate that CGCN outperforms the second-best method by 4.5%/0.9% and 4.3%/0.9% mAP on the ActivityNet and THUMOS14 datasets in one-shot/five-shot scenarios, respectively, at IoU@0.5. The source code is available at https://github.com/mugenggeng/CGCN.git.
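To make the abstract's graph construction concrete, the sketch below illustrates the general idea in PyTorch: video snippets become graph nodes, sequence edges link temporally adjacent snippets, similarity-based edges link semantically related ones, and a graph-convolution layer aggregates context. This is a minimal illustration, not the authors' code: all names and the similarity threshold are assumptions, it uses a single temporal scale, and it substitutes plain cosine similarity for the paper's hyperbolic intra-action and Euclidean inter-action edges. The real implementation is in the repository linked above.

import torch
import torch.nn as nn
import torch.nn.functional as F


def build_context_adjacency(feats: torch.Tensor, sim_thresh: float = 0.7) -> torch.Tensor:
    """Combine sequence edges and similarity edges into one row-normalized
    adjacency matrix. feats: (T, D) snippet features at one temporal scale."""
    T = feats.size(0)
    adj = torch.zeros(T, T)

    # Sequence edges: connect each snippet to its temporal neighbors.
    idx = torch.arange(T - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0

    # Semantic edges (stand-in for intra-/inter-action edges): connect
    # snippets whose features are similar; the paper instead measures
    # intra-action context in hyperbolic space, which this sketch omits.
    sim = F.cosine_similarity(feats.unsqueeze(1), feats.unsqueeze(0), dim=-1)
    adj = torch.maximum(adj, (sim > sim_thresh).float())

    adj.fill_diagonal_(1.0)                       # self-loops
    deg = adj.sum(-1, keepdim=True).clamp(min=1)  # row-normalize
    return adj / deg


class SnippetGCNLayer(nn.Module):
    """One graph-convolution layer over snippet features: H' = ReLU(A H W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        return F.relu(adj @ self.proj(feats))


if __name__ == "__main__":
    snippets = torch.randn(100, 256)      # 100 snippets, 256-d features
    adj = build_context_adjacency(snippets)
    layer = SnippetGCNLayer(256, 256)
    out = layer(snippets, adj)            # (100, 256) context-aware features
    print(out.shape)

Repeating this per temporal scale and combining the resulting features would approximate the multi-scale aggregation the abstract describes.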
About the Journal
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.