SQL-Net: Semantic Query Learning for Point-Supervised Temporal Action Localization

IF 8.4 | CAS Tier 1, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS
Yu Wang;Shengjie Zhao;Shiwei Chen
{"title":"SQL-Net: Semantic Query Learning for Point-Supervised Temporal Action Localization","authors":"Yu Wang;Shengjie Zhao;Shiwei Chen","doi":"10.1109/TMM.2024.3521799","DOIUrl":null,"url":null,"abstract":"Point-supervised Temporal Action Localization (PS-TAL) detects temporal intervals of actions in untrimmed videos with a label-efficient paradigm. However, most existing methods fail to learn action completeness without instance-level annotations, resulting in fragmentary region predictions. In fact, the semantic information of snippets is crucial for detecting complete actions, meaning that snippets with similar representations should be considered as the same action category. To address this issue, we propose a novel representation refinement framework with a semantic query mechanism to enhance the discriminability of snippet-level features. Concretely, we set a group of learnable queries, each representing a specific action category, and dynamically update them based on the video context. With the assistance of these queries, we expect to search for the optimal action sequence that agrees with their semantics. Besides, we leverage some reliable proposals as pseudo labels and design a refinement and completeness module to refine temporal boundaries further, so that the completeness of action instances is captured. Finally, we demonstrate the superiority of the proposed method over existing state-of-the-art approaches on THUMOS14 and ActivityNet13 benchmarks. Notably, thanks to completeness learning, our algorithm achieves significant improvements under more stringent evaluation metrics.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"84-94"},"PeriodicalIF":8.4000,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10814700/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Point-supervised Temporal Action Localization (PS-TAL) detects the temporal intervals of actions in untrimmed videos under a label-efficient paradigm. However, most existing methods fail to learn action completeness without instance-level annotations, resulting in fragmentary region predictions. In fact, the semantic information of snippets is crucial for detecting complete actions: snippets with similar representations should be assigned to the same action category. To address this issue, we propose a novel representation refinement framework with a semantic query mechanism that enhances the discriminability of snippet-level features. Concretely, we set a group of learnable queries, each representing a specific action category, and dynamically update them based on the video context. With the assistance of these queries, we aim to search for the optimal action sequence that agrees with their semantics. In addition, we leverage reliable proposals as pseudo labels and design a refinement and completeness module to further refine temporal boundaries, so that the completeness of action instances is captured. Finally, we demonstrate the superiority of the proposed method over existing state-of-the-art approaches on the THUMOS14 and ActivityNet1.3 benchmarks. Notably, thanks to completeness learning, our algorithm achieves significant improvements under more stringent evaluation metrics.
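The abstract only sketches the semantic query mechanism at a high level. The snippet below is a minimal, illustrative PyTorch sketch of that idea, assuming pre-extracted snippet features (e.g., from an I3D backbone): one learnable query per action category is updated with video-specific context via cross-attention, and each snippet is then scored by its similarity to the updated queries. All module and parameter names (SemanticQuerySketch, snippet_proj, etc.) are hypothetical and are not taken from the paper or any released code.

```python
# Illustrative sketch of a semantic query mechanism (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticQuerySketch(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 2048, embed_dim: int = 256):
        super().__init__()
        # One learnable query embedding per action category.
        self.queries = nn.Parameter(torch.randn(num_classes, embed_dim))
        # Project snippet features (e.g., I3D) into the query embedding space.
        self.snippet_proj = nn.Linear(feat_dim, embed_dim)
        # Cross-attention: queries attend to snippet features so each query is
        # dynamically updated with the context of the current video.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, snippet_feats: torch.Tensor) -> torch.Tensor:
        """snippet_feats: (batch, num_snippets, feat_dim) pre-extracted features."""
        b = snippet_feats.size(0)
        snippets = self.snippet_proj(snippet_feats)            # (B, T, D)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)  # (B, C, D)
        # Update the category queries with video-specific context.
        updated_queries, _ = self.cross_attn(queries, snippets, snippets)
        # Score each snippet against each updated query by cosine similarity, so
        # snippets with similar representations receive the same category score,
        # which is the intuition behind the semantic query mechanism.
        snippets_n = F.normalize(snippets, dim=-1)
        queries_n = F.normalize(updated_queries, dim=-1)
        cas = torch.einsum("btd,bcd->btc", snippets_n, queries_n)  # snippet-level class scores
        return cas


if __name__ == "__main__":
    model = SemanticQuerySketch(num_classes=20)
    feats = torch.randn(2, 100, 2048)  # 2 videos, 100 snippets each
    print(model(feats).shape)          # torch.Size([2, 100, 20])
```

In this sketch, the cosine-similarity scores play the role of snippet-level class activations; the paper's additional steps, such as searching for an optimal action sequence that agrees with the query semantics and refining boundaries with pseudo labels, are not reproduced here.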
Source Journal

IEEE Transactions on Multimedia (Engineering & Technology - Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles per year: 576
Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.