Improving Video Moment Retrieval by Auxiliary Moment-Query Pairs With Hyper-Interaction

Impact Factor: 8.3 · CAS Zone 1 (Engineering & Technology) · JCR Q1 (Engineering, Electrical & Electronic)
Runhao Zeng;Yishen Zhuo;Jialiang Li;Yunjin Yang;Huisi Wu;Qi Chen;Xiping Hu;Victor C. M. Leung
{"title":"Improving Video Moment Retrieval by Auxiliary Moment-Query Pairs With Hyper-Interaction","authors":"Runhao Zeng;Yishen Zhuo;Jialiang Li;Yunjin Yang;Huisi Wu;Qi Chen;Xiping Hu;Victor C. M. Leung","doi":"10.1109/TCSVT.2024.3513633","DOIUrl":null,"url":null,"abstract":"Most existing video moment retrieval (VMR) benchmark datasets face a common issue of sparse annotations-only a few moments being annotated. We argue that videos contain a broader range of meaningful moments that, if leveraged, could significantly enhance performance. Existing methods typically follow a generate-then-select paradigm, focusing primarily on generating moment-query pairs while neglecting the crucial aspect of selection. In this paper, we propose a new method, HyperAux, to yield auxiliary moment-query pairs by modeling the multi-modal hyper-interaction between video and language. Specifically, given a set of candidate moment-query pairs from a video, we construct a hypergraph with multiple hyperedges, each corresponding to a moment-query pair. Unlike traditional graphs where each edge connects only two nodes (frames or queries), each hyperedge connects multiple nodes, including all frames within a moment, semantically related frames outside the moment, and an input query. This design allows us to consider the frames within a moment as a whole, rather than modeling individual frame-query relationships separately. More importantly, constructing the relationships among all moment-query pairs within a video into a large hypergraph facilitates selecting higher-quality data from such pairs. On this hypergraph, we employ a hypergraph neural network to aggregate node information, update the hyperedge, and propagate video-language hyper-interactions to each connected node, resulting in context-aware node representations. This enables us to use node relevance to select high-quality moment-query pairs and refine the moments’ boundaries. We also exploit the discrepancy in semantic matching within and outside moments to construct a loss function for training the HGNN without human annotations. Our auxiliary data enhances the performance of twelve VMR models under fully-supervised, weakly-supervised, and zero-shot settings across three widely used VMR datasets: ActivityNet Captions, Charades-STA, and QVHighlights. We will release the source code and models publicly.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"3940-3954"},"PeriodicalIF":8.3000,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10786261/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Most existing video moment retrieval (VMR) benchmark datasets face a common issue of sparse annotations: only a few moments are annotated. We argue that videos contain a broader range of meaningful moments that, if leveraged, could significantly enhance performance. Existing methods typically follow a generate-then-select paradigm, focusing primarily on generating moment-query pairs while neglecting the crucial aspect of selection. In this paper, we propose a new method, HyperAux, to yield auxiliary moment-query pairs by modeling the multi-modal hyper-interaction between video and language. Specifically, given a set of candidate moment-query pairs from a video, we construct a hypergraph with multiple hyperedges, each corresponding to a moment-query pair. Unlike traditional graphs, where each edge connects only two nodes (frames or queries), each hyperedge connects multiple nodes, including all frames within a moment, semantically related frames outside the moment, and an input query. This design allows us to consider the frames within a moment as a whole, rather than modeling individual frame-query relationships separately. More importantly, constructing the relationships among all moment-query pairs within a video into a large hypergraph facilitates selecting higher-quality data from such pairs. On this hypergraph, we employ a hypergraph neural network (HGNN) to aggregate node information, update the hyperedges, and propagate video-language hyper-interactions to each connected node, resulting in context-aware node representations. This enables us to use node relevance to select high-quality moment-query pairs and refine the moments' boundaries. We also exploit the discrepancy in semantic matching within and outside moments to construct a loss function for training the HGNN without human annotations. Our auxiliary data enhances the performance of twelve VMR models under fully-supervised, weakly-supervised, and zero-shot settings across three widely used VMR datasets: ActivityNet Captions, Charades-STA, and QVHighlights. We will release the source code and models publicly.
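The abstract outlines the core mechanism: build one hypergraph per video in which each hyperedge ties together the frames inside a candidate moment, semantically related frames outside it, and the query; run an HGNN to propagate video-language interactions; score pairs by node relevance; and train without annotations via the in-moment versus out-of-moment matching discrepancy. The paper's exact formulation is not reproduced on this page, so the following PyTorch sketch is only a minimal, hypothetical illustration of that pipeline; `build_incidence`, `HyperConv`, the similarity threshold, and the margin value are invented for demonstration and are not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def build_incidence(num_frames, pairs, frame_feats, query_feats, sim_thresh=0.7):
    """Build a (num_frames + num_queries) x num_pairs incidence matrix H.

    Hyperedge i connects: all frames inside moment i, frames outside the moment
    whose cosine similarity to query i exceeds sim_thresh (an invented stand-in
    for "semantically related frames"), and the query node itself.
    """
    num_pairs = len(pairs)
    H = torch.zeros(num_frames + num_pairs, num_pairs)
    sims = F.normalize(frame_feats, dim=-1) @ F.normalize(query_feats, dim=-1).T  # (frames, pairs)
    for i, (start, end) in enumerate(pairs):
        H[start:end + 1, i] = 1.0                        # frames inside the moment
        related = sims[:, i] > sim_thresh
        related[start:end + 1] = False
        H[related.nonzero(as_tuple=True)[0], i] = 1.0    # related frames outside the moment
        H[num_frames + i, i] = 1.0                       # the query node
    return H

class HyperConv(torch.nn.Module):
    """One generic node -> hyperedge -> node aggregation step (HGNN-style)."""
    def __init__(self, dim):
        super().__init__()
        self.edge_proj = torch.nn.Linear(dim, dim)
        self.node_proj = torch.nn.Linear(dim, dim)

    def forward(self, X, H):
        deg_e = H.sum(dim=0, keepdim=True).clamp(min=1)  # hyperedge degrees
        deg_v = H.sum(dim=1, keepdim=True).clamp(min=1)  # node degrees
        E = self.edge_proj(H.T @ X / deg_e.T)            # gather node features into each hyperedge
        X = self.node_proj(H @ E / deg_v)                # propagate hyperedge context back to nodes
        return F.relu(X), E

# Toy usage: 100 frames, 3 candidate moment-query pairs, 256-d features.
frame_feats = torch.randn(100, 256)
query_feats = torch.randn(3, 256)
pairs = [(10, 25), (40, 55), (70, 90)]                   # (start, end) frame indices
H = build_incidence(100, pairs, frame_feats, query_feats)
X = torch.cat([frame_feats, query_feats], dim=0)         # frame nodes followed by query nodes
X, E = HyperConv(256)(X, H)

# Select higher-quality pairs by the relevance of each query node to its hyperedge.
scores = F.cosine_similarity(X[100:], E, dim=-1)
keep = scores > scores.median()

# A hypothetical self-supervised margin loss in the spirit of the abstract:
# frames inside a moment should match the query better than frames outside it.
sims = F.normalize(X[:100], dim=-1) @ F.normalize(X[100:], dim=-1).T
losses = []
for i, (s, e) in enumerate(pairs):
    inside = sims[s:e + 1, i].mean()
    mask = torch.ones(100, dtype=torch.bool)
    mask[s:e + 1] = False
    outside = sims[mask, i].mean()
    losses.append(F.relu(0.2 + outside - inside))        # 0.2 is an invented margin
loss = torch.stack(losses).mean()
```

This sketch only conveys the hyperedge-as-moment-query-pair structure and the node-to-hyperedge-to-node aggregation pattern; the actual selection criterion, boundary refinement, and training objective are the contributions described in the paper itself.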
Source Journal Metrics
CiteScore: 13.80
Self-citation rate: 27.40%
Articles published: 660
Review time: 5 months
About the Journal: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.