Improving Video Moment Retrieval by Auxiliary Moment-Query Pairs With Hyper-Interaction

Impact Factor: 8.3 · CAS Zone 1 (Engineering & Technology) · JCR Q1 (Engineering, Electrical & Electronic)
Runhao Zeng;Yishen Zhuo;Jialiang Li;Yunjin Yang;Huisi Wu;Qi Chen;Xiping Hu;Victor C. M. Leung
{"title":"Improving Video Moment Retrieval by Auxiliary Moment-Query Pairs With Hyper-Interaction","authors":"Runhao Zeng;Yishen Zhuo;Jialiang Li;Yunjin Yang;Huisi Wu;Qi Chen;Xiping Hu;Victor C. M. Leung","doi":"10.1109/TCSVT.2024.3513633","DOIUrl":null,"url":null,"abstract":"Most existing video moment retrieval (VMR) benchmark datasets face a common issue of sparse annotations-only a few moments being annotated. We argue that videos contain a broader range of meaningful moments that, if leveraged, could significantly enhance performance. Existing methods typically follow a generate-then-select paradigm, focusing primarily on generating moment-query pairs while neglecting the crucial aspect of selection. In this paper, we propose a new method, HyperAux, to yield auxiliary moment-query pairs by modeling the multi-modal hyper-interaction between video and language. Specifically, given a set of candidate moment-query pairs from a video, we construct a hypergraph with multiple hyperedges, each corresponding to a moment-query pair. Unlike traditional graphs where each edge connects only two nodes (frames or queries), each hyperedge connects multiple nodes, including all frames within a moment, semantically related frames outside the moment, and an input query. This design allows us to consider the frames within a moment as a whole, rather than modeling individual frame-query relationships separately. More importantly, constructing the relationships among all moment-query pairs within a video into a large hypergraph facilitates selecting higher-quality data from such pairs. On this hypergraph, we employ a hypergraph neural network to aggregate node information, update the hyperedge, and propagate video-language hyper-interactions to each connected node, resulting in context-aware node representations. This enables us to use node relevance to select high-quality moment-query pairs and refine the moments’ boundaries. We also exploit the discrepancy in semantic matching within and outside moments to construct a loss function for training the HGNN without human annotations. Our auxiliary data enhances the performance of twelve VMR models under fully-supervised, weakly-supervised, and zero-shot settings across three widely used VMR datasets: ActivityNet Captions, Charades-STA, and QVHighlights. We will release the source code and models publicly.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"3940-3954"},"PeriodicalIF":8.3000,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10786261/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Most existing video moment retrieval (VMR) benchmark datasets face a common issue of sparse annotations: only a few moments are annotated. We argue that videos contain a broader range of meaningful moments that, if leveraged, could significantly enhance performance. Existing methods typically follow a generate-then-select paradigm, focusing primarily on generating moment-query pairs while neglecting the crucial aspect of selection. In this paper, we propose a new method, HyperAux, to yield auxiliary moment-query pairs by modeling the multi-modal hyper-interaction between video and language. Specifically, given a set of candidate moment-query pairs from a video, we construct a hypergraph with multiple hyperedges, each corresponding to a moment-query pair. Unlike traditional graphs, where each edge connects only two nodes (frames or queries), each hyperedge connects multiple nodes, including all frames within a moment, semantically related frames outside the moment, and an input query. This design allows us to consider the frames within a moment as a whole, rather than modeling individual frame-query relationships separately. More importantly, constructing the relationships among all moment-query pairs within a video into a large hypergraph facilitates selecting higher-quality data from such pairs. On this hypergraph, we employ a hypergraph neural network (HGNN) to aggregate node information, update the hyperedges, and propagate video-language hyper-interactions to each connected node, resulting in context-aware node representations. This enables us to use node relevance to select high-quality moment-query pairs and refine the moments' boundaries. We also exploit the discrepancy in semantic matching within and outside moments to construct a loss function for training the HGNN without human annotations. Our auxiliary data enhances the performance of twelve VMR models under fully-supervised, weakly-supervised, and zero-shot settings across three widely used VMR datasets: ActivityNet Captions, Charades-STA, and QVHighlights. We will release the source code and models publicly.
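The abstract outlines the core mechanism: build one hypergraph per video in which each hyperedge ties together the frames inside a candidate moment, semantically related frames outside it, and the query; run an HGNN to propagate video-language interactions; score pairs by node relevance; and train without annotations via the in-moment versus out-of-moment matching discrepancy. The paper's exact formulation is not reproduced on this page, so the following PyTorch sketch is only a minimal, hypothetical illustration of that pipeline; `build_incidence`, `HyperConv`, the similarity threshold, and the margin value are invented for demonstration and are not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def build_incidence(num_frames, pairs, frame_feats, query_feats, sim_thresh=0.7):
    """Build a (num_frames + num_queries) x num_pairs incidence matrix H.

    Hyperedge i connects: all frames inside moment i, frames outside the moment
    whose cosine similarity to query i exceeds sim_thresh (an invented stand-in
    for "semantically related frames"), and the query node itself.
    """
    num_pairs = len(pairs)
    H = torch.zeros(num_frames + num_pairs, num_pairs)
    sims = F.normalize(frame_feats, dim=-1) @ F.normalize(query_feats, dim=-1).T  # (frames, pairs)
    for i, (start, end) in enumerate(pairs):
        H[start:end + 1, i] = 1.0                        # frames inside the moment
        related = sims[:, i] > sim_thresh
        related[start:end + 1] = False
        H[related.nonzero(as_tuple=True)[0], i] = 1.0    # related frames outside the moment
        H[num_frames + i, i] = 1.0                       # the query node
    return H

class HyperConv(torch.nn.Module):
    """One generic node -> hyperedge -> node aggregation step (HGNN-style)."""
    def __init__(self, dim):
        super().__init__()
        self.edge_proj = torch.nn.Linear(dim, dim)
        self.node_proj = torch.nn.Linear(dim, dim)

    def forward(self, X, H):
        deg_e = H.sum(dim=0, keepdim=True).clamp(min=1)  # hyperedge degrees
        deg_v = H.sum(dim=1, keepdim=True).clamp(min=1)  # node degrees
        E = self.edge_proj(H.T @ X / deg_e.T)            # gather node features into each hyperedge
        X = self.node_proj(H @ E / deg_v)                # propagate hyperedge context back to nodes
        return F.relu(X), E

# Toy usage: 100 frames, 3 candidate moment-query pairs, 256-d features.
frame_feats = torch.randn(100, 256)
query_feats = torch.randn(3, 256)
pairs = [(10, 25), (40, 55), (70, 90)]                   # (start, end) frame indices
H = build_incidence(100, pairs, frame_feats, query_feats)
X = torch.cat([frame_feats, query_feats], dim=0)         # frame nodes followed by query nodes
X, E = HyperConv(256)(X, H)

# Select higher-quality pairs by the relevance of each query node to its hyperedge.
scores = F.cosine_similarity(X[100:], E, dim=-1)
keep = scores > scores.median()

# A hypothetical self-supervised margin loss in the spirit of the abstract:
# frames inside a moment should match the query better than frames outside it.
sims = F.normalize(X[:100], dim=-1) @ F.normalize(X[100:], dim=-1).T
losses = []
for i, (s, e) in enumerate(pairs):
    inside = sims[s:e + 1, i].mean()
    mask = torch.ones(100, dtype=torch.bool)
    mask[s:e + 1] = False
    outside = sims[mask, i].mean()
    losses.append(F.relu(0.2 + outside - inside))        # 0.2 is an invented margin
loss = torch.stack(losses).mean()
```

This sketch only conveys the hyperedge-as-moment-query-pair structure and the node-to-hyperedge-to-node aggregation pattern; the actual selection criterion, boundary refinement, and training objective are the contributions described in the paper itself.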
Source Journal Metrics
CiteScore: 13.80
Self-citation rate: 27.40%
Articles published: 660
Review time: 5 months
About the Journal: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.