Improving Video Moment Retrieval by Auxiliary Moment-Query Pairs With Hyper-Interaction

Runhao Zeng; Yishen Zhuo; Jialiang Li; Yunjin Yang; Huisi Wu; Qi Chen; Xiping Hu; Victor C. M. Leung

IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 5, pp. 3940-3954, DOI: 10.1109/TCSVT.2024.3513633, published online 9 December 2024.
Most existing video moment retrieval (VMR) benchmark datasets face a common issue of sparse annotations: only a few moments in each video are annotated. We argue that videos contain a broader range of meaningful moments that, if leveraged, could significantly enhance performance. Existing methods typically follow a generate-then-select paradigm, focusing primarily on generating moment-query pairs while neglecting the crucial aspect of selection. In this paper, we propose a new method, HyperAux, which yields auxiliary moment-query pairs by modeling the multi-modal hyper-interaction between video and language. Specifically, given a set of candidate moment-query pairs from a video, we construct a hypergraph with multiple hyperedges, each corresponding to a moment-query pair. Unlike traditional graphs, where each edge connects only two nodes (frames or queries), each hyperedge connects multiple nodes: all frames within a moment, semantically related frames outside the moment, and an input query. This design allows us to consider the frames within a moment as a whole, rather than modeling individual frame-query relationships separately. More importantly, organizing the relationships among all moment-query pairs within a video into a single large hypergraph facilitates selecting higher-quality pairs. On this hypergraph, we employ a hypergraph neural network (HGNN) to aggregate node information, update the hyperedges, and propagate video-language hyper-interactions to each connected node, resulting in context-aware node representations. This enables us to use node relevance to select high-quality moment-query pairs and to refine the moments' boundaries. We also exploit the discrepancy in semantic matching within and outside moments to construct a loss function for training the HGNN without human annotations. Our auxiliary data improves the performance of twelve VMR models under fully supervised, weakly supervised, and zero-shot settings across three widely used VMR datasets: ActivityNet Captions, Charades-STA, and QVHighlights. We will release the source code and models publicly.
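The abstract only describes the method at a high level. The following is a minimal sketch, not the authors' HyperAux implementation, of the two ideas it outlines: building a hypergraph whose hyperedges tie together the frames inside a candidate moment, semantically related frames outside it, and the paired query, and propagating video-language interactions with a standard hypergraph convolution. Function names such as build_incidence, HyperGraphConv, and matching_gap_loss, as well as the margin-based form of the annotation-free loss, are assumptions made for illustration only.

```python
# Minimal sketch (not the paper's released code): construct a hypergraph over
# candidate moment-query pairs and propagate video-language interactions with a
# standard hypergraph convolution. Frame and query features are assumed to be
# pre-extracted; every name below is hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_incidence(num_frames, num_queries, pairs):
    """Incidence matrix H of shape (num_nodes, num_hyperedges).

    Nodes: num_frames frame nodes followed by num_queries query nodes.
    pairs: list of ((start, end), related_frames, query_idx); each pair becomes
    one hyperedge connecting the frames inside the moment, semantically related
    frames outside it, and the pair's query node, as described in the abstract.
    """
    H = torch.zeros(num_frames + num_queries, len(pairs))
    for e, ((start, end), related, q_idx) in enumerate(pairs):
        H[start:end + 1, e] = 1.0          # frames inside the moment
        if related:
            H[list(related), e] = 1.0      # related frames outside the moment
        H[num_frames + q_idx, e] = 1.0     # the query node of this pair
    return H


class HyperGraphConv(nn.Module):
    """One hypergraph-convolution step in the common HGNN formulation:
    X' = Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta, with edge weights W defaulting to ones."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, X, H, edge_w=None):
        W = torch.ones(H.size(1)) if edge_w is None else edge_w
        Dv = ((H * W).sum(dim=1).clamp(min=1e-6)) ** -0.5       # node degrees^-1/2
        De = (H.sum(dim=0).clamp(min=1e-6)) ** -1.0             # edge degrees^-1
        P = (Dv[:, None] * H * W * De) @ (H.t() * Dv[None, :])  # node-edge-node propagation
        return F.relu(P @ self.theta(X))


def matching_gap_loss(frame_feats, query_feat, inside_idx, outside_idx, margin=0.2):
    """Annotation-free signal suggested by the abstract: frames inside a good
    moment should match the query better than frames outside it. The margin
    form used here is an assumption, not the paper's exact loss."""
    sim_in = F.cosine_similarity(frame_feats[inside_idx], query_feat[None, :], dim=-1).mean()
    sim_out = F.cosine_similarity(frame_feats[outside_idx], query_feat[None, :], dim=-1).mean()
    return F.relu(margin - (sim_in - sim_out))
```

For example, with 100 frames, 5 queries, and a list of candidate pairs, `H = build_incidence(100, 5, pairs)` followed by `X = HyperGraphConv(512, 512)(features, H)` would yield context-aware node representations. In the paper, such representations are then used to score and select high-quality moment-query pairs and to refine moment boundaries; that selection step is not reconstructed here.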
About the Journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.