Di Wang;Xiantao Lu;Quan Wang;Yumin Tian;Bo Wan;Lihuo He
{"title":"Gist, Content, Target-Oriented: A 3-Level Human-Like Framework for Video Moment Retrieval","authors":"Di Wang;Xiantao Lu;Quan Wang;Yumin Tian;Bo Wan;Lihuo He","doi":"10.1109/TMM.2024.3443672","DOIUrl":null,"url":null,"abstract":"Video moment retrieval (VMR) aims to locate corresponding moments in an untrimmed video via a given natural language query. While most existing approaches treat this task as a cross-modal content matching or boundary prediction problem, recent studies have started to solve the VMR problem from a reading comprehension perspective. However, the cross-modal interaction processes of existing models are either insufficient or overly complex. Therefore, we reanalyze human behaviors in the document fragment location task of reading comprehension, and design a specific module for each behavior to propose a 3-level human-like moment retrieval framework (Tri-MRF). Specifically, we summarize human behaviors such as grasping the general structures of the document and the question separately, cross-scanning to mark the direct correspondences between keywords in the document and in the question, and summarizing to obtain the overall correspondences between document fragments and the question. Correspondingly, the proposed Tri-MRF model contains three modules: 1) a gist-oriented intra-modal comprehension module is used to establish contextual dependencies within each modality; 2) a content-oriented fine-grained comprehension module is used to explore direct correspondences between clips and words; and 3) a target-oriented integrated comprehension module is used to verify the overall correspondence between the candidate moments and the query. In addition, we introduce a biconnected GCN feature enhancement module to optimize query-guided moment representations. Extensive experiments conducted on three benchmarks, TACoS, ActivityNet Captions and Charades-STA demonstrate that the proposed framework outperforms State-of-the-Art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11044-11056"},"PeriodicalIF":8.4000,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10636802/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Video moment retrieval (VMR) aims to locate corresponding moments in an untrimmed video via a given natural language query. While most existing approaches treat this task as a cross-modal content matching or boundary prediction problem, recent studies have started to solve the VMR problem from a reading comprehension perspective. However, the cross-modal interaction processes of existing models are either insufficient or overly complex. Therefore, we reanalyze human behaviors in the document fragment location task of reading comprehension, and design a specific module for each behavior to propose a 3-level human-like moment retrieval framework (Tri-MRF). Specifically, we summarize human behaviors such as grasping the general structures of the document and the question separately, cross-scanning to mark the direct correspondences between keywords in the document and in the question, and summarizing to obtain the overall correspondences between document fragments and the question. Correspondingly, the proposed Tri-MRF model contains three modules: 1) a gist-oriented intra-modal comprehension module is used to establish contextual dependencies within each modality; 2) a content-oriented fine-grained comprehension module is used to explore direct correspondences between clips and words; and 3) a target-oriented integrated comprehension module is used to verify the overall correspondence between the candidate moments and the query. In addition, we introduce a biconnected GCN feature enhancement module to optimize query-guided moment representations. Extensive experiments conducted on three benchmarks, TACoS, ActivityNet Captions and Charades-STA demonstrate that the proposed framework outperforms State-of-the-Art methods.
期刊介绍:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.