Gist, Content, Target-Oriented: A 3-Level Human-Like Framework for Video Moment Retrieval

IF 8.4 | Zone 1: Computer Science | JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2024-08-14 DOI:10.1109/TMM.2024.3443672

Di Wang;Xiantao Lu;Quan Wang;Yumin Tian;Bo Wan;Lihuo He

IEEE Transactions on Multimedia, vol. 26, pp. 11044-11056. Full text: https://ieeexplore.ieee.org/document/10636802/

Citations: 0

Abstract


Gist, Content, Target-Oriented: A 3-Level Human-Like Framework for Video Moment Retrieval

Video moment retrieval (VMR) aims to locate the moment in an untrimmed video that corresponds to a given natural language query. While most existing approaches treat this task as a cross-modal content matching or boundary prediction problem, recent studies have started to solve the VMR problem from a reading comprehension perspective. However, the cross-modal interaction processes of existing models are either insufficient or overly complex. Therefore, we reanalyze human behaviors in the document fragment location task of reading comprehension, and design a specific module for each behavior to propose a 3-level human-like moment retrieval framework (Tri-MRF). Specifically, we summarize three human behaviors: grasping the general structures of the document and the question separately, cross-scanning to mark the direct correspondences between keywords in the document and in the question, and summarizing to obtain the overall correspondences between document fragments and the question. Correspondingly, the proposed Tri-MRF model contains three modules: 1) a gist-oriented intra-modal comprehension module that establishes contextual dependencies within each modality; 2) a content-oriented fine-grained comprehension module that explores direct correspondences between clips and words; and 3) a target-oriented integrated comprehension module that verifies the overall correspondence between the candidate moments and the query. In addition, we introduce a biconnected GCN feature enhancement module to optimize query-guided moment representations. Extensive experiments conducted on three benchmarks, TACoS, ActivityNet Captions, and Charades-STA, demonstrate that the proposed framework outperforms state-of-the-art methods.
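The three levels described in the abstract can be illustrated with a toy sketch. This is not the authors' Tri-MRF implementation: the abstract does not specify the layers, so the sketch assumes plain scaled dot-product attention for levels 1 and 2, mean-pooled cosine similarity for level 3, and omits the biconnected GCN enhancement entirely. All function names (`attend`, `retrieve_moment`, and the helpers) are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys, values):
    # Scaled dot-product attention: each query row becomes a
    # softmax-weighted mix of the value rows.
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mean_rows(rows):
    # Column-wise mean of a non-empty list of equal-length vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def retrieve_moment(clips, words):
    # Level 1 (gist): contextualize each modality on its own (assumed self-attention).
    clips_ctx = attend(clips, clips, clips)
    words_ctx = attend(words, words, words)
    # Level 2 (content): every clip attends over the words (assumed cross-attention),
    # and the result is added to the clip feature.
    clips_q = attend(clips_ctx, words_ctx, words_ctx)
    fused = [[c + q for c, q in zip(cr, qr)]
             for cr, qr in zip(clips_ctx, clips_q)]
    # Level 3 (target): score each candidate span [start, end] against the whole query.
    query_vec = mean_rows(words_ctx)
    best = None
    for start in range(len(fused)):
        for end in range(start, len(fused)):
            score = cosine(mean_rows(fused[start:end + 1]), query_vec)
            if best is None or score > best[0]:
                best = (score, start, end)
    return best  # (score, start_clip_index, end_clip_index)

if __name__ == "__main__":
    # Made-up 2-D features: four video clips and a two-word query.
    clips = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    words = [[0.0, 1.0], [0.0, 1.0]]
    print(retrieve_moment(clips, words))
```

The exhaustive span enumeration in level 3 is quadratic in the number of clips, which is fine for a toy example; a real system would typically restrict or sample the candidate moments.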


Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70
Self-citation rate: 11.00%
Articles published: 576
Review time: 5.5 months

About the journal: The IEEE Transactions on Multimedia covers multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope follows the Fields of Interest of its sponsors.