{"title":"QSMT-net: A query-sensitive proposal and multi-temporal-span matching network for video grounding","authors":"","doi":"10.1016/j.imavis.2024.105188","DOIUrl":null,"url":null,"abstract":"<div><p>The video grounding task aims to retrieve moments from the videos corresponding to a given textual query. This task poses significant challenges because of the need to comprehend the semantic content of both videos and sentences as well as manage the matching relationship between modalities. Existing approaches struggle to effectively meet this challenge, as they often lack consideration for the diversity in constructing proposals to fit segments from varied scenes and disregard the multi-temporal scale matching relationship between queries and proposals. In this paper, we propose the Query-Sensitive Proposal and Multi-Temporal-Span Matching Network (QSMT-Net), an innovative framework designed to generate more distinctive proposals and to enhance the matching between queries and candidate proposals over varying temporal spans. First, we fortify the connection between modes by instituting fine-grained interactions between video clips and textual words. Subsequently, through a learnable pooling mechanism, we dynamically construct candidate proposals tailored to specific queries, thus implementing a query-sensitive proposal generation strategy. Second, we enhanced the model's ability to differentiate adjacent candidate proposals through the multi-temporal-span matching network, which facilitated selecting the most accurate proposal results under various time scales. Experiments on three widely used benchmarks, Charades-STA, TACoS and ActivityNet Captions, our approach demonstrated significant improvements over state-of-the-art methods, indicating promising advancements in video grounding.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624002932","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
The video grounding task aims to retrieve the moments in a video that correspond to a given textual query. The task is challenging because it requires comprehending the semantic content of both videos and sentences while also managing the matching relationship between the two modalities. Existing approaches struggle to meet this challenge effectively: they often overlook the diversity needed when constructing proposals to fit segments from varied scenes, and they disregard the multi-temporal-scale matching relationship between queries and proposals. In this paper, we propose the Query-Sensitive Proposal and Multi-Temporal-Span Matching Network (QSMT-Net), a framework designed to generate more distinctive proposals and to strengthen the matching between queries and candidate proposals over varying temporal spans. First, we reinforce the connection between modalities by establishing fine-grained interactions between video clips and textual words; a learnable pooling mechanism then dynamically constructs candidate proposals tailored to the specific query, implementing a query-sensitive proposal generation strategy. Second, a multi-temporal-span matching network improves the model's ability to differentiate adjacent candidate proposals, which facilitates selecting the most accurate proposal across various time scales. In experiments on three widely used benchmarks, Charades-STA, TACoS, and ActivityNet Captions, our approach demonstrates significant improvements over state-of-the-art methods, indicating promising advances in video grounding.
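To make the query-sensitive pooling idea from the abstract concrete, the following is a minimal PyTorch sketch of how a query-conditioned learnable pooling over clip features could produce proposal representations, together with a simple similarity-based matching step. The abstract does not specify the architecture, so every module name, dimension, and scoring choice here is a hypothetical assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch, NOT the QSMT-Net authors' code.
# Assumed inputs: clip features (B, T, D) and a query embedding (B, D)
# already produced by some video and text encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuerySensitivePooling(nn.Module):
    """Pools the clips inside each candidate span with per-clip weights
    conditioned on the query (a learnable, query-sensitive pooling)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores each (clip, query) pair

    def forward(self, clips, query, spans):
        # clips: (B, T, D); query: (B, D); spans: list of (start, end) indices
        proposals = []
        for s, e in spans:
            seg = clips[:, s:e]                       # (B, L, D) clips in span
            q = query.unsqueeze(1).expand_as(seg)     # broadcast query per clip
            w = self.score(torch.cat([seg, q], -1))   # (B, L, 1) raw weights
            w = torch.softmax(w, dim=1)               # attention over the span
            proposals.append((w * seg).sum(dim=1))    # (B, D) pooled proposal
        return torch.stack(proposals, dim=1)          # (B, P, D)

def match_scores(proposals, query):
    """Cosine similarity between each pooled proposal and the query; in a
    multi-temporal-span setup this would be computed per span length."""
    p = F.normalize(proposals, dim=-1)                # (B, P, D)
    q = F.normalize(query, dim=-1).unsqueeze(-1)      # (B, D, 1)
    return torch.bmm(p, q).squeeze(-1)                # (B, P) score per proposal
```

Under this reading, "query-sensitive" means the pooling weights change with the query, so the same span can yield different proposal features for different sentences; the multi-span matching would then score proposals generated at several span lengths and keep the best-matching one.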
Journal Introduction
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding of the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.