Dynamic semantic prototype perception for text–video retrieval

IF 4.2 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Henghao Zhao, Rui Yan, Zechao Li
{"title":"文本视频检索的动态语义原型感知","authors":"Henghao Zhao,&nbsp;Rui Yan,&nbsp;Zechao Li","doi":"10.1016/j.imavis.2025.105515","DOIUrl":null,"url":null,"abstract":"<div><div>Semantic alignment between local visual regions and textual description is a promising solution for fine-grained text–video retrieval task. However, existing methods rely on the additional object detector as the explicit supervision, which is unfriendly to real application. To this end, a novel Dynamic Semantic Prototype Perception (DSP Perception) is proposed that automatically learns, constructs and infers the dynamic spatio-temporal dependencies between visual regions and text words without any explicit supervision. Specifically, DSP Perception consists of three components: the spatial semantic parsing module, the spatio-temporal semantic correlation module and the cross-modal semantic prototype alignment. The spatial semantic parsing module is leveraged to quantize visual patches to reduce the visual diversity, which helps to subsequently aggregate the similar semantic regions. The spatio-temporal semantic correlation module is introduced to learn dynamic information between adjacent frames and aggregate local features belonging to the same semantic in the video as tube. In addition, a novel global-to-local alignment strategy is proposed for the cross-modal semantic prototype alignment, which provides spatio-temporal cues for cross-modal perception of dynamic semantic prototypes. Thus, the proposed DSP Perception enables to capture local regions and their dynamic information within the video. Extensive experiments conducted on four widely-used datasets (MSR-VTT, MSVD, ActivityNet-Caption and DiDeMo) demonstrate the effectiveness of the proposed DSP Perception by comparison with several state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105515"},"PeriodicalIF":4.2000,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dynamic semantic prototype perception for text–video retrieval\",\"authors\":\"Henghao Zhao,&nbsp;Rui Yan,&nbsp;Zechao Li\",\"doi\":\"10.1016/j.imavis.2025.105515\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Semantic alignment between local visual regions and textual description is a promising solution for fine-grained text–video retrieval task. However, existing methods rely on the additional object detector as the explicit supervision, which is unfriendly to real application. To this end, a novel Dynamic Semantic Prototype Perception (DSP Perception) is proposed that automatically learns, constructs and infers the dynamic spatio-temporal dependencies between visual regions and text words without any explicit supervision. Specifically, DSP Perception consists of three components: the spatial semantic parsing module, the spatio-temporal semantic correlation module and the cross-modal semantic prototype alignment. The spatial semantic parsing module is leveraged to quantize visual patches to reduce the visual diversity, which helps to subsequently aggregate the similar semantic regions. The spatio-temporal semantic correlation module is introduced to learn dynamic information between adjacent frames and aggregate local features belonging to the same semantic in the video as tube. 
In addition, a novel global-to-local alignment strategy is proposed for the cross-modal semantic prototype alignment, which provides spatio-temporal cues for cross-modal perception of dynamic semantic prototypes. Thus, the proposed DSP Perception enables to capture local regions and their dynamic information within the video. Extensive experiments conducted on four widely-used datasets (MSR-VTT, MSVD, ActivityNet-Caption and DiDeMo) demonstrate the effectiveness of the proposed DSP Perception by comparison with several state-of-the-art methods.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"158 \",\"pages\":\"Article 105515\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-03-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625001039\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625001039","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Semantic alignment between local visual regions and textual descriptions is a promising solution for the fine-grained text–video retrieval task. However, existing methods rely on an additional object detector for explicit supervision, which is impractical in real applications. To this end, a novel Dynamic Semantic Prototype Perception (DSP Perception) method is proposed that automatically learns, constructs and infers dynamic spatio-temporal dependencies between visual regions and text words without any explicit supervision. Specifically, DSP Perception consists of three components: a spatial semantic parsing module, a spatio-temporal semantic correlation module and a cross-modal semantic prototype alignment. The spatial semantic parsing module quantizes visual patches to reduce visual diversity, which helps to subsequently aggregate semantically similar regions. The spatio-temporal semantic correlation module learns dynamic information between adjacent frames and aggregates local features that share the same semantics across the video into tubes. In addition, a novel global-to-local alignment strategy is proposed for the cross-modal semantic prototype alignment, providing spatio-temporal cues for cross-modal perception of dynamic semantic prototypes. Thus, DSP Perception is able to capture local regions and their dynamic information within a video. Extensive experiments on four widely used datasets (MSR-VTT, MSVD, ActivityNet-Caption and DiDeMo) demonstrate the effectiveness of DSP Perception in comparison with several state-of-the-art methods.
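The page carries no code, so as a reading aid here is a minimal, hypothetical PyTorch sketch of the patch-quantization idea the abstract attributes to the spatial semantic parsing module: frame patch features are softly assigned to a small set of learnable semantic prototypes, so that semantically similar regions collapse onto the same prototype slot before aggregation. Every name and hyperparameter below (SpatialSemanticParser, num_prototypes, the 0.07 temperature) is an illustrative assumption, not a detail taken from the paper.

# Hypothetical sketch of prototype-based patch quantization;
# NOT the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSemanticParser(nn.Module):
    def __init__(self, dim: int = 512, num_prototypes: int = 64):
        super().__init__()
        # Learnable semantic prototypes acting as a soft codebook.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim) patch embeddings of one frame.
        p = F.normalize(patches, dim=-1)
        c = F.normalize(self.prototypes, dim=-1)
        # Soft assignment of each patch to the prototypes (temperature 0.07).
        assign = torch.softmax(p @ c.t() / 0.07, dim=-1)   # (B, N, K)
        # Prototype-pooled features: semantically similar regions aggregate
        # onto the same prototype slot, reducing visual diversity.
        return assign.transpose(1, 2) @ patches            # (B, K, dim)

# Usage: quantize 196 CLIP-style patch tokens into 64 semantic slots.
parser = SpatialSemanticParser()
out = parser(torch.randn(2, 196, 512))
print(out.shape)  # torch.Size([2, 64, 512])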
Source journal
Image and Vision Computing (Engineering & Technology - Engineering: Electronic & Electrical)
CiteScore: 8.50
Self-citation rate: 8.50%
Annual articles: 143
Review time: 7.8 months
Journal introduction: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.