Dynamic semantic prototype perception for text–video retrieval

IF 4.2 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Henghao Zhao, Rui Yan, Zechao Li
{"title":"文本视频检索的动态语义原型感知","authors":"Henghao Zhao,&nbsp;Rui Yan,&nbsp;Zechao Li","doi":"10.1016/j.imavis.2025.105515","DOIUrl":null,"url":null,"abstract":"<div><div>Semantic alignment between local visual regions and textual description is a promising solution for fine-grained text–video retrieval task. However, existing methods rely on the additional object detector as the explicit supervision, which is unfriendly to real application. To this end, a novel Dynamic Semantic Prototype Perception (DSP Perception) is proposed that automatically learns, constructs and infers the dynamic spatio-temporal dependencies between visual regions and text words without any explicit supervision. Specifically, DSP Perception consists of three components: the spatial semantic parsing module, the spatio-temporal semantic correlation module and the cross-modal semantic prototype alignment. The spatial semantic parsing module is leveraged to quantize visual patches to reduce the visual diversity, which helps to subsequently aggregate the similar semantic regions. The spatio-temporal semantic correlation module is introduced to learn dynamic information between adjacent frames and aggregate local features belonging to the same semantic in the video as tube. In addition, a novel global-to-local alignment strategy is proposed for the cross-modal semantic prototype alignment, which provides spatio-temporal cues for cross-modal perception of dynamic semantic prototypes. Thus, the proposed DSP Perception enables to capture local regions and their dynamic information within the video. Extensive experiments conducted on four widely-used datasets (MSR-VTT, MSVD, ActivityNet-Caption and DiDeMo) demonstrate the effectiveness of the proposed DSP Perception by comparison with several state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105515"},"PeriodicalIF":4.2000,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dynamic semantic prototype perception for text–video retrieval\",\"authors\":\"Henghao Zhao,&nbsp;Rui Yan,&nbsp;Zechao Li\",\"doi\":\"10.1016/j.imavis.2025.105515\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Semantic alignment between local visual regions and textual description is a promising solution for fine-grained text–video retrieval task. However, existing methods rely on the additional object detector as the explicit supervision, which is unfriendly to real application. To this end, a novel Dynamic Semantic Prototype Perception (DSP Perception) is proposed that automatically learns, constructs and infers the dynamic spatio-temporal dependencies between visual regions and text words without any explicit supervision. Specifically, DSP Perception consists of three components: the spatial semantic parsing module, the spatio-temporal semantic correlation module and the cross-modal semantic prototype alignment. The spatial semantic parsing module is leveraged to quantize visual patches to reduce the visual diversity, which helps to subsequently aggregate the similar semantic regions. The spatio-temporal semantic correlation module is introduced to learn dynamic information between adjacent frames and aggregate local features belonging to the same semantic in the video as tube. 
In addition, a novel global-to-local alignment strategy is proposed for the cross-modal semantic prototype alignment, which provides spatio-temporal cues for cross-modal perception of dynamic semantic prototypes. Thus, the proposed DSP Perception enables to capture local regions and their dynamic information within the video. Extensive experiments conducted on four widely-used datasets (MSR-VTT, MSVD, ActivityNet-Caption and DiDeMo) demonstrate the effectiveness of the proposed DSP Perception by comparison with several state-of-the-art methods.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"158 \",\"pages\":\"Article 105515\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-03-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625001039\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625001039","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Semantic alignment between local visual regions and textual descriptions is a promising solution for the fine-grained text–video retrieval task. However, existing methods rely on an additional object detector for explicit supervision, which is impractical in real applications. To this end, a novel Dynamic Semantic Prototype Perception (DSP Perception) method is proposed that automatically learns, constructs and infers dynamic spatio-temporal dependencies between visual regions and text words without any explicit supervision. Specifically, DSP Perception consists of three components: a spatial semantic parsing module, a spatio-temporal semantic correlation module and a cross-modal semantic prototype alignment. The spatial semantic parsing module quantizes visual patches to reduce visual diversity, which helps to subsequently aggregate semantically similar regions. The spatio-temporal semantic correlation module learns dynamic information between adjacent frames and aggregates local features that share the same semantics across the video into tubes. In addition, a novel global-to-local alignment strategy is proposed for the cross-modal semantic prototype alignment, providing spatio-temporal cues for cross-modal perception of dynamic semantic prototypes. Thus, DSP Perception is able to capture local regions and their dynamic information within a video. Extensive experiments on four widely used datasets (MSR-VTT, MSVD, ActivityNet-Caption and DiDeMo) demonstrate the effectiveness of DSP Perception in comparison with several state-of-the-art methods.
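The page carries no code, so as a reading aid here is a minimal, hypothetical PyTorch sketch of the patch-quantization idea the abstract attributes to the spatial semantic parsing module: frame patch features are softly assigned to a small set of learnable semantic prototypes, so that semantically similar regions collapse onto the same prototype slot before aggregation. Every name and hyperparameter below (SpatialSemanticParser, num_prototypes, the 0.07 temperature) is an illustrative assumption, not a detail taken from the paper.

# Hypothetical sketch of prototype-based patch quantization;
# NOT the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSemanticParser(nn.Module):
    def __init__(self, dim: int = 512, num_prototypes: int = 64):
        super().__init__()
        # Learnable semantic prototypes acting as a soft codebook.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim) patch embeddings of one frame.
        p = F.normalize(patches, dim=-1)
        c = F.normalize(self.prototypes, dim=-1)
        # Soft assignment of each patch to the prototypes (temperature 0.07).
        assign = torch.softmax(p @ c.t() / 0.07, dim=-1)   # (B, N, K)
        # Prototype-pooled features: semantically similar regions aggregate
        # onto the same prototype slot, reducing visual diversity.
        return assign.transpose(1, 2) @ patches            # (B, K, dim)

# Usage: quantize 196 CLIP-style patch tokens into 64 semantic slots.
parser = SpatialSemanticParser()
out = parser(torch.randn(2, 196, 512))
print(out.shape)  # torch.Size([2, 64, 512])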
Source journal
Image and Vision Computing (Engineering & Technology - Engineering: Electronic & Electrical)
CiteScore: 8.50
Self-citation rate: 8.50%
Annual articles: 143
Review time: 7.8 months
Journal introduction: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.