Authors: Henghao Zhao, Rui Yan, Zechao Li
DOI: 10.1016/j.imavis.2025.105515
Journal: Image and Vision Computing, Volume 158, Article 105515
Publication date: 2025-03-27 (Journal Article)
Journal metrics: Impact Factor 4.2; JCR Q2 (Computer Science, Artificial Intelligence)
Source: https://www.sciencedirect.com/science/article/pii/S0262885625001039
Dynamic semantic prototype perception for text–video retrieval
Semantic alignment between local visual regions and textual descriptions is a promising solution to the fine-grained text–video retrieval task. However, existing methods rely on an additional object detector for explicit supervision, which is impractical in real applications. To this end, a novel Dynamic Semantic Prototype Perception (DSP Perception) is proposed that automatically learns, constructs, and infers dynamic spatio-temporal dependencies between visual regions and text words without any explicit supervision. Specifically, DSP Perception consists of three components: a spatial semantic parsing module, a spatio-temporal semantic correlation module, and a cross-modal semantic prototype alignment. The spatial semantic parsing module quantizes visual patches to reduce visual diversity, which helps to subsequently aggregate regions with similar semantics. The spatio-temporal semantic correlation module learns dynamic information between adjacent frames and aggregates local features that share the same semantics across the video into tubes. In addition, a novel global-to-local alignment strategy is proposed for the cross-modal semantic prototype alignment, providing spatio-temporal cues for the cross-modal perception of dynamic semantic prototypes. Thus, the proposed DSP Perception can capture local regions and their dynamic information within a video. Extensive experiments on four widely used datasets (MSR-VTT, MSVD, ActivityNet-Caption, and DiDeMo) demonstrate the effectiveness of the proposed DSP Perception in comparison with several state-of-the-art methods.
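The three-stage pipeline the abstract describes — quantizing patches to semantic prototypes, pooling same-prototype patches across frames into tubes, then scoring text against video globally and locally — can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the function names, mean-pooling of tubes, and the simple average of global and word-to-tube cosine similarities are illustrative assumptions standing in for the learned modules.

```python
import numpy as np

def quantize_patches(patch_feats, prototypes):
    """Assign each visual patch to its nearest semantic prototype
    (a rough analogue of the spatial semantic parsing step).
    patch_feats: (T, P, D) frames x patches x dim; prototypes: (K, D)."""
    dists = np.linalg.norm(patch_feats[..., None, :] - prototypes, axis=-1)  # (T, P, K)
    return dists.argmin(-1)  # (T, P) prototype index per patch

def aggregate_tubes(patch_feats, assignments, num_prototypes):
    """Pool patches that share a prototype across all frames into one
    'tube' feature per prototype (mean over member patches here;
    the paper's correlation module is learned, not a plain mean)."""
    T, P, D = patch_feats.shape
    flat_feats = patch_feats.reshape(-1, D)
    flat_assign = assignments.reshape(-1)
    tubes = np.zeros((num_prototypes, D))
    for k in range(num_prototypes):
        members = flat_feats[flat_assign == k]
        if len(members):
            tubes[k] = members.mean(0)
    return tubes

def global_to_local_score(text_global, text_words, video_global, tubes):
    """Global-to-local alignment sketch: average a global text-video
    cosine similarity with the mean best word-to-tube cosine match."""
    def cos(a, b):
        return a @ b.T / (np.linalg.norm(a, axis=-1, keepdims=True)
                          * np.linalg.norm(b, axis=-1, keepdims=True).T + 1e-8)
    g = cos(text_global[None], video_global[None])[0, 0]
    local = cos(text_words, tubes).max(-1).mean()  # each word to its best tube
    return 0.5 * (g + local)
```

At retrieval time such a score would be computed for a query against every candidate video and the candidates ranked by it; the detector-free aspect shows up in `quantize_patches`, where region grouping comes from prototype assignment rather than box supervision.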
Journal introduction:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.