Transferable dual multi-granularity semantic excavating for partially relevant video retrieval
Journal: Image and Vision Computing (Impact Factor 4.2; JCR Q2, Computer Science, Artificial Intelligence)
Publication date: 2024-07-11
DOI: 10.1016/j.imavis.2024.105168
URL: https://www.sciencedirect.com/science/article/pii/S0262885624002737
Citations: 0
Abstract
Partially Relevant Video Retrieval (PRVR) aims to retrieve videos that are partially relevant to a given query from a large collection of unlabeled, untrimmed videos, and is formulated as a multiple instance learning problem. The challenge of PRVR is that it operates on untrimmed videos, which are much closer to real-world conditions. Existing methods insufficiently exploit video-text semantic consistency and lack the capacity to highlight the semantics of key representations. To tackle these issues, we propose a transferable dual multi-granularity semantic excavating network, called T-D3N, which focuses on enhancing the learning of dual-modal representations. Specifically, we first introduce a novel transferable textual semantic learning strategy by designing an Adaptive Multi-scale Semantic Mining (AMSM) component that excavates significant textual semantics from multiple perspectives. Second, T-D3N distinguishes feature differences from a frame-wise perspective to better perform contrastive learning between positive and negative samples in the video feature domain, which further separates positive from negative samples and raises the probability that positive samples are retrieved by the query. Finally, our model constructs multi-grained video temporal dependencies and conducts cross-grained core feature perception, which enables richer multimodal interactions. Extensive experiments on three benchmarks, i.e., ActivityNet Captions, Charades-STA, and TVR, show that T-D3N achieves state-of-the-art results. Furthermore, we confirm that our model transfers to a broad range of multimodal tasks such as T2VR, VMR, and MMSum.
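The multiple instance learning view and the contrastive objective mentioned in the abstract can be illustrated with a small, generic sketch. This is not the T-D3N architecture: the abstract gives no implementation details, so the max-over-clips bag score, the InfoNCE-style loss, the temperature value, the function names, and the toy two-dimensional embeddings below are all assumptions chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def video_score(query, clips):
    """MIL-style bag score (assumed): a video is as relevant as its best clip."""
    return max(cosine(query, clip) for clip in clips)


def info_nce(query, videos, positive_index, temperature=0.1):
    """Generic InfoNCE-style contrastive loss over a batch of candidate videos.

    The loss is low when the positive video scores higher than the negatives.
    The temperature of 0.1 is an arbitrary illustrative value.
    """
    logits = [video_score(query, clips) / temperature for clips in videos]
    peak = max(logits)  # subtract the peak for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[positive_index]


# Toy embeddings (hypothetical): each video is a list of clip vectors.
query = [1.0, 0.0]
videos = [
    [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]],  # video 0: one relevant clip among irrelevant ones
    [[0.0, 1.0], [-1.0, 0.5]],              # video 1: no relevant clip
]

scores = [video_score(query, clips) for clips in videos]
ranking = sorted(range(len(videos)), key=lambda i: scores[i], reverse=True)
loss_if_video_0_is_positive = info_nce(query, videos, positive_index=0)
loss_if_video_1_is_positive = info_nce(query, videos, positive_index=1)
```

In this toy setting, video 0 is ranked first even though only one of its three clips matches the query, which is the sense in which a video can be "partially relevant". Pushing the positive's score further above the negatives' scores lowers the loss, which mirrors the abstract's goal of distancing positive and negative samples.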
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.