Hao Wang, Fang Liu, Licheng Jiao, Jiahao Wang, Shuo Li, Lingling Li, Puhua Chen, Xu Liu
{"title":"视觉提示:上下文感知双提示组合视频检索","authors":"Hao Wang , Fang Liu , Licheng Jiao , Jiahao Wang , Shuo Li , Lingling Li , Puhua Chen , Xu Liu","doi":"10.1016/j.patcog.2025.112378","DOIUrl":null,"url":null,"abstract":"<div><div>Composed video retrieval (CoVR) is a challenging task of retrieving relevant videos in a corpus by using a query that integrates both a relative change text and a reference video. Most existing CoVR models simply rely on the late-fusion strategy to combine visual and change text. Furthermore, various methods have been proposed to generate pseudo-word tokens from the reference video, which are then integrated into the relative change text for CoVR. However, these pseudo-word-based techniques exhibit limitations when the target video involves complex changes from the reference video, <em>e.g.</em>, object removal. In this work, we propose a novel CoVR framework that learns context information via context-aware dual prompts for relative change text to achieve effective composed video retrieval. The dual prompts cater to two aspects: 1) Global descriptive prompts generated from the pretrained V-L models, <em>e.g.</em>, BLIP-2, to get concise textual representations of the reference video. 2) Local target prompts to learn the target representations that the change text pays attention to. By connecting these prompts with relative change text, one can easily use existing text-to-video retrieval models to enhance CoVR performance. Our proposed framework can be flexibly used for both composed video retrieval (CoVR) and composed image retrieval (CoIR) tasks. Moreover, we take a pioneering approach by adopting the CoVR model to achieve zero-shot CoIR for remote sensing. Experiments on four datasets show that our approach achieves state-of-the-art performance in both CoVR and zero-shot CoIR tasks, with improvements of as high as around 3.5 % in terms of recall@K=1 score.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"172 ","pages":"Article 112378"},"PeriodicalIF":7.6000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Vision-by-prompt: Context-aware dual prompts for composed video retrieval\",\"authors\":\"Hao Wang , Fang Liu , Licheng Jiao , Jiahao Wang , Shuo Li , Lingling Li , Puhua Chen , Xu Liu\",\"doi\":\"10.1016/j.patcog.2025.112378\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Composed video retrieval (CoVR) is a challenging task of retrieving relevant videos in a corpus by using a query that integrates both a relative change text and a reference video. Most existing CoVR models simply rely on the late-fusion strategy to combine visual and change text. Furthermore, various methods have been proposed to generate pseudo-word tokens from the reference video, which are then integrated into the relative change text for CoVR. However, these pseudo-word-based techniques exhibit limitations when the target video involves complex changes from the reference video, <em>e.g.</em>, object removal. In this work, we propose a novel CoVR framework that learns context information via context-aware dual prompts for relative change text to achieve effective composed video retrieval. The dual prompts cater to two aspects: 1) Global descriptive prompts generated from the pretrained V-L models, <em>e.g.</em>, BLIP-2, to get concise textual representations of the reference video. 
2) Local target prompts to learn the target representations that the change text pays attention to. By connecting these prompts with relative change text, one can easily use existing text-to-video retrieval models to enhance CoVR performance. Our proposed framework can be flexibly used for both composed video retrieval (CoVR) and composed image retrieval (CoIR) tasks. Moreover, we take a pioneering approach by adopting the CoVR model to achieve zero-shot CoIR for remote sensing. Experiments on four datasets show that our approach achieves state-of-the-art performance in both CoVR and zero-shot CoIR tasks, with improvements of as high as around 3.5 % in terms of recall@K=1 score.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"172 \",\"pages\":\"Article 112378\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320325010398\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325010398","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Vision-by-prompt: Context-aware dual prompts for composed video retrieval
Composed video retrieval (CoVR) is the challenging task of retrieving relevant videos from a corpus using a query that combines a relative change text with a reference video. Most existing CoVR models rely on a simple late-fusion strategy to combine the visual content and the change text. Other methods generate pseudo-word tokens from the reference video and integrate them into the relative change text. However, these pseudo-word-based techniques break down when the target video involves complex changes from the reference video, e.g., object removal. In this work, we propose a novel CoVR framework that learns context information via context-aware dual prompts for the relative change text, enabling effective composed video retrieval. The dual prompts cover two aspects: 1) global descriptive prompts, generated by pretrained vision-language models such as BLIP-2, that provide concise textual representations of the reference video; and 2) local target prompts that learn the target representations the change text attends to. By concatenating these prompts with the relative change text, one can readily use existing text-to-video retrieval models to boost CoVR performance; a sketch of this composition follows below. The proposed framework applies flexibly to both composed video retrieval (CoVR) and composed image retrieval (CoIR). Moreover, we pioneer the use of a CoVR model to achieve zero-shot CoIR for remote sensing. Experiments on four datasets show that our approach achieves state-of-the-art performance on both CoVR and zero-shot CoIR, with improvements of up to around 3.5% in Recall@1.
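As a reading aid, here is a minimal PyTorch-style sketch of the dual-prompt composition the abstract describes: a global descriptive prompt captioned from the reference video, learnable local target prompts, and concatenation with the relative change text before an existing text encoder. All names here (captioner, text_encoder, DualPromptComposer) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of context-aware dual-prompt composition for CoVR.
# The captioner and text_encoder are assumed callables (e.g., a BLIP-2-style
# video captioner and the frozen text encoder of a text-to-video model);
# they are stand-ins, not the paper's actual modules.
import torch
import torch.nn as nn


class DualPromptComposer(nn.Module):
    """Builds a composed query from a reference video and a change text."""

    def __init__(self, text_encoder, captioner, num_local_prompts=4, dim=256):
        super().__init__()
        self.text_encoder = text_encoder  # maps a string to [B, seq, dim] tokens
        self.captioner = captioner        # maps a video to a short caption string
        # Local target prompts: learnable tokens meant to capture the
        # target content that the change text refers to.
        self.local_prompts = nn.Parameter(torch.randn(num_local_prompts, dim))

    def forward(self, reference_video, change_text):
        # 1) Global descriptive prompt: concise caption of the reference video.
        global_prompt = self.captioner(reference_video)
        # 2) Concatenate the global prompt with the relative change text and
        #    encode it with the existing (unchanged) text encoder.
        text_tokens = self.text_encoder(f"{global_prompt}. {change_text}")
        # 3) Prepend the learnable local target prompts to the text tokens,
        #    forming the composed query sequence.
        batch = text_tokens.size(0)
        local = self.local_prompts.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([local, text_tokens], dim=1)  # [B, P + seq, dim]
```

Under this reading, the composed query would then be scored against candidate video embeddings by the retrieval model's unchanged matching head, which is what lets the framework reuse existing text-to-video retrieval models without modification.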
Journal introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.