Vision-by-prompt: Context-aware dual prompts for composed video retrieval

IF 7.6 · Zone 1 (Computer Science) · JCR Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition · Pub Date: 2025-09-01 · DOI: 10.1016/j.patcog.2025.112378

Hao Wang, Fang Liu, Licheng Jiao, Jiahao Wang, Shuo Li, Lingling Li, Puhua Chen, Xu Liu

Pattern Recognition, Volume 172, Article 112378. https://www.sciencedirect.com/science/article/pii/S0031320325010398


Abstract

Composed video retrieval (CoVR) is the challenging task of retrieving relevant videos from a corpus using a query that combines a reference video with a relative change text. Most existing CoVR models rely on a simple late-fusion strategy to combine the visual information and the change text. Other methods generate pseudo-word tokens from the reference video and integrate them into the relative change text. However, these pseudo-word-based techniques show limitations when the target video involves complex changes from the reference video, e.g., object removal. In this work, we propose a novel CoVR framework that learns context information via context-aware dual prompts for the relative change text. The dual prompts cover two aspects: 1) global descriptive prompts, generated by pretrained vision-language (V-L) models such as BLIP-2, which give concise textual representations of the reference video; and 2) local target prompts, which learn the target representations that the change text attends to. By connecting these prompts with the relative change text, one can readily use existing text-to-video retrieval models to improve CoVR performance. The proposed framework can be used flexibly for both CoVR and composed image retrieval (CoIR) tasks. Moreover, we take a pioneering approach by adopting the CoVR model to achieve zero-shot CoIR for remote sensing. Experiments on four datasets show that our approach achieves state-of-the-art performance in both CoVR and zero-shot CoIR tasks, with improvements of up to about 3.5% in Recall@1 (recall@K with K = 1).
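The retrieval flow the abstract describes (connect the prompts to the change text, rank corpus videos against the resulting text query, score with recall@K) can be illustrated with a small sketch. This is not the authors' implementation: `compose_query`, its string format, and the toy two-dimensional embeddings are assumptions made for illustration, and the paper's local target prompts are learned representations rather than plain strings. Only the recall@K definition is standard.

```python
from math import sqrt


def compose_query(global_prompt: str, target_prompt: str, change_text: str) -> str:
    """Join both prompts with the change text (hypothetical plain-text format)."""
    return f"{global_prompt}. {target_prompt}. {change_text}"


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(query_vec, video_vecs):
    """Return corpus indices sorted by descending similarity to the query."""
    return sorted(
        range(len(video_vecs)),
        key=lambda i: cosine(query_vec, video_vecs[i]),
        reverse=True,
    )


def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target index appears in the top-k results."""
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return hits / len(targets)


# Hypothetical composed query text; a real system would encode it with a
# text-to-video retrieval model instead of using the toy vectors below.
query_text = compose_query(
    "a dog running on a beach",      # stand-in for a global descriptive prompt
    "the dog",                       # stand-in for a local target prompt
    "remove the ball from the scene" # relative change text
)

# Toy embeddings (assumed): three corpus videos and three composed queries.
video_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vecs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
targets = [0, 1, 2]  # index of the correct target video for each query

rankings = [rank_videos(q, video_vecs) for q in query_vecs]
print(query_text)
print(recall_at_k(rankings, targets, k=1))
print(recall_at_k(rankings, targets, k=2))
```

In this toy setup the third query ranks its target second, so it counts as a miss at K = 1 and a hit at K = 2. That is the sense in which a gain in Recall@1 reflects more queries placing the correct video first.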

Source journal

Pattern Recognition (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Articles published: 683

Review time: 5.6 months

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.