Task-to-Instance Prompt Learning for Vision-Language Models at Test Time
Zhihe Lu; Jiawang Bai; Xin Li; Zeyu Xiao; Xinchao Wang
IEEE Transactions on Image Processing, vol. 34, pp. 1908-1920, published 2025-03-14. DOI: 10.1109/TIP.2025.3546840. Available at https://ieeexplore.ieee.org/document/10925517/.
Prompt learning has recently been introduced into the adaptation of pre-trained vision-language models (VLMs) by tuning a set of trainable tokens to replace hand-crafted text templates. Despite the encouraging results achieved, existing methods largely rely on extra annotated data for training. In this paper, we investigate a more realistic scenario, where only unlabeled test data is available. Existing test-time prompt learning methods often learn a separate prompt for each test sample. However, relying solely on a single sample heavily limits the performance of the learned prompts, as it neglects the task-level knowledge that can be gained from multiple samples. To that end, we propose a novel test-time prompt learning method for VLMs, called Task-to-Instance PromPt LEarning (TIPPLE), which adopts a two-stage training strategy to leverage both task- and instance-level knowledge. Specifically, we reformulate the effective online pseudo-labeling paradigm with two tailored components, an auxiliary text classification task and a diversity regularization term, to serve task-oriented prompt learning. After that, the learned task-level prompt is further combined with a tunable residual for each test sample to integrate instance-level knowledge. We demonstrate the superior performance of TIPPLE on 15 downstream datasets, e.g., an average improvement of 1.87% over the state-of-the-art method using the ViT-B/16 visual backbone. Our code is open-sourced at https://github.com/zhiheLu/TIPPLE.
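To make the instance-level stage concrete, below is a minimal, illustrative sketch (not the authors' implementation) of the idea of adding a per-sample tunable residual on top of a frozen task-level prompt and optimizing it at test time. The toy text encoder, the random features, and the entropy-minimization objective are all placeholder assumptions standing in for a real CLIP-style VLM and for whatever objective TIPPLE actually uses; only the residual-on-prompt structure reflects the abstract.

```python
# Minimal sketch, assuming a CLIP-like setup; all names and the objective are
# illustrative placeholders, not the TIPPLE codebase.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, prompt_len, dim = 10, 4, 512

# Stand-ins for a frozen VLM: random per-class text projections and one image feature.
class_proj = torch.randn(num_classes, dim)
image_feat = F.normalize(torch.randn(dim), dim=-1)

def encode_text(prompt_tokens):
    # Toy "text encoder": mean-pool prompt tokens and add to each class projection.
    pooled = prompt_tokens.mean(dim=0)                      # (dim,)
    return F.normalize(class_proj + pooled, dim=-1)         # (num_classes, dim)

# Stage 1 output: task-level prompt (assumed already learned; kept frozen here).
task_prompt = torch.randn(prompt_len, dim)

# Stage 2: a per-sample residual added to the task-level prompt and tuned at test time.
residual = torch.zeros_like(task_prompt, requires_grad=True)
optimizer = torch.optim.AdamW([residual], lr=5e-3)

for step in range(10):
    text_feats = encode_text(task_prompt + residual)        # (num_classes, dim)
    logits = 100.0 * image_feat @ text_feats.t()            # CLIP-style cosine logits
    probs = logits.softmax(dim=-1)
    # Placeholder objective: entropy minimization on the single test sample.
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()

print("predicted class:", logits.argmax().item())
```

In this sketch only the residual receives gradients, so the task-level prompt learned in the first stage is preserved while each test sample gets its own lightweight instance-level adjustment.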