{"title":"视觉语言模型的双模态个体意识提示调优","authors":"Hantao Yao;Rui Zhang;Huaihai Lyu;Yongdong Zhang;Changsheng Xu","doi":"10.1109/TPAMI.2025.3557780","DOIUrl":null,"url":null,"abstract":"Prompt tuning is a valuable technique for adapting visual language models (VLMs) to different downstream tasks, such as domain generalization and learning from a few examples. Previous methods have utilized Context Optimization approaches to deduce domain-shared or cross-modality prompt tokens, which enhance generalization and discriminative ability in textual or visual contexts. However, these prompt tokens, inferred from training data, cannot adapt perfectly to the distribution of the test dataset. This work introduces a novel approach called Bi-modality Individual-aware Prompt Tuning (BIP) by explicitly incorporating the individual's essential prior knowledge into the learnable prompt to enhance their discriminability and generalization. The critical insight of BIP involves applying the Textual Knowledge Embedding (TKE) and Visual Knowledge Embedding (VKE) models to project the class-aware textual essential knowledge and the instance-aware essential knowledge into the class-aware prompt and instance-aware prompt, referred to as Textual-level Class-aware Prompt tuning (TCP) and Visual-level Instance-aware Prompt tuning (VIP). On the one hand, TCP integrates the generated class-aware prompts into the Text Encoder to produce a dynamic class-aware classifier to improve generalization on unseen domains. On the other hand, VIP uses the instance-aware prompt to generate the dynamic visual embedding of each instance, thereby enhancing the discriminative capability of visual embedding. Comprehensive evaluations demonstrate that BIP can be used as a plug-and-play module easily integrated with existing methods and achieves superior performance on 15 benchmarks across four tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 8","pages":"6352-6368"},"PeriodicalIF":18.6000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Bi-Modality Individual-Aware Prompt Tuning for Visual-Language Model\",\"authors\":\"Hantao Yao;Rui Zhang;Huaihai Lyu;Yongdong Zhang;Changsheng Xu\",\"doi\":\"10.1109/TPAMI.2025.3557780\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Prompt tuning is a valuable technique for adapting visual language models (VLMs) to different downstream tasks, such as domain generalization and learning from a few examples. Previous methods have utilized Context Optimization approaches to deduce domain-shared or cross-modality prompt tokens, which enhance generalization and discriminative ability in textual or visual contexts. However, these prompt tokens, inferred from training data, cannot adapt perfectly to the distribution of the test dataset. This work introduces a novel approach called Bi-modality Individual-aware Prompt Tuning (BIP) by explicitly incorporating the individual's essential prior knowledge into the learnable prompt to enhance their discriminability and generalization. 
The critical insight of BIP involves applying the Textual Knowledge Embedding (TKE) and Visual Knowledge Embedding (VKE) models to project the class-aware textual essential knowledge and the instance-aware essential knowledge into the class-aware prompt and instance-aware prompt, referred to as Textual-level Class-aware Prompt tuning (TCP) and Visual-level Instance-aware Prompt tuning (VIP). On the one hand, TCP integrates the generated class-aware prompts into the Text Encoder to produce a dynamic class-aware classifier to improve generalization on unseen domains. On the other hand, VIP uses the instance-aware prompt to generate the dynamic visual embedding of each instance, thereby enhancing the discriminative capability of visual embedding. Comprehensive evaluations demonstrate that BIP can be used as a plug-and-play module easily integrated with existing methods and achieves superior performance on 15 benchmarks across four tasks.\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"47 8\",\"pages\":\"6352-6368\"},\"PeriodicalIF\":18.6000,\"publicationDate\":\"2025-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10949734/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10949734/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Bi-Modality Individual-Aware Prompt Tuning for Visual-Language Model
Prompt tuning is a valuable technique for adapting visual-language models (VLMs) to different downstream tasks, such as domain generalization and few-shot learning. Previous methods have used Context Optimization approaches to infer domain-shared or cross-modality prompt tokens, which enhance generalization and discriminative ability in the textual or visual context. However, these prompt tokens, inferred only from the training data, cannot adapt perfectly to the distribution of the test dataset. This work introduces Bi-modality Individual-aware Prompt Tuning (BIP), which explicitly incorporates an individual's essential prior knowledge into the learnable prompts to enhance their discriminability and generalization. The key insight of BIP is to apply the Textual Knowledge Embedding (TKE) and Visual Knowledge Embedding (VKE) models to project the class-aware textual essential knowledge and the instance-aware essential knowledge into a class-aware prompt and an instance-aware prompt, respectively, referred to as Textual-level Class-aware Prompt tuning (TCP) and Visual-level Instance-aware Prompt tuning (VIP). On the one hand, TCP integrates the generated class-aware prompts into the text encoder to produce a dynamic class-aware classifier, improving generalization to unseen domains. On the other hand, VIP uses the instance-aware prompt to generate a dynamic visual embedding for each instance, thereby enhancing the discriminative capability of the visual embeddings. Comprehensive evaluations demonstrate that BIP serves as a plug-and-play module that is easily integrated with existing methods and achieves superior performance on 15 benchmarks across four tasks.
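The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of the bi-modality prompt-generation idea it describes: it assumes a frozen CLIP-style two-tower backbone, small MLP projectors standing in for TKE and VKE, and generated prompts prepended to each encoder's token sequence. The module names (KnowledgeEmbedding, BIPHead, MeanPoolEncoder), prompt lengths, and dimensions are illustrative assumptions, not the authors' code.

```python
# Minimal, illustrative sketch of BIP-style bi-modality prompt generation.
# NOT the authors' implementation: the MLP projectors, prompt length, toy
# mean-pooling encoders, and all module names are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeEmbedding(nn.Module):
    """Projects a prior-knowledge vector into M prompt tokens.

    Instantiated twice: as TKE it maps class-aware textual knowledge (e.g., a
    frozen text feature of the class name) to a class-aware prompt; as VKE it
    maps an instance's visual feature to an instance-aware prompt.
    """

    def __init__(self, feat_dim: int, prompt_dim: int, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 2, n_tokens * prompt_dim),
        )

    def forward(self, knowledge: torch.Tensor) -> torch.Tensor:
        # knowledge: (N, feat_dim) -> prompts: (N, n_tokens, prompt_dim)
        return self.proj(knowledge).view(knowledge.size(0), self.n_tokens, -1)


class MeanPoolEncoder(nn.Module):
    """Toy stand-in for a frozen CLIP tower: mean-pool the tokens, then project."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (N, T, D) -> (N, D)
        return self.proj(tokens.mean(dim=1))


class BIPHead(nn.Module):
    """Combines TCP (class-aware text prompts) and VIP (instance-aware visual
    prompts) into a CLIP-style cosine classifier."""

    def __init__(self, text_encoder: nn.Module, image_encoder: nn.Module, dim: int = 512):
        super().__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        self.tke = KnowledgeEmbedding(dim, dim)  # class-aware prompts (TCP)
        self.vke = KnowledgeEmbedding(dim, dim)  # instance-aware prompts (VIP)
        self.logit_scale = nn.Parameter(torch.tensor(4.6))  # ~log(100), CLIP-style

    def forward(self, image_tokens, image_prior, class_tokens, class_prior):
        # TCP: prepend class-aware prompts to each class's token sequence and
        # encode them into a dynamic per-class classifier weight.
        class_prompts = self.tke(class_prior)                               # (C, M, D)
        text_feat = self.text_encoder(torch.cat([class_prompts, class_tokens], dim=1))
        # VIP: prepend instance-aware prompts to each image's patch tokens,
        # yielding a dynamic visual embedding for every instance.
        inst_prompts = self.vke(image_prior)                                # (B, M, D)
        img_feat = self.image_encoder(torch.cat([inst_prompts, image_tokens], dim=1))
        # Cosine-similarity logits between visual embeddings and the classifier.
        img_feat = F.normalize(img_feat, dim=-1)
        text_feat = F.normalize(text_feat, dim=-1)
        return self.logit_scale.exp() * img_feat @ text_feat.t()


if __name__ == "__main__":
    B, C, D = 8, 10, 512
    model = BIPHead(MeanPoolEncoder(D), MeanPoolEncoder(D), dim=D)
    image_tokens = torch.randn(B, 49, D)  # patch tokens of each image
    image_prior = torch.randn(B, D)       # instance-level prior (e.g., frozen image feature)
    class_tokens = torch.randn(C, 16, D)  # embedded class-name/template tokens
    class_prior = torch.randn(C, D)       # class-level prior (e.g., frozen text feature)
    logits = model(image_tokens, image_prior, class_tokens, class_prior)
    print(logits.shape)                   # torch.Size([8, 10])
```

The point of the sketch is the design choice the abstract emphasizes: prompts are generated from per-class and per-instance prior features rather than learned as fixed free vectors, so both the classifier (via TCP) and the visual embedding (via VIP) can adapt to each test input.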