{"title":"视觉语言模型的混合提示学习","authors":"Yu Du, Tong Niu, Rong Zhao","doi":"arxiv-2409.12011","DOIUrl":null,"url":null,"abstract":"As powerful pre-trained vision-language models (VLMs) like CLIP gain\nprominence, numerous studies have attempted to combine VLMs for downstream\ntasks. Among these, prompt learning has been validated as an effective method\nfor adapting to new tasks, which only requiring a small number of parameters.\nHowever, current prompt learning methods face two challenges: first, a single\nsoft prompt struggles to capture the diverse styles and patterns within a\ndataset; second, fine-tuning soft prompts is prone to overfitting. To address\nthese challenges, we propose a mixture of soft prompt learning method\nincorporating a routing module. This module is able to capture a dataset's\nvaried styles and dynamically selects the most suitable prompts for each\ninstance. Additionally, we introduce a novel gating mechanism to ensure the\nrouter selects prompts based on their similarity to hard prompt templates,\nwhich both retaining knowledge from hard prompts and improving selection\naccuracy. We also implement semantically grouped text-level supervision,\ninitializing each soft prompt with the token embeddings of manually designed\ntemplates from its group and applied a contrastive loss between the resulted\ntext feature and hard prompt encoded text feature. This supervision ensures\nthat the text features derived from soft prompts remain close to those from\ntheir corresponding hard prompts, preserving initial knowledge and mitigating\noverfitting. Our method has been validated on 11 datasets, demonstrating\nevident improvements in few-shot learning, domain generalization, and\nbase-to-new generalization scenarios compared to existing baselines. The code\nwill be available at \\url{https://anonymous.4open.science/r/mocoop-6387}","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Mixture of Prompt Learning for Vision Language Models\",\"authors\":\"Yu Du, Tong Niu, Rong Zhao\",\"doi\":\"arxiv-2409.12011\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As powerful pre-trained vision-language models (VLMs) like CLIP gain\\nprominence, numerous studies have attempted to combine VLMs for downstream\\ntasks. Among these, prompt learning has been validated as an effective method\\nfor adapting to new tasks, which only requiring a small number of parameters.\\nHowever, current prompt learning methods face two challenges: first, a single\\nsoft prompt struggles to capture the diverse styles and patterns within a\\ndataset; second, fine-tuning soft prompts is prone to overfitting. To address\\nthese challenges, we propose a mixture of soft prompt learning method\\nincorporating a routing module. This module is able to capture a dataset's\\nvaried styles and dynamically selects the most suitable prompts for each\\ninstance. Additionally, we introduce a novel gating mechanism to ensure the\\nrouter selects prompts based on their similarity to hard prompt templates,\\nwhich both retaining knowledge from hard prompts and improving selection\\naccuracy. 
We also implement semantically grouped text-level supervision,\\ninitializing each soft prompt with the token embeddings of manually designed\\ntemplates from its group and applied a contrastive loss between the resulted\\ntext feature and hard prompt encoded text feature. This supervision ensures\\nthat the text features derived from soft prompts remain close to those from\\ntheir corresponding hard prompts, preserving initial knowledge and mitigating\\noverfitting. Our method has been validated on 11 datasets, demonstrating\\nevident improvements in few-shot learning, domain generalization, and\\nbase-to-new generalization scenarios compared to existing baselines. The code\\nwill be available at \\\\url{https://anonymous.4open.science/r/mocoop-6387}\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.12011\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12011","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Mixture of Prompt Learning for Vision Language Models
As powerful pre-trained vision-language models (VLMs) like CLIP gain
prominence, numerous studies have attempted to adapt VLMs to downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks while requiring only a small number of parameters.
However, current prompt learning methods face two challenges: first, a single
soft prompt struggles to capture the diverse styles and patterns within a
dataset; second, fine-tuning soft prompts is prone to overfitting. To address
these challenges, we propose a mixture-of-soft-prompts learning method that incorporates a routing module. This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism to ensure the
router selects prompts based on their similarity to hard prompt templates,
which both retains knowledge from the hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision,
initializing each soft prompt with the token embeddings of manually designed
templates from its group, and apply a contrastive loss between the resulting text features and the hard-prompt-encoded text features. This supervision ensures
that the text features derived from soft prompts remain close to those from
their corresponding hard prompts, preserving initial knowledge and mitigating
overfitting. Our method has been validated on 11 datasets, demonstrating clear improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. The code will be available at https://anonymous.4open.science/r/mocoop-6387
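To make the described components more concrete, below is a minimal, hypothetical PyTorch-style sketch of how a routing module with hard-prompt-similarity gating and a contrastive alignment loss might be wired together. It is not the authors' released implementation: all class names, shapes, the soft-mixture routing, and the use of image/hard-prompt feature similarity as the gating signal are illustrative assumptions based only on the abstract.

```python
# Hypothetical sketch of a mixture-of-soft-prompts module with a routing/gating
# step and a contrastive alignment loss against hard-prompt text features.
# Names, shapes, and the frozen-encoder stand-ins are assumptions, not the
# paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfSoftPrompts(nn.Module):
    def __init__(self, num_prompts, prompt_len, embed_dim, feat_dim):
        super().__init__()
        # One learnable soft prompt per group; the paper initializes each from
        # the token embeddings of a manually designed (hard) template.
        self.soft_prompts = nn.Parameter(
            torch.randn(num_prompts, prompt_len, embed_dim) * 0.02
        )
        # Router maps an image feature to scores over the soft prompts.
        self.router = nn.Linear(feat_dim, num_prompts)

    def forward(self, image_feat, hard_prompt_feats):
        """
        image_feat:        (B, feat_dim)  frozen image-encoder features
        hard_prompt_feats: (N, feat_dim)  frozen text features of the hard templates
        """
        router_logits = self.router(image_feat)  # (B, N)
        # Gating signal (assumed form): similarity between the image feature and
        # each hard prompt's text feature, biasing selection toward prompts whose
        # hard template best matches the instance.
        sim = F.normalize(image_feat, dim=-1) @ F.normalize(hard_prompt_feats, dim=-1).T
        gate = torch.softmax(router_logits + sim, dim=-1)  # (B, N)
        # Instance-specific prompt: soft mixture (convex combination) of prompts.
        prompts = torch.einsum("bn,nld->bld", gate, self.soft_prompts)
        return prompts, gate


def alignment_loss(soft_text_feats, hard_text_feats, temperature=0.07):
    """Contrastive loss keeping each soft-prompt text feature close to the
    text feature of its corresponding hard template (same index = positive)."""
    soft = F.normalize(soft_text_feats, dim=-1)
    hard = F.normalize(hard_text_feats, dim=-1)
    logits = soft @ hard.T / temperature  # (N, N)
    targets = torch.arange(soft.size(0), device=soft.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    B, N, L, D, C = 4, 8, 16, 512, 512
    module = MixtureOfSoftPrompts(num_prompts=N, prompt_len=L, embed_dim=D, feat_dim=C)
    image_feat = torch.randn(B, C)   # stand-in for frozen CLIP image features
    hard_feats = torch.randn(N, C)   # stand-in for encoded hard templates
    prompts, gate = module(image_feat, hard_feats)
    print(prompts.shape, gate.shape)  # (4, 16, 512), (4, 8)
    soft_feats = torch.randn(N, C, requires_grad=True)  # stand-in for soft-prompt text features
    print(float(alignment_loss(soft_feats, hard_feats)))
```

The sketch uses a soft mixture over prompts rather than a hard top-k selection; either choice is compatible with the routing idea described in the abstract, and the contrastive term is what keeps each soft prompt's encoded text feature anchored to its hard-template counterpart.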