{"title":"混合粗粒度和细粒度的视觉语言模型提示调优","authors":"Yansheng Gao , Zixi Zhu , Shengsheng Wang","doi":"10.1016/j.patcog.2025.112074","DOIUrl":null,"url":null,"abstract":"<div><div>Visual Language Models (VLMs) exhibit impressive performance across various tasks but often suffer from degradation of prior knowledge when transferred to downstream tasks with limited computational samples. Prompt tuning methods emerge as an effective solution to mitigate this issue. However, most existing approaches solely rely on coarse-grained text prompt or fine-grained text prompt, which may limit the discriminative and generalization capabilities of VLMs. To address these limitations, we propose <strong>Mixture of Coarse and Fine-grained Prompt Tuning (MCFPT)</strong>, a novel method that integrates both coarse and fine-grained prompts to enhance the performance of VLMs. Inspired by the Mixture-of-Experts (MoE) mechanism, MCFPT incorporates a <strong>Mixed Fusion Module (MFM)</strong> to fuse and select coarse domain-shared text feature and fine-grained category-discriminative text feature to get the mixed feature. Additionally, a <strong>Dynamic Refinement Adapter (DRA)</strong> is introduced to adjust category distributions, ensuring consistency between refined and mixed text features. These components collectively improve the generalization and discriminative power of VLMs. Extensive experiments across four scenarios-base-to-new, few-shot classification, domain generalization, and cross-domain classification-demonstrate that MCFPT achieves exceptional performance compared to state-of-the-art methods, with significant improvements in HM scores across multiple datasets. Our findings highlight MCFPT as a robust approach for improving the adaptability and efficiency of Visual Language Models in diverse application domains.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"170 ","pages":"Article 112074"},"PeriodicalIF":7.5000,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Mixture of coarse and fine-grained prompt tuning for vision-language model\",\"authors\":\"Yansheng Gao , Zixi Zhu , Shengsheng Wang\",\"doi\":\"10.1016/j.patcog.2025.112074\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Visual Language Models (VLMs) exhibit impressive performance across various tasks but often suffer from degradation of prior knowledge when transferred to downstream tasks with limited computational samples. Prompt tuning methods emerge as an effective solution to mitigate this issue. However, most existing approaches solely rely on coarse-grained text prompt or fine-grained text prompt, which may limit the discriminative and generalization capabilities of VLMs. To address these limitations, we propose <strong>Mixture of Coarse and Fine-grained Prompt Tuning (MCFPT)</strong>, a novel method that integrates both coarse and fine-grained prompts to enhance the performance of VLMs. Inspired by the Mixture-of-Experts (MoE) mechanism, MCFPT incorporates a <strong>Mixed Fusion Module (MFM)</strong> to fuse and select coarse domain-shared text feature and fine-grained category-discriminative text feature to get the mixed feature. Additionally, a <strong>Dynamic Refinement Adapter (DRA)</strong> is introduced to adjust category distributions, ensuring consistency between refined and mixed text features. 
These components collectively improve the generalization and discriminative power of VLMs. Extensive experiments across four scenarios-base-to-new, few-shot classification, domain generalization, and cross-domain classification-demonstrate that MCFPT achieves exceptional performance compared to state-of-the-art methods, with significant improvements in HM scores across multiple datasets. Our findings highlight MCFPT as a robust approach for improving the adaptability and efficiency of Visual Language Models in diverse application domains.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"170 \",\"pages\":\"Article 112074\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2025-07-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320325007344\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325007344","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Mixture of coarse and fine-grained prompt tuning for vision-language model
Visual Language Models (VLMs) exhibit impressive performance across various tasks but often suffer from degradation of prior knowledge when transferred to downstream tasks with limited training samples. Prompt tuning has emerged as an effective way to mitigate this issue. However, most existing approaches rely solely on either coarse-grained or fine-grained text prompts, which may limit the discriminative and generalization capabilities of VLMs. To address these limitations, we propose Mixture of Coarse and Fine-grained Prompt Tuning (MCFPT), a novel method that integrates both coarse and fine-grained prompts to enhance the performance of VLMs. Inspired by the Mixture-of-Experts (MoE) mechanism, MCFPT incorporates a Mixed Fusion Module (MFM) that fuses and selects coarse-grained domain-shared text features and fine-grained category-discriminative text features to produce a mixed feature. Additionally, a Dynamic Refinement Adapter (DRA) is introduced to adjust category distributions, ensuring consistency between the refined and mixed text features. These components collectively improve the generalization and discriminative power of VLMs. Extensive experiments across four scenarios (base-to-new generalization, few-shot classification, domain generalization, and cross-domain classification) demonstrate that MCFPT outperforms state-of-the-art methods, with significant improvements in harmonic mean (HM) scores across multiple datasets. Our findings highlight MCFPT as a robust approach for improving the adaptability and efficiency of Visual Language Models in diverse application domains.
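The abstract only sketches the architecture, so the following is a minimal, illustrative PyTorch sketch of an MoE-style gated fusion of a coarse-grained (domain-shared) and a fine-grained (category-discriminative) text feature, in the spirit of the Mixed Fusion Module described above. All names and dimensions here (MixedFusionModule, the router MLP, coarse_feat, fine_feat, dim=512) are assumptions made for illustration and do not reproduce the authors' implementation.

# Illustrative sketch only: MoE-style gated fusion of coarse-grained and
# fine-grained text features. Names and dimensions are assumptions, not the
# authors' implementation of MCFPT.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedFusionModule(nn.Module):
    """Gate, per class, between a coarse (domain-shared) text feature and a
    fine-grained (category-discriminative) text feature."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Small router that scores the two "experts" from their concatenation.
        self.router = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, 2),
        )

    def forward(self, coarse_feat: torch.Tensor, fine_feat: torch.Tensor) -> torch.Tensor:
        # coarse_feat, fine_feat: (num_classes, dim) text features obtained
        # from coarse-grained and fine-grained prompts, respectively.
        weights = F.softmax(
            self.router(torch.cat([coarse_feat, fine_feat], dim=-1)), dim=-1
        )
        # Weighted mixture of the two experts yields the mixed text feature.
        mixed = weights[:, 0:1] * coarse_feat + weights[:, 1:2] * fine_feat
        return F.normalize(mixed, dim=-1)

if __name__ == "__main__":
    num_classes, dim = 10, 512
    mfm = MixedFusionModule(dim)
    coarse = F.normalize(torch.randn(num_classes, dim), dim=-1)
    fine = F.normalize(torch.randn(num_classes, dim), dim=-1)
    mixed = mfm(coarse, fine)
    print(mixed.shape)  # torch.Size([10, 512])

In such a sketch, the normalized mixed feature could serve as the per-class text embedding that is compared against image embeddings via cosine similarity, which is the usual way prompt-tuned VLM classifiers are evaluated.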
Journal Introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.