{"title":"混合粗粒度和细粒度的视觉语言模型提示调优","authors":"Yansheng Gao , Zixi Zhu , Shengsheng Wang","doi":"10.1016/j.patcog.2025.112074","DOIUrl":null,"url":null,"abstract":"<div><div>Visual Language Models (VLMs) exhibit impressive performance across various tasks but often suffer from degradation of prior knowledge when transferred to downstream tasks with limited computational samples. Prompt tuning methods emerge as an effective solution to mitigate this issue. However, most existing approaches solely rely on coarse-grained text prompt or fine-grained text prompt, which may limit the discriminative and generalization capabilities of VLMs. To address these limitations, we propose <strong>Mixture of Coarse and Fine-grained Prompt Tuning (MCFPT)</strong>, a novel method that integrates both coarse and fine-grained prompts to enhance the performance of VLMs. Inspired by the Mixture-of-Experts (MoE) mechanism, MCFPT incorporates a <strong>Mixed Fusion Module (MFM)</strong> to fuse and select coarse domain-shared text feature and fine-grained category-discriminative text feature to get the mixed feature. Additionally, a <strong>Dynamic Refinement Adapter (DRA)</strong> is introduced to adjust category distributions, ensuring consistency between refined and mixed text features. These components collectively improve the generalization and discriminative power of VLMs. Extensive experiments across four scenarios-base-to-new, few-shot classification, domain generalization, and cross-domain classification-demonstrate that MCFPT achieves exceptional performance compared to state-of-the-art methods, with significant improvements in HM scores across multiple datasets. Our findings highlight MCFPT as a robust approach for improving the adaptability and efficiency of Visual Language Models in diverse application domains.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"170 ","pages":"Article 112074"},"PeriodicalIF":7.5000,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Mixture of coarse and fine-grained prompt tuning for vision-language model\",\"authors\":\"Yansheng Gao , Zixi Zhu , Shengsheng Wang\",\"doi\":\"10.1016/j.patcog.2025.112074\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Visual Language Models (VLMs) exhibit impressive performance across various tasks but often suffer from degradation of prior knowledge when transferred to downstream tasks with limited computational samples. Prompt tuning methods emerge as an effective solution to mitigate this issue. However, most existing approaches solely rely on coarse-grained text prompt or fine-grained text prompt, which may limit the discriminative and generalization capabilities of VLMs. To address these limitations, we propose <strong>Mixture of Coarse and Fine-grained Prompt Tuning (MCFPT)</strong>, a novel method that integrates both coarse and fine-grained prompts to enhance the performance of VLMs. Inspired by the Mixture-of-Experts (MoE) mechanism, MCFPT incorporates a <strong>Mixed Fusion Module (MFM)</strong> to fuse and select coarse domain-shared text feature and fine-grained category-discriminative text feature to get the mixed feature. Additionally, a <strong>Dynamic Refinement Adapter (DRA)</strong> is introduced to adjust category distributions, ensuring consistency between refined and mixed text features. 
These components collectively improve the generalization and discriminative power of VLMs. Extensive experiments across four scenarios-base-to-new, few-shot classification, domain generalization, and cross-domain classification-demonstrate that MCFPT achieves exceptional performance compared to state-of-the-art methods, with significant improvements in HM scores across multiple datasets. Our findings highlight MCFPT as a robust approach for improving the adaptability and efficiency of Visual Language Models in diverse application domains.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"170 \",\"pages\":\"Article 112074\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2025-07-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320325007344\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325007344","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Mixture of coarse and fine-grained prompt tuning for vision-language model
Visual Language Models (VLMs) exhibit impressive performance across various tasks but often suffer from degradation of prior knowledge when transferred to downstream tasks with limited training samples. Prompt tuning has emerged as an effective way to mitigate this issue. However, most existing approaches rely solely on either coarse-grained or fine-grained text prompts, which may limit the discriminative and generalization capabilities of VLMs. To address these limitations, we propose Mixture of Coarse and Fine-grained Prompt Tuning (MCFPT), a novel method that integrates both coarse and fine-grained prompts to enhance the performance of VLMs. Inspired by the Mixture-of-Experts (MoE) mechanism, MCFPT incorporates a Mixed Fusion Module (MFM) that fuses and selects coarse-grained domain-shared text features and fine-grained category-discriminative text features to produce a mixed feature. Additionally, a Dynamic Refinement Adapter (DRA) is introduced to adjust category distributions, ensuring consistency between the refined and mixed text features. These components collectively improve the generalization and discriminative power of VLMs. Extensive experiments across four scenarios (base-to-new generalization, few-shot classification, domain generalization, and cross-domain classification) demonstrate that MCFPT outperforms state-of-the-art methods, with significant improvements in harmonic mean (HM) scores across multiple datasets. Our findings highlight MCFPT as a robust approach for improving the adaptability and efficiency of Visual Language Models in diverse application domains.
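The abstract only sketches the architecture, so the following is a minimal, illustrative PyTorch sketch of an MoE-style gated fusion of a coarse-grained (domain-shared) and a fine-grained (category-discriminative) text feature, in the spirit of the Mixed Fusion Module described above. All names and dimensions here (MixedFusionModule, the router MLP, coarse_feat, fine_feat, dim=512) are assumptions made for illustration and do not reproduce the authors' implementation.

# Illustrative sketch only: MoE-style gated fusion of coarse-grained and
# fine-grained text features. Names and dimensions are assumptions, not the
# authors' implementation of MCFPT.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedFusionModule(nn.Module):
    """Gate, per class, between a coarse (domain-shared) text feature and a
    fine-grained (category-discriminative) text feature."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Small router that scores the two "experts" from their concatenation.
        self.router = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, 2),
        )

    def forward(self, coarse_feat: torch.Tensor, fine_feat: torch.Tensor) -> torch.Tensor:
        # coarse_feat, fine_feat: (num_classes, dim) text features obtained
        # from coarse-grained and fine-grained prompts, respectively.
        weights = F.softmax(
            self.router(torch.cat([coarse_feat, fine_feat], dim=-1)), dim=-1
        )
        # Weighted mixture of the two experts yields the mixed text feature.
        mixed = weights[:, 0:1] * coarse_feat + weights[:, 1:2] * fine_feat
        return F.normalize(mixed, dim=-1)

if __name__ == "__main__":
    num_classes, dim = 10, 512
    mfm = MixedFusionModule(dim)
    coarse = F.normalize(torch.randn(num_classes, dim), dim=-1)
    fine = F.normalize(torch.randn(num_classes, dim), dim=-1)
    mixed = mfm(coarse, fine)
    print(mixed.shape)  # torch.Size([10, 512])

In such a sketch, the normalized mixed feature could serve as the per-class text embedding that is compared against image embeddings via cosine similarity, which is the usual way prompt-tuned VLM classifiers are evaluated.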
Journal Introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.