混合粗粒度和细粒度的视觉语言模型提示调优

IF 7.5 1区 计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Yansheng Gao , Zixi Zhu , Shengsheng Wang
{"title":"混合粗粒度和细粒度的视觉语言模型提示调优","authors":"Yansheng Gao ,&nbsp;Zixi Zhu ,&nbsp;Shengsheng Wang","doi":"10.1016/j.patcog.2025.112074","DOIUrl":null,"url":null,"abstract":"<div><div>Visual Language Models (VLMs) exhibit impressive performance across various tasks but often suffer from degradation of prior knowledge when transferred to downstream tasks with limited computational samples. Prompt tuning methods emerge as an effective solution to mitigate this issue. However, most existing approaches solely rely on coarse-grained text prompt or fine-grained text prompt, which may limit the discriminative and generalization capabilities of VLMs. To address these limitations, we propose <strong>Mixture of Coarse and Fine-grained Prompt Tuning (MCFPT)</strong>, a novel method that integrates both coarse and fine-grained prompts to enhance the performance of VLMs. Inspired by the Mixture-of-Experts (MoE) mechanism, MCFPT incorporates a <strong>Mixed Fusion Module (MFM)</strong> to fuse and select coarse domain-shared text feature and fine-grained category-discriminative text feature to get the mixed feature. Additionally, a <strong>Dynamic Refinement Adapter (DRA)</strong> is introduced to adjust category distributions, ensuring consistency between refined and mixed text features. These components collectively improve the generalization and discriminative power of VLMs. Extensive experiments across four scenarios-base-to-new, few-shot classification, domain generalization, and cross-domain classification-demonstrate that MCFPT achieves exceptional performance compared to state-of-the-art methods, with significant improvements in HM scores across multiple datasets. Our findings highlight MCFPT as a robust approach for improving the adaptability and efficiency of Visual Language Models in diverse application domains.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"170 ","pages":"Article 112074"},"PeriodicalIF":7.5000,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Mixture of coarse and fine-grained prompt tuning for vision-language model\",\"authors\":\"Yansheng Gao ,&nbsp;Zixi Zhu ,&nbsp;Shengsheng Wang\",\"doi\":\"10.1016/j.patcog.2025.112074\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Visual Language Models (VLMs) exhibit impressive performance across various tasks but often suffer from degradation of prior knowledge when transferred to downstream tasks with limited computational samples. Prompt tuning methods emerge as an effective solution to mitigate this issue. However, most existing approaches solely rely on coarse-grained text prompt or fine-grained text prompt, which may limit the discriminative and generalization capabilities of VLMs. To address these limitations, we propose <strong>Mixture of Coarse and Fine-grained Prompt Tuning (MCFPT)</strong>, a novel method that integrates both coarse and fine-grained prompts to enhance the performance of VLMs. Inspired by the Mixture-of-Experts (MoE) mechanism, MCFPT incorporates a <strong>Mixed Fusion Module (MFM)</strong> to fuse and select coarse domain-shared text feature and fine-grained category-discriminative text feature to get the mixed feature. Additionally, a <strong>Dynamic Refinement Adapter (DRA)</strong> is introduced to adjust category distributions, ensuring consistency between refined and mixed text features. These components collectively improve the generalization and discriminative power of VLMs. Extensive experiments across four scenarios-base-to-new, few-shot classification, domain generalization, and cross-domain classification-demonstrate that MCFPT achieves exceptional performance compared to state-of-the-art methods, with significant improvements in HM scores across multiple datasets. Our findings highlight MCFPT as a robust approach for improving the adaptability and efficiency of Visual Language Models in diverse application domains.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"170 \",\"pages\":\"Article 112074\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2025-07-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320325007344\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325007344","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

摘要

视觉语言模型(VLMs)在各种任务中表现出令人印象深刻的性能,但当将其转移到具有有限计算样本的下游任务时,往往会受到先验知识退化的影响。提示调优方法是缓解此问题的有效解决方案。然而,大多数现有方法仅依赖于粗粒度文本提示或细粒度文本提示,这可能会限制vlm的判别和泛化能力。为了解决这些限制,我们提出了粗粒度和细粒度的混合提示调优(MCFPT),这是一种集成粗粒度和细粒度提示以提高vlm性能的新方法。MCFPT受混合专家(mix -of- experts, MoE)机制的启发,引入混合融合模块(Mixed Fusion Module, MFM),对粗糙的领域共享文本特征和细粒度的类别区分文本特征进行融合选择,得到混合特征。此外,还引入了一个动态细化适配器(Dynamic refine Adapter, DRA)来调整类别分布,确保细化和混合文本特性之间的一致性。这些组件共同提高了vlm的泛化和判别能力。在四种场景(从基础到新、少样本分类、域泛化和跨域分类)中进行的广泛实验表明,与最先进的方法相比,MCFPT取得了卓越的性能,在多个数据集上的HM分数有了显著提高。我们的研究结果强调了MCFPT作为一种强大的方法,可以提高视觉语言模型在不同应用领域的适应性和效率。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Mixture of coarse and fine-grained prompt tuning for vision-language model
Visual Language Models (VLMs) exhibit impressive performance across various tasks but often suffer from degradation of prior knowledge when transferred to downstream tasks with limited computational samples. Prompt tuning methods emerge as an effective solution to mitigate this issue. However, most existing approaches solely rely on coarse-grained text prompt or fine-grained text prompt, which may limit the discriminative and generalization capabilities of VLMs. To address these limitations, we propose Mixture of Coarse and Fine-grained Prompt Tuning (MCFPT), a novel method that integrates both coarse and fine-grained prompts to enhance the performance of VLMs. Inspired by the Mixture-of-Experts (MoE) mechanism, MCFPT incorporates a Mixed Fusion Module (MFM) to fuse and select coarse domain-shared text feature and fine-grained category-discriminative text feature to get the mixed feature. Additionally, a Dynamic Refinement Adapter (DRA) is introduced to adjust category distributions, ensuring consistency between refined and mixed text features. These components collectively improve the generalization and discriminative power of VLMs. Extensive experiments across four scenarios-base-to-new, few-shot classification, domain generalization, and cross-domain classification-demonstrate that MCFPT achieves exceptional performance compared to state-of-the-art methods, with significant improvements in HM scores across multiple datasets. Our findings highlight MCFPT as a robust approach for improving the adaptability and efficiency of Visual Language Models in diverse application domains.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Pattern Recognition
Pattern Recognition 工程技术-工程:电子与电气
CiteScore
14.40
自引率
16.20%
发文量
683
审稿时长
5.6 months
期刊介绍: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信