FedVLP: Visual-aware latent prompt generation for Multimodal Federated Learning

Impact Factor: 3.5 · CAS Tier 3 (Computer Science) · JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Hao Pan, Xiaoli Zhao, Yuchen Jiang, Lipeng He, Bingquan Wang, Yincan Shu
{"title":"FedVLP: Visual-aware latent prompt generation for Multimodal Federated Learning","authors":"Hao Pan ,&nbsp;Xiaoli Zhao ,&nbsp;Yuchen Jiang ,&nbsp;Lipeng He ,&nbsp;Bingquan Wang ,&nbsp;Yincan Shu","doi":"10.1016/j.cviu.2025.104442","DOIUrl":null,"url":null,"abstract":"<div><div>Recent studies indicate that prompt learning based on CLIP-like models excels in a variety of image recognition and detection tasks, consequently, it has been applied in Multimodal Federated Learning (MMFL). Federated Prompt Learning (FPL), as a technical branch of MMFL, enables clients and servers to exchange prompts rather than model parameters during communication to address challenges such as data heterogeneity and high training costs. Many existing FPL methods rely heavily on pre-trained visual-language models, making it difficult for them to handle new and real specialized domain data. To further boost the generalization ability of FPL without compromising the personalization of clients, we propose a novel framework that generates prompts guided by visual semantics to better handle specialized and small-scale data. In our approach, each client generates visual-aware latent prompts using a Fusion Encoder and an IE-Module, enabling the learning of fine-grained knowledge. Through federated computation, clients collaboratively maintain a global prompt, allowing the learning of coarse-grained knowledge. FedVLP removes the dependency on manually designed prompt templates and demonstrates superior performance across seven datasets, including CIFAR-10, CIFAR-100, Caltech-101, FLIndustry-100, and others.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104442"},"PeriodicalIF":3.5000,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001651","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Recent studies indicate that prompt learning based on CLIP-like models excels in a variety of image recognition and detection tasks; consequently, it has been applied in Multimodal Federated Learning (MMFL). Federated Prompt Learning (FPL), a technical branch of MMFL, lets clients and the server exchange prompts rather than model parameters during communication, addressing challenges such as data heterogeneity and high training costs. Many existing FPL methods rely heavily on pre-trained vision-language models, making it difficult for them to handle novel, real-world specialized-domain data. To further boost the generalization ability of FPL without compromising client personalization, we propose a novel framework that generates prompts guided by visual semantics to better handle specialized and small-scale data. In our approach, each client generates visual-aware latent prompts using a Fusion Encoder and an IE-Module, enabling the learning of fine-grained knowledge. Through federated computation, clients collaboratively maintain a global prompt, allowing the learning of coarse-grained knowledge. FedVLP removes the dependency on manually designed prompt templates and demonstrates superior performance across seven datasets, including CIFAR-10, CIFAR-100, Caltech-101, FLIndustry-100, and others.
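To make the communication pattern in the abstract concrete, here is a minimal sketch of one federated round in which clients upload learned prompt tensors instead of model weights and the server merges them into the shared global prompt. The FedAvg-style sample-count weighting, tensor shapes, and function name are illustrative assumptions; the internals of FedVLP's Fusion Encoder and IE-Module are not reproduced here.

import torch

# Hypothetical server-side step: merge per-client prompts into one global
# prompt, weighting each client by its local sample count (FedAvg-style;
# the actual aggregation rule used by FedVLP is not given in the abstract).
def aggregate_global_prompt(client_prompts, client_sizes):
    total = float(sum(client_sizes))
    global_prompt = torch.zeros_like(client_prompts[0])
    for prompt, n in zip(client_prompts, client_sizes):
        global_prompt += (n / total) * prompt
    return global_prompt

# Example round: three clients, each holding a 16-token, 512-dim prompt.
prompts = [torch.randn(16, 512) for _ in range(3)]
sizes = [1200, 800, 2000]
global_prompt = aggregate_global_prompt(prompts, sizes)
print(global_prompt.shape)  # torch.Size([16, 512])

Only the small prompt tensors cross the network here, which is what lets FPL methods reduce communication cost relative to exchanging full model parameters.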
Source journal
Computer Vision and Image Understanding (Engineering & Technology – Engineering: Electrical & Electronic)
CiteScore: 7.80
Self-citation rate: 4.40%
Annual output: 112 articles
Review time: 79 days
Journal introduction: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: theory • early vision • data structures and representations • shape • range • motion • matching and recognition • architecture and languages • vision systems.