FedVLP: Visual-aware latent prompt generation for Multimodal Federated Learning
Hao Pan, Xiaoli Zhao, Yuchen Jiang, Lipeng He, Bingquan Wang, Yincan Shu
Computer Vision and Image Understanding, Volume 259, Article 104442. Published 2025-07-08. DOI: 10.1016/j.cviu.2025.104442
Abstract
Recent studies indicate that prompt learning based on CLIP-like models excels at a variety of image recognition and detection tasks; consequently, it has been applied in Multimodal Federated Learning (MMFL). Federated Prompt Learning (FPL), a technical branch of MMFL, lets clients and servers exchange prompts rather than model parameters during communication, addressing challenges such as data heterogeneity and high training costs. Many existing FPL methods rely heavily on pre-trained vision-language models, making it difficult for them to handle new, real-world specialized-domain data. To further boost the generalization ability of FPL without compromising client personalization, we propose FedVLP, a novel framework that generates prompts guided by visual semantics to better handle specialized and small-scale data. In our approach, each client generates visual-aware latent prompts using a Fusion Encoder and an IE-Module, enabling the learning of fine-grained knowledge. Through federated computation, clients collaboratively maintain a global prompt, allowing the learning of coarse-grained knowledge. FedVLP removes the dependency on manually designed prompt templates and demonstrates superior performance across seven datasets, including CIFAR-10, CIFAR-100, Caltech-101, and FLIndustry-100.
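To make the communication pattern concrete, below is a minimal PyTorch sketch of one FPL round in which clients exchange prompts rather than backbone weights. This is an illustration under assumptions, not the paper's implementation: the abstract does not specify the internals of the Fusion Encoder or IE-Module, so the `FusionEncoder` stand-in, the prompt and feature dimensions, and the mean aggregation rule are all hypothetical.

```python
import torch
import torch.nn as nn

# Illustrative sizes; the paper does not state these.
PROMPT_LEN, PROMPT_DIM, VIS_DIM = 16, 512, 512

class FusionEncoder(nn.Module):
    """Hypothetical stand-in for the paper's Fusion Encoder / IE-Module:
    conditions the shared prompt on a pooled local image feature to
    produce a visual-aware latent prompt."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(PROMPT_DIM + VIS_DIM, PROMPT_DIM)

    def forward(self, global_prompt: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # global_prompt: (PROMPT_LEN, PROMPT_DIM); visual_feat: (VIS_DIM,)
        vis = visual_feat.unsqueeze(0).expand(PROMPT_LEN, -1)
        return self.proj(torch.cat([global_prompt, vis], dim=-1))

def server_aggregate(client_prompts: list[torch.Tensor]) -> torch.Tensor:
    """Server round: average the clients' latent prompts (FedAvg-style).
    Only prompts cross the network; model parameters stay local."""
    return torch.stack(client_prompts).mean(dim=0)

# One communication round with three simulated clients.
global_prompt = torch.zeros(PROMPT_LEN, PROMPT_DIM)
clients = [FusionEncoder() for _ in range(3)]
local_prompts = [enc(global_prompt, torch.randn(VIS_DIM)) for enc in clients]
global_prompt = server_aggregate([p.detach() for p in local_prompts])
```

The intent of the sketch is the split the abstract describes: the locally generated prompt captures fine-grained, client-specific visual knowledge, while the aggregated global prompt carries coarse-grained knowledge shared across clients.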
About the Journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems