FedVLP: Visual-aware latent prompt generation for Multimodal Federated Learning
Hao Pan, Xiaoli Zhao, Yuchen Jiang, Lipeng He, Bingquan Wang, Yincan Shu
Computer Vision and Image Understanding, Volume 259, Article 104442. Published 2025-07-08. DOI: 10.1016/j.cviu.2025.104442
Abstract
Recent studies indicate that prompt learning based on CLIP-like models excels at a variety of image recognition and detection tasks; consequently, it has been applied in Multimodal Federated Learning (MMFL). Federated Prompt Learning (FPL), a technical branch of MMFL, lets clients and servers exchange prompts rather than model parameters during communication, addressing challenges such as data heterogeneity and high training costs. Many existing FPL methods rely heavily on pre-trained vision-language models, making it difficult for them to handle new, real-world specialized-domain data. To further boost the generalization ability of FPL without compromising client personalization, we propose FedVLP, a novel framework that generates prompts guided by visual semantics to better handle specialized and small-scale data. In our approach, each client generates visual-aware latent prompts using a Fusion Encoder and an IE-Module, enabling the learning of fine-grained knowledge. Through federated computation, clients collaboratively maintain a global prompt, allowing the learning of coarse-grained knowledge. FedVLP removes the dependency on manually designed prompt templates and demonstrates superior performance across seven datasets, including CIFAR-10, CIFAR-100, Caltech-101, and FLIndustry-100.
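To make the communication pattern concrete, below is a minimal PyTorch sketch of one FPL round in which clients exchange prompts rather than backbone weights. This is an illustration under assumptions, not the paper's implementation: the abstract does not specify the internals of the Fusion Encoder or IE-Module, so the `FusionEncoder` stand-in, the prompt and feature dimensions, and the mean aggregation rule are all hypothetical.

```python
import torch
import torch.nn as nn

# Illustrative sizes; the paper does not state these.
PROMPT_LEN, PROMPT_DIM, VIS_DIM = 16, 512, 512

class FusionEncoder(nn.Module):
    """Hypothetical stand-in for the paper's Fusion Encoder / IE-Module:
    conditions the shared prompt on a pooled local image feature to
    produce a visual-aware latent prompt."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(PROMPT_DIM + VIS_DIM, PROMPT_DIM)

    def forward(self, global_prompt: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # global_prompt: (PROMPT_LEN, PROMPT_DIM); visual_feat: (VIS_DIM,)
        vis = visual_feat.unsqueeze(0).expand(PROMPT_LEN, -1)
        return self.proj(torch.cat([global_prompt, vis], dim=-1))

def server_aggregate(client_prompts: list[torch.Tensor]) -> torch.Tensor:
    """Server round: average the clients' latent prompts (FedAvg-style).
    Only prompts cross the network; model parameters stay local."""
    return torch.stack(client_prompts).mean(dim=0)

# One communication round with three simulated clients.
global_prompt = torch.zeros(PROMPT_LEN, PROMPT_DIM)
clients = [FusionEncoder() for _ in range(3)]
local_prompts = [enc(global_prompt, torch.randn(VIS_DIM)) for enc in clients]
global_prompt = server_aggregate([p.detach() for p in local_prompts])
```

The intent of the sketch is the split the abstract describes: the locally generated prompt captures fine-grained, client-specific visual knowledge, while the aggregated global prompt carries coarse-grained knowledge shared across clients.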
About the Journal:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems