Xingwei Deng, Yangtao Wang, Yanzhao Xie, Xiaocui Li, Maobin Tang, Meie Fang, Wensheng Zhang
{"title":"Prompt-affinity multi-modal class centroids for unsupervised domain adaption","authors":"Xingwei Deng , Yangtao Wang , Yanzhao Xie , Xiaocui Li , Maobin Tang , Meie Fang , Wensheng Zhang","doi":"10.1016/j.patcog.2025.112095","DOIUrl":null,"url":null,"abstract":"<div><div>In recent years, the advancements in large vision-language models (VLMs) like CLIP have sparked a renewed interest in leveraging the prompt learning mechanism to preserve semantic consistency between source and target domains in unsupervised domain adaption (UDA). While these approaches show promising results, they encounter fundamental limitations when quantifying the similarity between source and target domain data, primarily stemming from the redundant and modality-missing class centroids. To address these limitations, we propose <u><strong>P</strong></u>rompt-affinity <u><strong>M</strong></u>ulti-modal <u><strong>C</strong></u>lass <u><strong>C</strong></u>entroids for UDA (termed as PMCC). Firstly, we fuse the text class centroids (directly generated from the text encoder of CLIP with manual prompts for each class) and image class centroids (generated from the image encoder of CLIP for each class based on source domain images) to yield the multi-modal class centroids. Secondly, we conduct the cross-attention operation between each source or target domain image and these multi-modal class centroids. In this way, these class centroids that contain rich semantic information of each class will serve as a bridge to effectively measure the semantic similarity between different domains. Finally, we design a logit bias head and employ a multi-modal prompt learning mechanism to accurately predict the true class of each image for both source and target domains. We conduct extensive experiments on 4 popular UDA datasets including Office-31, Office-Home, VisDA-2017, and DomainNet. The experimental results validate our PMCC achieves higher performance with lower model complexity than the state-of-the-art (SOTA) UDA methods. The code of this project is available at GitHub: <span><span>https://github.com/246dxw/PMCC</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"170 ","pages":"Article 112095"},"PeriodicalIF":7.5000,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325007551","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Cited by: 0
Abstract
In recent years, advances in large vision-language models (VLMs) such as CLIP have renewed interest in leveraging prompt learning to preserve semantic consistency between source and target domains in unsupervised domain adaptation (UDA). While these approaches show promising results, they face fundamental limitations when quantifying the similarity between source and target domain data, primarily because their class centroids are redundant and miss one modality. To address these limitations, we propose Prompt-affinity Multi-modal Class Centroids for UDA (termed PMCC). First, we fuse text class centroids (generated directly by the CLIP text encoder from manual prompts for each class) and image class centroids (generated by the CLIP image encoder for each class from source domain images) to obtain multi-modal class centroids. Second, we apply a cross-attention operation between each source or target domain image and these multi-modal class centroids. In this way, the class centroids, which carry rich semantic information for each class, serve as a bridge for effectively measuring semantic similarity across domains. Finally, we design a logit bias head and employ a multi-modal prompt learning mechanism to accurately predict the true class of each image in both the source and target domains. We conduct extensive experiments on four popular UDA datasets: Office-31, Office-Home, VisDA-2017, and DomainNet. The experimental results show that PMCC achieves higher performance with lower model complexity than state-of-the-art (SOTA) UDA methods. The code for this project is available on GitHub: https://github.com/246dxw/PMCC.
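The sketch below illustrates the centroid-fusion and cross-attention idea described in the abstract, assuming precomputed CLIP features. The names (fuse_centroids, CentroidCrossAttention), the simple averaging fusion, and the additive per-class logit bias are assumptions for illustration only, not the authors' implementation; refer to the linked GitHub repository for the actual PMCC code.

```python
# Illustrative sketch only: multi-modal class centroids + cross-attention.
# All names and design choices here are assumptions, not the PMCC codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse_centroids(text_centroids: torch.Tensor,
                   image_centroids: torch.Tensor) -> torch.Tensor:
    """Fuse text and image class centroids, both of shape [num_classes, dim].

    A simple choice is to L2-normalize each modality and average; the paper
    may use a different fusion scheme.
    """
    text_centroids = F.normalize(text_centroids, dim=-1)
    image_centroids = F.normalize(image_centroids, dim=-1)
    return F.normalize(0.5 * (text_centroids + image_centroids), dim=-1)


class CentroidCrossAttention(nn.Module):
    """Each image feature queries the multi-modal class centroids."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Learnable per-class additive bias: one plausible reading of the
        # "logit bias head" mentioned in the abstract.
        self.logit_bias = nn.Parameter(torch.zeros(num_classes))

    def forward(self, image_feats: torch.Tensor,
                centroids: torch.Tensor) -> torch.Tensor:
        # image_feats: [batch, dim], centroids: [num_classes, dim]
        q = image_feats.unsqueeze(1)                               # [batch, 1, dim]
        kv = centroids.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        attended, _ = self.attn(q, kv, kv)                         # [batch, 1, dim]
        attended = F.normalize(attended.squeeze(1), dim=-1)
        logits = attended @ F.normalize(centroids, dim=-1).t()     # cosine logits
        return logits + self.logit_bias                            # [batch, num_classes]


if __name__ == "__main__":
    num_classes, dim, batch = 31, 512, 4        # e.g. Office-31 has 31 classes
    text_c = torch.randn(num_classes, dim)      # from CLIP text encoder + prompts
    image_c = torch.randn(num_classes, dim)     # class-wise mean of source features
    feats = torch.randn(batch, dim)             # CLIP image features (source/target)

    centroids = fuse_centroids(text_c, image_c)
    head = CentroidCrossAttention(dim, num_classes)
    print(head(feats, centroids).shape)         # torch.Size([4, 31])
```

In this toy usage, the fused centroids act as keys and values while each image feature acts as the query, so the attended representation can be scored against every class; the per-class bias is then added to the resulting logits.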
Journal Introduction
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.