Image and Vision Computing: Latest Articles

Memory-MambaNav: Enhancing object-goal navigation through integration of spatial–temporal scanning with state space models
IF 4.2 · CAS Zone 3 · Computer Science
Image and Vision Computing Pub Date: 2025-03-27 DOI: 10.1016/j.imavis.2025.105522
Leyuan Sun , Yusuke Yoshiyasu
{"title":"Memory-MambaNav: Enhancing object-goal navigation through integration of spatial–temporal scanning with state space models","authors":"Leyuan Sun ,&nbsp;Yusuke Yoshiyasu","doi":"10.1016/j.imavis.2025.105522","DOIUrl":"10.1016/j.imavis.2025.105522","url":null,"abstract":"<div><div>Object-goal Navigation (ObjectNav) involves locating a specified target object using a textual command combined with semantic understanding in an unknown environment. This requires the embodied agent to have advanced spatial and temporal comprehension about environment during navigation. While earlier approaches focus on spatial modeling, they either do not utilize episodic temporal memory (e.g., keeping track of explored and unexplored spaces) or are computationally prohibitive, as long-horizon memory knowledge is resource-intensive in both storage and training. To address this issue, this paper introduces the Memory-MambaNav model, which employs multiple Mamba-based layers for refined spatial–temporal modeling. Leveraging the Mamba architecture, known for its global receptive field and linear complexity, Memory-MambaNav can efficiently extract and process memory knowledge from accumulated historical observations. To enhance spatial modeling, we introduce the Memory Spatial Difference State Space Model (MSD-SSM) to address the limitations of previous CNN and Transformer-based models in terms of receptive field and computational demand. For temporal modeling, the proposed Memory Temporal Serialization SSM (MTS-SSM) leverages Mamba’s selective scanning capabilities in a cross-temporal manner, enhancing the model’s temporal understanding and interaction with bi-temporal features. We also integrate memory-aggregated egocentric obstacle-awareness embeddings (MEOE) and memory-based fine-grained rewards into our end-to-end policy training, which improve obstacle understanding and accelerate convergence by fully utilizing memory knowledge. Our experiments on the AI2-Thor dataset confirm the benefits and superior performance of proposed Memory-MambaNav, demonstrating Mamba’s potential in ObjectNav, particularly in long-horizon trajectories. All demonstration videos referenced in this paper can be viewed on the webpage (<span><span>https://sunleyuan.github.io/Memory-MambaNav</span><svg><path></path></svg></span>).</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105522"},"PeriodicalIF":4.2,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DFDW: Distribution-aware Filter and Dynamic Weight for open-mixed-domain Test-time adaptation
IF 4.2 · CAS Zone 3 · Computer Science
Image and Vision Computing Pub Date: 2025-03-27 DOI: 10.1016/j.imavis.2025.105521
Mingwen Shao , Xun Shao , Lingzhuang Meng , Yuanyuan Liu
{"title":"DFDW: Distribution-aware Filter and Dynamic Weight for open-mixed-domain Test-time adaptation","authors":"Mingwen Shao ,&nbsp;Xun Shao ,&nbsp;Lingzhuang Meng ,&nbsp;Yuanyuan Liu","doi":"10.1016/j.imavis.2025.105521","DOIUrl":"10.1016/j.imavis.2025.105521","url":null,"abstract":"<div><div>Test-time adaptation (TTA) aims to adapt the pre-trained model to the unlabeled test data stream during inference. However, existing state-of-the-art TTA methods typically achieve superior performance in closed-set scenarios, and often underperform in more challenging open mixed-domain TTA scenarios. This can be attributed to ignoring two uncertainties: domain non-stationarity and semantic shifts, leading to inaccurate estimation of data distribution and unreliable model confidence. To alleviate the aforementioned issue, we propose a universal TTA method based on a Distribution-aware Filter and Dynamic Weight, called DFDW. Specifically, in order to improve the model’s discriminative ability to data distribution, our DFDW first designs a distribution-aware threshold to filter known and unknown samples from the test data, and then separates them based on contrastive learning. Furthermore, to improve the confidence and generalization of the model, we designed a dynamic weight consisting of category-reliable weight and diversity weight. Among them, category-reliable weight uses prior average predictions to enhance the guidance of high-confidence samples, and diversity weight uses negative information entropy to increase the influence of diversity samples. Based on the above approach, the model can accurately identify the distribution of semantic shift samples, and widely adapt to the diversity samples in the non-stationary domain. Extensive experiments on CIFAR and ImageNet-C benchmarks show the superiority of our DFDW.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105521"},"PeriodicalIF":4.2,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143776758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Image–text feature learning for unsupervised visible–infrared person re-identification
IF 4.2 · CAS Zone 3 · Computer Science
Image and Vision Computing Pub Date: 2025-03-26 DOI: 10.1016/j.imavis.2025.105520
Jifeng Guo , Zhiqi Pang
{"title":"Image–text feature learning for unsupervised visible–infrared person re-identification","authors":"Jifeng Guo ,&nbsp;Zhiqi Pang","doi":"10.1016/j.imavis.2025.105520","DOIUrl":"10.1016/j.imavis.2025.105520","url":null,"abstract":"<div><div>Visible–infrared person re-identification (VI-ReID) focuses on matching infrared and visible images of the same person. To reduce labeling costs, unsupervised VI-ReID (UVI-ReID) methods typically use clustering algorithms to generate pseudo-labels and iteratively optimize the model based on these pseudo-labels. Although existing UVI-ReID methods have achieved promising performance, they often overlook the effectiveness of text semantics in inter-modality matching and modality-invariant feature learning. In this paper, we propose an image–text feature learning (ITFL) method, which not only leverages text semantics to enhance intra-modality identity-related learning but also incorporates text semantics into inter-modality matching and modality-invariant feature learning. Specifically, ITFL first performs modality-aware feature learning to generate pseudo-labels within each modality. Then, ITFL employs modality-invariant text modeling (MTM) to learn a text feature for each cluster in the visible modality, and utilizes inter-modality dual-semantics matching (IDM) to match inter-modality positive clusters. To obtain modality-invariant and identity-related image features, we not only introduce a cross-modality contrastive loss in ITFL to mitigate the impact of modality gaps, but also develop a text semantic consistency loss to further promote modality-invariant feature learning. Extensive experimental results on VI-ReID datasets demonstrate that ITFL not only outperforms existing unsupervised methods but also competes with some supervised approaches.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105520"},"PeriodicalIF":4.2,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143724889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A systematic review of intermediate fusion in multimodal deep learning for biomedical applications
IF 4.2 · CAS Zone 3 · Computer Science
Image and Vision Computing Pub Date: 2025-03-25 DOI: 10.1016/j.imavis.2025.105509
Valerio Guarrasi , Fatih Aksu , Camillo Maria Caruso , Francesco Di Feola , Aurora Rofena , Filippo Ruffini , Paolo Soda
{"title":"A systematic review of intermediate fusion in multimodal deep learning for biomedical applications","authors":"Valerio Guarrasi ,&nbsp;Fatih Aksu ,&nbsp;Camillo Maria Caruso ,&nbsp;Francesco Di Feola ,&nbsp;Aurora Rofena ,&nbsp;Filippo Ruffini ,&nbsp;Paolo Soda","doi":"10.1016/j.imavis.2025.105509","DOIUrl":"10.1016/j.imavis.2025.105509","url":null,"abstract":"<div><div>Deep learning has revolutionized biomedical research by providing sophisticated methods to handle complex, high-dimensional data. Multimodal deep learning (MDL) further enhances this capability by integrating diverse data types such as imaging, textual data, and genetic information, leading to more robust and accurate predictive models. In MDL, differently from early and late fusion methods, intermediate fusion stands out for its ability to effectively combine modality-specific features during the learning process. This systematic review comprehensively analyzes and formalizes current intermediate fusion methods in biomedical applications, highlighting their effectiveness in improving predictive performance and capturing complex inter-modal relationships. We investigate the techniques employed, the challenges faced, and potential future directions for advancing intermediate fusion methods. Additionally, we introduce a novel structured notation that standardizes intermediate fusion architectures, enhancing understanding and facilitating implementation across various domains. Our findings provide actionable insights and practical guidelines intended to support researchers, healthcare professionals, and the broader deep learning community in developing more sophisticated and insightful multimodal models. Through this review, we aim to provide a foundational framework for future research and practical applications in the dynamic field of MDL.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105509"},"PeriodicalIF":4.2,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fusing grid and adaptive region features for image captioning
IF 4.2 · CAS Zone 3 · Computer Science
Image and Vision Computing Pub Date: 2025-03-24 DOI: 10.1016/j.imavis.2025.105513
Jiahui Wei , Zhixin Li , Canlong Zhang , Huifang Ma
{"title":"Fusing grid and adaptive region features for image captioning","authors":"Jiahui Wei ,&nbsp;Zhixin Li ,&nbsp;Canlong Zhang ,&nbsp;Huifang Ma","doi":"10.1016/j.imavis.2025.105513","DOIUrl":"10.1016/j.imavis.2025.105513","url":null,"abstract":"<div><div>Image captioning aims to automatically generate grammatically correct and reasonable description sentences for given images. Improving feature optimization and processing is crucial for enhancing performance in this task. A common approach is to leverage the complementary advantages of grid features and region features. However, incorporating region features in most current methods may lead to incorrect guidance during training, along with high acquisition costs and the requirement of pre-caching. These factors impact the effectiveness and practical application of image captioning. To address these limitations, this paper proposes a method called fusing grid and adaptive region features for image captioning (FGAR). FGAR dynamically explores pseudo-region information within a given image based on the extracted grid features. Subsequently, it utilizes a combination of computational layers with varying permissions to fuse features, enabling comprehensive interaction between information from different modalities while preserving the unique characteristics of each modality. The resulting enhanced visual features provide improved support to the decoder for autoregressively generating sentences describing the content of a given image. All processes are integrated within a fully end-to-end framework, facilitating both training and inference processes while achieving satisfactory performance. Extensive experiments validate the effectiveness of the proposed FGAR method.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105513"},"PeriodicalIF":4.2,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143705248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Stealth sight: A multi perspective approach for camouflaged object detection
IF 4.2 · CAS Zone 3 · Computer Science
Image and Vision Computing Pub Date: 2025-03-24 DOI: 10.1016/j.imavis.2025.105517
Domnic S., Jayanthan K.S.
{"title":"Stealth sight: A multi perspective approach for camouflaged object detection","authors":"Domnic S.,&nbsp;Jayanthan K.S.","doi":"10.1016/j.imavis.2025.105517","DOIUrl":"10.1016/j.imavis.2025.105517","url":null,"abstract":"<div><div>Camouflaged object detection (COD) is a challenging task due to the inherent similarity between objects and their surroundings. This paper introduces <strong>Stealth Sight</strong>, a novel framework integrating multi-view feature fusion and depth-based refinement to enhance segmentation accuracy. Our approach incorporates a pretrained multi-view CLIP encoder and a depth extraction network, facilitating robust feature representation. Additionally, we introduce a cross-attention transformer decoder and a post-training pruning mechanism to improve efficiency. Extensive evaluations on benchmark datasets demonstrate that Stealth Sight outperforms state-of-the-art methods in camouflaged object segmentation. Our method significantly enhances detection in complex environments, making it applicable to medical imaging, security, and wildlife monitoring.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105517"},"PeriodicalIF":4.2,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143705247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Advanced deep learning and large language models: Comprehensive insights for cancer detection
IF 4.2 · CAS Zone 3 · Computer Science
Image and Vision Computing Pub Date: 2025-03-24 DOI: 10.1016/j.imavis.2025.105495
Yassine Habchi , Hamza Kheddar , Yassine Himeur , Adel Belouchrani , Erchin Serpedin , Fouad Khelifi , Muhammad E.H. Chowdhury
{"title":"Advanced deep learning and large language models: Comprehensive insights for cancer detection","authors":"Yassine Habchi ,&nbsp;Hamza Kheddar ,&nbsp;Yassine Himeur ,&nbsp;Adel Belouchrani ,&nbsp;Erchin Serpedin ,&nbsp;Fouad Khelifi ,&nbsp;Muhammad E.H. Chowdhury","doi":"10.1016/j.imavis.2025.105495","DOIUrl":"10.1016/j.imavis.2025.105495","url":null,"abstract":"<div><div>In recent years, the rapid advancement of machine learning (ML), particularly deep learning (DL), has revolutionized various fields, with healthcare being one of the most notable beneficiaries. DL has demonstrated exceptional capabilities in addressing complex medical challenges, including the early detection and diagnosis of cancer. Its superior performance, surpassing both traditional ML methods and human accuracy, has made it a critical tool in identifying and diagnosing diseases such as cancer. Despite the availability of numerous reviews on DL applications in healthcare, a comprehensive and detailed understanding of DL’s role in cancer detection remains lacking. Most existing studies focus on specific aspects of DL, leaving significant gaps in the broader knowledge base. This paper aims to bridge these gaps by offering a thorough review of advanced DL techniques, namely transfer learning (TL), reinforcement learning (RL), federated learning (FL), Transformers, and large language models (LLMs). These cutting-edge approaches are pushing the boundaries of cancer detection by enhancing model accuracy, addressing data scarcity, and enabling decentralized learning across institutions while maintaining data privacy. TL enables the adaptation of pre-trained models to new cancer datasets, significantly improving performance with limited labeled data. RL is emerging as a promising method for optimizing diagnostic pathways and treatment strategies, while FL ensures collaborative model development without sharing sensitive patient data. Furthermore, Transformers and LLMs, traditionally utilized in natural language processing (NLP), are now being applied to medical data for enhanced interpretability and context-based predictions. In addition, this review explores the efficiency of the aforementioned techniques in cancer diagnosis, it addresses key challenges such as data imbalance, and proposes potential solutions. It aims to be a valuable resource for researchers and practitioners, offering insights into current trends and guiding future research in the application of advanced DL techniques for cancer detection.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105495"},"PeriodicalIF":4.2,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143704454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Spatio-temporal information mining and fusion feature-guided modal alignment for video-based visible-infrared person re-identification
IF 4.2 · CAS Zone 3 · Computer Science
Image and Vision Computing Pub Date: 2025-03-22 DOI: 10.1016/j.imavis.2025.105518
Zhigang Zuo, Huafeng Li, Yafei Zhang, Minghong Xie
{"title":"Spatio-temporal information mining and fusion feature-guided modal alignment for video-based visible-infrared person re-identification","authors":"Zhigang Zuo,&nbsp;Huafeng Li,&nbsp;Yafei Zhang,&nbsp;Minghong Xie","doi":"10.1016/j.imavis.2025.105518","DOIUrl":"10.1016/j.imavis.2025.105518","url":null,"abstract":"<div><div>The video-based visible-infrared person re-identification (Re-ID) aims to recognize the same person across modalities through video sequences. The core challenges of this task lie in narrowing the modal differences and deeply mining the rich spatio-temporal information contained in video to enhance model performance. However, existing research primarily focuses on addressing the modality gap, with insufficient utilization of the spatio-temporal information in video sequences. To address this, this paper proposes a novel spatio-temporal information mining and fusion feature-guided modal alignment framework for video-based visible-infrared person Re-ID. Specifically, we propose a spatio-temporal information mining method. This method employs the proposed feature correlation mechanism to enhance the discriminative features of person across different frames, while utilizing a temporal Transformer to mine person motion features. The advantage of this method lies in its ability to alleviate issues such as occlusion and frame misalignment, improving the discriminability of person features. Additionally, we introduce a fusion modality-guided modal alignment strategy, which reduces modality differences between infrared and visible video frames by aligning single-modality features with fusion features. The advantage of this strategy is that each modality not only learns its specific features but also absorbs person information from the other modality, thereby alleviating modality differences and further enhancing the discriminability of person features. Extensive comparative and ablation experiments conducted on the HITSZ-VCM and BUPTCampus datasets confirm the effectiveness and superiority of the proposed framework. The source code is available at <span><span>https://github.com/lhf12278/SIMFGA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105518"},"PeriodicalIF":4.2,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143697064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MFKD: Multi-dimensional feature alignment for knowledge distillation
IF 4.2 · CAS Zone 3 · Computer Science
Image and Vision Computing Pub Date: 2025-03-22 DOI: 10.1016/j.imavis.2025.105514
Zhen Guo , Pengzhou Zhang , Peng Liang
{"title":"MFKD: Multi-dimensional feature alignment for knowledge distillation","authors":"Zhen Guo ,&nbsp;Pengzhou Zhang ,&nbsp;Peng Liang","doi":"10.1016/j.imavis.2025.105514","DOIUrl":"10.1016/j.imavis.2025.105514","url":null,"abstract":"<div><div>Knowledge distillation is a popular technique for compressing and transferring models in the field of deep learning. However, existing distillation methods often focus on optimizing a single dimension and overlook the importance of aligning and transforming knowledge across multiple dimensions, leading to suboptimal results. In this article, we introduce a novel approach called multi-dimensional feature alignment for knowledge distillation (MFKD) to address this limitation. The MFKD framework is built on the observation that knowledge from different dimensions can complement each other effectively. We extract knowledge from features in the spatcial, sample and channel dimensions separately. Our spatial-level part separates the foreground and background information, guiding the student to focus on crucial image regions by mimicking the teacher’s spatial and channel attention maps. Our sample-level part distills knowledge encoded in semantic correlations between sample activations by aligning the student’s activations to emulate the teacher’s clustering patterns using the Spearman correlation coefficient. Furthermore, our channel-level part encourages the student to learn standardized feature representations aligned with the teacher’s channel-wise interdependencies. Finally, we dynamically balance the loss factors of the different dimensions to optimize the overall performance of the distillation process. To validate the effectiveness of our methodology, we conduct experiments on benchmark datasets such as CIFAR-100, ImageNet and COCO. The experimental results demonstrate substantial performance improvements compared to baseline and recent state-of-the-art methods, confirming the efficacy of our MFKD framework. Furthermore, we provide a comprehensive analysis of the experimental results, offering deeper insight into the benefits and effectiveness of our approach. Through this analysis, we reinforce the significance of aligning and leveraging knowledge across multiple dimensions in knowledge distillation.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105514"},"PeriodicalIF":4.2,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143680561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Attention head purification: A new perspective to harness CLIP for domain generalization
IF 4.2 · CAS Zone 3 · Computer Science
Image and Vision Computing Pub Date: 2025-03-22 DOI: 10.1016/j.imavis.2025.105511
Yingfan Wang, Guoliang Kang
{"title":"Attention head purification: A new perspective to harness CLIP for domain generalization","authors":"Yingfan Wang,&nbsp;Guoliang Kang","doi":"10.1016/j.imavis.2025.105511","DOIUrl":"10.1016/j.imavis.2025.105511","url":null,"abstract":"<div><div>Domain Generalization (DG) aims to learn a model from multiple source domains to achieve satisfactory performance on unseen target domains. Recent works introduce CLIP to DG tasks due to its superior image-text alignment and zeros-shot performance. Previous methods either utilize full fine-tuning or prompt-learning paradigms to harness CLIP for DG tasks. Those works focus on avoiding catastrophic forgetting of the original knowledge encoded in CLIP but ignore that the knowledge encoded in CLIP in nature may contain domain-specific cues that constrain its domain generalization performance. In this paper, we propose a new perspective to harness CLIP for DG, <em>i.e.,</em> attention head purification. We observe that different attention heads may encode different properties of an image and selecting heads appropriately may yield remarkable performance improvement across domains. Based on such observations, we purify the attention heads of CLIP from two levels, including <em>task-level purification</em> and <em>domain-level purification</em>. For task-level purification, we design head-aware LoRA to make each head more adapted to the task we considered. For domain-level purification, we perform head selection via a simple gating strategy. We utilize MMD loss to encourage masked head features to be more domain-invariant to emphasize more generalizable properties/heads. During training, we jointly perform task-level purification and domain-level purification. We conduct experiments on various representative DG benchmarks. Though simple, extensive experiments demonstrate that our method performs favorably against previous state-of-the-arts.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105511"},"PeriodicalIF":4.2,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143705249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0