{"title":"Attention head purification: A new perspective to harness CLIP for domain generalization","authors":"Yingfan Wang, Guoliang Kang","doi":"10.1016/j.imavis.2025.105511","DOIUrl":"10.1016/j.imavis.2025.105511","url":null,"abstract":"<div><div>Domain Generalization (DG) aims to learn a model from multiple source domains that achieves satisfactory performance on unseen target domains. Recent works introduce CLIP to DG tasks due to its superior image-text alignment and zero-shot performance. Previous methods harness CLIP for DG through either full fine-tuning or prompt-learning paradigms. These works focus on avoiding catastrophic forgetting of the original knowledge encoded in CLIP, but ignore that this knowledge may itself contain domain-specific cues that constrain domain generalization performance. In this paper, we propose a new perspective to harness CLIP for DG, <em>i.e.,</em> attention head purification. We observe that different attention heads may encode different properties of an image, and selecting heads appropriately may yield remarkable performance improvements across domains. Based on these observations, we purify the attention heads of CLIP at two levels: <em>task-level purification</em> and <em>domain-level purification</em>. For task-level purification, we design head-aware LoRA to make each head better adapted to the task at hand. For domain-level purification, we perform head selection via a simple gating strategy, utilizing an MMD loss to encourage masked head features to be more domain-invariant and thus emphasize the more generalizable properties/heads. During training, we perform task-level and domain-level purification jointly. We conduct experiments on various representative DG benchmarks. Despite its simplicity, extensive experiments demonstrate that our method performs favorably against previous state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105511"},"PeriodicalIF":4.2,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143705249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
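The domain-level purification above rests on an MMD penalty that pulls the masked head features of different domains toward a common distribution. As a hedged illustration only (not the authors' implementation), a biased RBF-kernel estimate of squared MMD can be computed as follows; the function names, `gamma`, and the use of raw NumPy arrays are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # Pairwise RBF kernel between rows of x (n, d) and y (m, d).
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd2(x, y, gamma=1.0):
    # Biased estimate of squared Maximum Mean Discrepancy between two
    # samples; it is 0 when the samples coincide and grows as the
    # underlying distributions move apart.
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())
```

Minimizing such a quantity over gated head features from two source domains would drive those features to be domain-invariant, which is the stated goal of the gating strategy.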
{"title":"MrgaNet: Multi-scale recursive gated aggregation network for tracheoscopy images","authors":"Ying Wang , Yun Tie , Dalong Zhang , Fenghui Liu , Lin Qi","doi":"10.1016/j.imavis.2025.105503","DOIUrl":"10.1016/j.imavis.2025.105503","url":null,"abstract":"<div><div>Lung cancer is a potentially fatal disease worldwide, and improving diagnostic accuracy plays a key role in enhancing patient outcomes. In this study, we extend computer-aided diagnosis to the task of assisting tracheoscopy in predicting lung cancer subtypes. To solve the problem of fusing information across different spatial scales and channels, we propose MrgaNet. The network enhances classification performance by expanding interactions from low to high orders, dynamically adjusting feature weights, and incorporating a channel competition operator for efficient feature selection. Our network achieved a precision of 0.87 on the endobronchial dataset. In addition, accuracies of 89.25% and 96.76% were achieved on the Kvasir-v2 and Kvasir-Capsule datasets, respectively. The results demonstrate that MrgaNet achieves superior performance compared to existing methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105503"},"PeriodicalIF":4.2,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143715155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Part-aware distillation and aggregation network for human parsing","authors":"Yuntian Lai, Yuxin Feng, Fan Zhou, Zhuo Su","doi":"10.1016/j.imavis.2025.105504","DOIUrl":"10.1016/j.imavis.2025.105504","url":null,"abstract":"<div><div>The current state-of-the-art human parsing models achieve remarkable parsing accuracy. However, their huge model size and computational cost restrict their application in low-latency online systems and on resource-limited mobile devices. In this paper, we propose a novel part-aware distillation and aggregation network for human parsing, which can be applied to any human parsing model to achieve a good trade-off between accuracy and efficiency. We design part key-point similarity distillation and part distribution distillation to transfer the complex teacher model’s knowledge of part structure and spatial relationships to the lightweight student model, helping the latter better identify small parts and semantic boundaries and distinguish easily confused categories. Furthermore, an online model aggregation module is introduced in the later stages of training, which can mitigate noise from both the teacher and the labels to obtain smoother and more robust results. Extensive experiments and ablation studies on the large-scale popular human parsing datasets LIP, ATR, and PASCAL-Person-Part fully demonstrate that our method is accurate, lightweight, and general.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105504"},"PeriodicalIF":4.2,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DDMCB: Open-world object detection empowered by Denoising Diffusion Models and Calibration Balance","authors":"Yangyang Huang, Xing Xi, Ronghua Luo","doi":"10.1016/j.imavis.2025.105508","DOIUrl":"10.1016/j.imavis.2025.105508","url":null,"abstract":"<div><div>Open-world object detection (OWOD) differs from traditional object detection by being better suited to real-world, dynamic scenarios. It aims to recognize unseen objects and to learn incrementally from newly introduced knowledge. However, current OWOD methods usually rely on supervision from known objects to identify unknown ones, using high objectness scores as key indicators of potential unknown objects. While these methods can detect unknown objects whose features resemble known objects, they also classify regions dissimilar to known objects as background, leading to label-bias issues. To address this problem, we leverage knowledge from large vision models to provide auxiliary supervision for unknown objects. Additionally, we apply the Denoising Diffusion Probabilistic Model (DDPM) to OWOD scenarios, proposing a DDPM-based unsupervised modeling approach that significantly improves the accuracy of unknown object detection. Despite this, the classifier encounters only known classes during training, resulting in higher confidence for known classes during inference; thus, bias issues arise again. Therefore, we propose a probability calibration technique for post-processing predictions during inference. The calibration reduces the probabilities of known objects and increases those of unknown objects, thereby balancing the final probability predictions. Our experiments demonstrate that the proposed method achieves significant improvements on OWOD benchmarks, with an unknown-object detection recall of <strong>54.7 U-Recall</strong>, surpassing the current state-of-the-art (SOTA) methods by <strong>44.3%</strong>. In terms of real-time performance, our model uses few parameters and pure convolutional neural networks instead of intensive attention mechanisms, achieving an inference speed of <strong>35.04 FPS</strong> and exceeding the SOTA OWOD methods based on Faster R-CNN and Deformable DETR by <strong>2.79</strong> and <strong>10.95 FPS</strong>, respectively.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105508"},"PeriodicalIF":4.2,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143680650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
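The calibration step described in this abstract shifts probability mass from known classes toward the unknown class at inference time. The abstract does not specify the exact scheme, so the sketch below is a minimal illustrative version under the assumption of a single post-softmax rescaling; the function name, `unknown_idx`, and `alpha` are all hypothetical:

```python
import numpy as np

def calibrate_probs(probs, unknown_idx, alpha=0.3):
    # Move a fraction alpha of the known-class probability mass to the
    # unknown class, returning a distribution that still sums to 1.
    p = np.asarray(probs, dtype=float).copy()
    known = np.arange(p.size) != unknown_idx
    moved = alpha * p[known].sum()
    p[known] *= (1.0 - alpha)
    p[unknown_idx] += moved
    return p
```

Any monotone rescaling of this form lowers known-class confidence and raises unknown-class confidence without changing the ranking among known classes.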
{"title":"Self-supervised monocular depth learning from unknown cameras: Leveraging the power of raw data","authors":"Xiaofei Qin , Yongchao Zhu , Lin Wang , Xuedian Zhang , Changxiang He , Qiulei Dong","doi":"10.1016/j.imavis.2025.105505","DOIUrl":"10.1016/j.imavis.2025.105505","url":null,"abstract":"<div><div>Self-supervised monocular depth estimation from wild videos with unknown camera intrinsics is a practical and challenging task in computer vision. Most existing methods in the literature employ a camera decoder and a pose decoder to estimate camera intrinsics and poses, respectively; however, their performance degrades significantly in many complex scenarios with severe noise and large camera rotations. To address this problem, we propose a novel self-supervised monocular depth estimation method that can be trained from wild videos with a joint optimization strategy for simultaneously estimating camera intrinsics and poses. In the proposed method, a depth encoder learns scene depth features, and, taking these features as inputs, a Neighborhood Influence Module (NIM) predicts each pixel’s depth by fusing the depths of its neighboring pixels, explicitly improving depth accuracy. In addition, a knowledge distillation mechanism is introduced to learn a lightweight depth encoder from a large-scale one, achieving a balance between computational speed and accuracy. Experimental results on four public datasets demonstrate that the proposed method outperforms state-of-the-art methods in most cases. Moreover, when trained on a mixed set of different datasets, the method achieves a further performance boost compared to training on each individual dataset. Codes are available at: <span><span>https://github.com/ZhuYongChaoUSST/IntrLessMonoDepth</span></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105505"},"PeriodicalIF":4.2,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143680648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
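The lightweight-encoder distillation mentioned above pairs a small student encoder with a large teacher. The abstract does not give the loss, so the following is only a sketch of a standard feature-matching distillation term; the linear projection `proj` (mapping student width to teacher width) and the function name are assumptions, not the paper's formulation:

```python
import numpy as np

def distill_feature_loss(student_feat, teacher_feat, proj):
    # Mean-squared error between the student's features, linearly projected
    # to the teacher's feature width, and the fixed teacher features.
    # student_feat: (n, d_s), proj: (d_s, d_t), teacher_feat: (n, d_t).
    return ((student_feat @ proj - teacher_feat) ** 2).mean()
```

In practice such a term is added to the self-supervised photometric loss, letting the student inherit the teacher's representation while remaining cheap at inference time.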
{"title":"Rif-Diff: Improving image fusion based on diffusion model via residual prediction","authors":"Peixuan Wu, Shen Yang, Jin Wu, Qian Li","doi":"10.1016/j.imavis.2025.105494","DOIUrl":"10.1016/j.imavis.2025.105494","url":null,"abstract":"<div><div>This paper proposes Rif-Diff, an image fusion framework that adopts several strategies to improve current diffusion-model-based fusion methods. Rif-Diff employs residual images as the generation target of the diffusion model to ease the model’s convergence and enhance fusion performance. For fusion tasks lacking ground truth, an image fusion prior is utilized to produce the residual images. Meanwhile, to overcome the limitation that training with the image fusion prior places on the model’s learning capacity, Rif-Diff introduces ideas from image restoration so that the initial fused images incorporate more of the expected information. Additionally, a dual-step decision module is designed to address the blurriness of fused images in existing multi-focus image fusion methods that do not rely on decision maps. Extensive experiments demonstrate the effectiveness of Rif-Diff across multiple fusion tasks, including multi-focus, multi-exposure, and infrared-visible image fusion. The code is available at: <span><span>https://github.com/peixuanWu/Rif-Diff</span></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105494"},"PeriodicalIF":4.2,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143680651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Active Domain Adaptation: Balancing Uncertainty and Diversity","authors":"Qing Tian , Yanzhi Li , Jiangsen Yu , Junyu Shen , Weihua Ou","doi":"10.1016/j.imavis.2025.105492","DOIUrl":"10.1016/j.imavis.2025.105492","url":null,"abstract":"<div><div>In machine learning applications, the test data distribution is often inconsistent with that of the model’s training data, i.e., the data are not independent and identically distributed. To address this challenge with limited annotation effort, the paradigm of Active Domain Adaptation (ADA) has been proposed: some target instances are selectively labeled to facilitate cross-domain alignment at minimal annotation cost. However, existing ADA methods often struggle to balance uncertainty and diversity in sample selection, limiting their effectiveness. To address this, we propose a novel ADA framework, Balancing Uncertainty and Diversity (ADA-BUD), which achieves ADA while balancing data uncertainty and diversity across domains. Specifically, in ADA-BUD, the Uncertainty Range Perception (URA) module is designed to identify the most informative but uncertain target instances for annotation, appraising not only each instance itself but also its neighbors. Subsequently, a Representative Energy Optimization (REO) module refines the diversity of the resulting set of annotated instances. Finally, to enhance the flexibility of ADA-BUD in scenarios with limited data, we further build a Dynamic Sample Enhancement (DSE) module to generate class-balanced, label-confident augmented data. Experiments show that ADA-BUD outperforms existing methods on challenging benchmarks, demonstrating its practical potential.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"158 ","pages":"Article 105492"},"PeriodicalIF":4.2,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Strengthening incomplete multi-view clustering: An attention contrastive learning method","authors":"Shudong Hou, Lanlan Guo, Xu Wei","doi":"10.1016/j.imavis.2025.105493","DOIUrl":"10.1016/j.imavis.2025.105493","url":null,"abstract":"<div><div>Incomplete multi-view clustering presents greater challenges than traditional multi-view clustering. Although significant progress has been made in this field in recent years, multi-view clustering relies on the consistency and completeness of views to ensure the accurate transmission of data information. During data collection and transmission, however, data loss is inevitable, leading to partially missing views and increasing the difficulty of joint learning on incomplete multi-view data. To address this issue, we propose a multi-view contrastive learning framework based on the attention mechanism. Previous contrastive learning methods mainly focused on the relationships between isolated sample pairs, which limits robustness. Our method selects positive samples from both global and local perspectives by utilizing a nearest-neighbor graph to maximize the correlation between the local features and latent features of each view. Additionally, we use a cross-view encoder network with a self-attention structure to fuse the low-dimensional representations of each view into a joint representation, and we guide the learning of the joint representation through a high-confidence structure. Furthermore, we introduce graph-constraint learning to explore potential neighbor relationships among instances and facilitate data reconstruction. Experimental results on six multi-view datasets demonstrate that our method is significantly more effective than existing methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105493"},"PeriodicalIF":4.2,"publicationDate":"2025-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143680649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
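Selecting contrastive positives via a nearest-neighbor graph, as this abstract describes, can be sketched as a cosine-similarity kNN lookup. This is an illustrative reconstruction only, not the authors' code; the function name and choice of `k` are assumptions:

```python
import numpy as np

def knn_positives(feats, k=1):
    # Build a cosine-similarity kNN graph: for each row of feats (n, d),
    # return the indices of its k most similar other rows, which a
    # contrastive loss would then treat as positive pairs.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)        # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]
```

Drawing positives from neighbors rather than only from augmented copies of the same sample is what lets the method relate local features across instances instead of isolated sample pairs.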
{"title":"Multi-modal Few-shot Image Recognition with enhanced semantic and visual integration","authors":"Chunru Dong, Lizhen Wang, Feng Zhang, Qiang Hua","doi":"10.1016/j.imavis.2025.105490","DOIUrl":"10.1016/j.imavis.2025.105490","url":null,"abstract":"<div><div>Few-Shot Learning (FSL) enables models to recognize new classes with only a few examples by leveraging knowledge from known classes. Although some methods incorporate class names as prior knowledge, effectively integrating visual and semantic information remains challenging. Additionally, conventional similarity measurement techniques often result in information loss, obscure distinctions between samples, and fail to capture intra-sample diversity. To address these issues, this paper presents a Multi-modal Few-shot Image Recognition (MFSIR) approach. We first introduce the Multi-Scale Interaction Module (MSIM), which facilitates multi-scale interactions between semantic and visual features, significantly enhancing the representation of visual features. We also propose the Hybrid Similarity Measurement Module (HSMM), which integrates information from multiple dimensions to evaluate the similarity between samples by dynamically adjusting the weights of various similarity measurement methods, thereby improving the accuracy and robustness of similarity assessments. Experimental results demonstrate that our approach significantly outperforms existing methods on four FSL benchmarks, with marked improvements in accuracy under 1-shot and 5-shot scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105490"},"PeriodicalIF":4.2,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143644197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Object tracking based on temporal and spatial context information","authors":"Yan Chen, Tao Lin, Jixiang Du, Hongbo Zhang","doi":"10.1016/j.imavis.2025.105488","DOIUrl":"10.1016/j.imavis.2025.105488","url":null,"abstract":"<div><div>Currently, numerous advanced trackers improve stability by optimizing target visual appearance models or by improving interactions between templates and search areas. Despite these advancements, appearance-based trackers still depend primarily on the visual information of targets without adequately integrating spatio-temporal context, limiting their effectiveness in handling objects similar to the target. To address this challenge, we introduce TSCTrack, a novel object tracking method that leverages spatio-temporal context information. TSCTrack overcomes the shortcomings of traditional center-cropping preprocessing by introducing Global Spatial Position Embedding, effectively preserving spatial information and capturing target motion. Additionally, TSCTrack incorporates a Spatial Relationship Aggregation module and a Temporal Relationship Aggregation module: the former captures static spatial context per frame, while the latter integrates dynamic temporal context. This integration allows the Dynamic Tracking Prediction module to generate precise target coordinates, greatly reducing the impact of target deformations and scale changes on tracking performance. Evaluations on multiple public tracking datasets, including LaSOT, TrackingNet, UAV123, GOT-10k, and OTB, demonstrate TSCTrack’s superior performance across diverse scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"157 ","pages":"Article 105488"},"PeriodicalIF":4.2,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143644198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}