{"title":"Two-stage attribute-guided dual attention network for fine-grained fashion retrieval","authors":"Bo Pan, Jun Xiang, Ning Zhang, Ruru Pan","doi":"10.1016/j.cviu.2025.104497","DOIUrl":"10.1016/j.cviu.2025.104497","url":null,"abstract":"<div><div>Fine-grained clothing retrieval is essential for intelligent shopping and personalized recommendation systems. However, conventional methods often fail to capture subtle attribute variations. This paper proposes a novel two-stage attribute-guided dual attention network. The network combines global and local feature extraction with Attribute-aware Multi-Scale Spatial Attention (AMSA) and Attribute-guided Dynamic Channel Attention (ADCA). AMSA captures attribute-specific spatial details at multiple scales, while ADCA dynamically adjusts channel importance based on attribute embeddings, enabling precise attribute-level similarity modeling. A multi-level joint loss function further optimizes both global and local representations and enhances feature alignment. Experiments on FashionAI and the self-built FGDress dataset show that the proposed method achieves mAP scores of 66.01% and 73.98%, respectively, outperforming baseline approaches. Attribute-level analysis confirms robust recognition of both well-defined and challenging attributes. These results validate the practicality and generalizability of the proposed framework, with promising applications in personalized recommendation, fashion trend analysis, and design evaluation.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104497"},"PeriodicalIF":3.5,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145097946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"JSF: A joint spatial-frequency domain network for low-light image enhancement","authors":"Yahong Wu , Feng Liu , Rong Wang","doi":"10.1016/j.cviu.2025.104496","DOIUrl":"10.1016/j.cviu.2025.104496","url":null,"abstract":"<div><div>The enhancement of low-light images remains a prominent focus in the field of image processing. The degree of lightness significantly influences vision-based intelligent recognition and analysis. Departing from conventional methods, this paper proposes an innovative joint spatial-frequency domain network for low-light image enhancement, referred to as JSF. In the spatial domain, brightness is optimized through the amalgamation of global and local information. In the frequency domain, noise is reduced and details are amplified using Fourier Transformation to carry out amplitude and phase enhancement. Additionally, the enhanced results from the aforementioned domains are fused by linear and nonlinear stretching. To validate the effectiveness of JSF, this paper presents both qualitative and quantitative comparison results, demonstrating its superiority over several existing state-of-the-art methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104496"},"PeriodicalIF":3.5,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145097945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint multi-dimensional dynamic attention and transformer for general image restoration","authors":"Huan Zhang , Xu Zhang , Nian Cai , Jianglei Di , Yun Zhang","doi":"10.1016/j.cviu.2025.104491","DOIUrl":"10.1016/j.cviu.2025.104491","url":null,"abstract":"<div><div>Outdoor images often suffer from severe degradation due to rain, haze, and noise, impairing image quality and challenging high-level tasks. Current image restoration methods struggle to handle complex degradation while maintaining efficiency. This paper introduces a novel image restoration architecture that combines multi-dimensional dynamic attention and self-attention within a U-Net framework. To leverage the global modeling capabilities of transformers and the local modeling capabilities of convolutions, we integrate sole CNNs in the encoder–decoder and sole transformers in the latent layer. Additionally, we design convolutional kernels with selected multi-dimensional dynamic attention to capture diverse degraded inputs efficiently. A transformer block with transposed self-attention further enhances global feature extraction while maintaining efficiency. Extensive experiments demonstrate that our method achieves a better balance between performance and computational complexity across five image restoration tasks: deraining, deblurring, denoising, dehazing, and enhancement, as well as superior performance for high-level vision tasks. The source code will be available at <span><span>https://github.com/House-yuyu/MDDA-former</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104491"},"PeriodicalIF":3.5,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145097943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Camera pose in SfT and NRSfM under isometric and weaker deformation models","authors":"Adrien Bartoli , Agniva Sengupta","doi":"10.1016/j.cviu.2025.104488","DOIUrl":"10.1016/j.cviu.2025.104488","url":null,"abstract":"<div><div>Camera pose is a very natural concept in 3D vision in the rigid setting. It is however much more difficult to work with in deformable settings. Consequently, numerous deformable reconstruction methods simply ignore camera pose. We analyse the concept of pose in deformable settings and prove that it is unconstrained with the existing formulations, properly justifying the existing pose-less methods reconstructing structure only. We explain this result intuitively by the impossibility to define an intrinsic coordinate frame to a general deforming object. The proposed analysis uses the isometric deformation model and extends to the weaker models including conformality and equiareality We propose a novel prior to rescue camera pose estimation in deformable settings, which attributes the deforming object’s dominant rigid-body motion to the camera. We show that adding this prior to any existing formulation fully constrains camera pose and leads to elegant two-step solution methods, involving deformable structure reconstruction using a base method in the first step, and absolute orientation or Procrustes analysis in the second step. We derive the proposed approach for the template-based and template-less settings, respectively implemented using Shape-from-Template (SfT) and Non-Rigid Structure-from-Motion (NRSfM) as base methods and validate them experimentally, showing that the computed pose is qualitatively and quantitatively plausible.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104488"},"PeriodicalIF":3.5,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145050244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FCNet: A feature complementary network for nighttime flare removal","authors":"Kejing Qi , Bo Wang , Chongyi Li","doi":"10.1016/j.cviu.2025.104495","DOIUrl":"10.1016/j.cviu.2025.104495","url":null,"abstract":"<div><div>Nighttime image flare removal is a very challenging task due to the presence of various types of unfavorable degrading effects, including glare, shimmer, streak and saturated blobs. Most of the existing methods focus on the spatial domain and limited perception field, resulting in incomplete flare removal and severe artifacts. To address these challenges, we propose a two-stage feature complementary network for nighttime flare removal, which is used for flare perception and removal, respectively. In the first stage, a Spatial-Frequency Complementary Module (SFCM) is designed to perceive the flare region from different domains to get a mask of the flare. In the second stage, the flare mask and image are fed into the Spatial-Frequency Complementary Gating Module (SFCGM) to preserve the background information, while removing the flares from different angles and restoring the detailed features. Finally the flare and non-flare regions are modeled by the Flare Interactive Module (FIM) to refine the flare regions at a fine-grained level to suppress the artifact problem. Extensive experiments on Flare 7K++ validate the superiority of the proposed approach over state-of-the-arts, both qualitatively and quantitatively.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104495"},"PeriodicalIF":3.5,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145050245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GRE-Net: A forgery image detection framework based on gradient feature and reconstruction error","authors":"Wenqing Wu , Xinyi Shi , Jinghai Ai , Xiaodong Wang","doi":"10.1016/j.cviu.2025.104494","DOIUrl":"10.1016/j.cviu.2025.104494","url":null,"abstract":"<div><div>With the continuous technological breakthroughs in Generative Adversarial Networks (GANs) and diffusion models, remarkable progress has been achieved in the field of image generation. These technologies enable the creation of highly realistic images, thereby intensifying the risk of spreading fake information. However, traditional image detectors face a growing challenge of inadequate generalization capabilities when confronted with images generated by models that were not included during the training phase. To tackle this challenge, we introduce a novel detection framework, named GRE-Net (Network integrating Gradient and Reconstruction Error), which extracts gradient feature through the DPG module and calculates the reconstruction error utilizing the DIRE method. By integrating these two aspects into a comprehensive feature representation, GRE-Net effectively detects the authenticity of images. Specifically, we devise a dual-branch model that leverages the proposed DPG (Discriminator of ProjectedGAN to extract Gradient) module to extract gradient feature from images and concurrently employs the DIRE (DIffusion Reconstruction Error) method to obtain the diffusion reconstruction error of images. By fusing the features extracted from these two modules as a universal representation, we describe the artifacts produced by generative models, crafting a comprehensive detector capable of identifying both GAN-generated and diffusion model-generated images. Notably, the DPG approach utilizes the discriminator of ProjectedGAN as an intermediary bridge, mapping all data into the gradient domain. This transformation process effectively captures the intrinsic feature differences during the image generation process. Subsequently, the gradient feature are fed into a classifier to achieve efficient discrimination between authentic and fake images. To validate the efficacy of our proposed detector, we conducted evaluations on a dataset comprising images generated by ten diverse diffusion models and GANs. Extensive experiments demonstrate that our detector exhibits stronger generalization capabilities and higher robustness, rendering it suitable for real-world generated image detection tasks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104494"},"PeriodicalIF":3.5,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145027814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mitigating forgetting in the adaptation of CLIP for few-shot classification","authors":"Jiale Cao, Yuanheng Liu, Zhong Ji, Jingren Liu, Aiping Yang, Yanwei Pang","doi":"10.1016/j.cviu.2025.104493","DOIUrl":"10.1016/j.cviu.2025.104493","url":null,"abstract":"<div><div>Adapter-style efficient transfer learning has demonstrated outstanding performance in fine-tuning vision-language models, especially in scenarios with limited data. However, existing methods fail to effectively balance the prior knowledge acquired during the pre-training process and the training samples. To address this problem, we propose a method called Mitigating Forgetting in the Adaptation (MiFA) of CLIP. MiFA first employs class prototypes to represent the most prominent features of a class, and these prototypes provide a robust initialization for the classifier. To overcome the forgetting of prior knowledge, MiFA then leverages a memory module that retains the initial parameters and the parameters of training history by creating a memory weight through momentum. The weight is used to initialize a new classification layer, which, along with the original layer, guides each other to balance prior knowledge and feature adaptation. Similarly, in the text processing branch, a parallel initialization strategy is adopted to ensure that the model’s performance is improved. Text features are employed to initialize a text classification layer, and CLIP logits help prevent excessive forgetting of useful text information. Extensive experiments have demonstrated the effectiveness of our method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104493"},"PeriodicalIF":3.5,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LAM-YOLO: Drones-based small object detection on lighting-occlusion attention mechanism YOLO","authors":"Yuchen Zheng, Yuxin Jing, Jufeng Zhao, Guangmang Cui","doi":"10.1016/j.cviu.2025.104489","DOIUrl":"10.1016/j.cviu.2025.104489","url":null,"abstract":"<div><div>Drone-based target detection presents inherent challenges, including the high density and overlap of targets in drone images, as well as the blurriness of targets under varying lighting conditions, which complicates accurate identification. Traditional methods often struggle to detect numerous small, densely packed targets against complex backgrounds. To address these challenges, we propose LAM-YOLO, an object detection model specifically designed for drone-based applications. First, we introduce a light-occlusion attention mechanism to enhance the visibility of small targets under diverse lighting conditions. Additionally, we incorporate Involution modules to improve feature layer interactions. Second, we employ an improved SIB-IoU as the regression loss function to accelerate model convergence and enhance localization accuracy. Finally, we implement a novel detection strategy by introducing two auxiliary detection heads to better identify smaller-scale targets. Our quantitative results demonstrate that LAM-YOLO outperforms methods such as Faster R-CNN, YOLOv11, and YOLOv12 in terms of [email protected] and [email protected]:0.95 on the VisDrone2019 public dataset. Compared to the original YOLOv8, the average precision increases by 7.1%. Additionally, the proposed SIB-IoU loss function not only accelerates convergence speed during training but also improves average precision compared to the traditional loss function.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104489"},"PeriodicalIF":3.5,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145097944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comprehensive regional guidance for attention map semantics in text-to-image diffusion models","authors":"Haoxuan Wu , Lai-Man Po , Xuyuan Xu , Kun Li , Yuyang Liu , Zeyu Jiang","doi":"10.1016/j.cviu.2025.104492","DOIUrl":"10.1016/j.cviu.2025.104492","url":null,"abstract":"<div><div>Diffusion models have shown remarkable success in image generation tasks. However, accurately interpreting and translating the semantic meaning of input text into coherent visuals remains a significant challenge. We observe that existing approaches often rely on enhancing attention maps in a pixel-based or patch-based manner, which can lead to issues such as non-contiguous regions, unintended region leakage, eventually causing attention maps with limited semantic richness, degrade output quality. To address these limitations, we propose CoRe Diffusion, a novel method that provides comprehensive regional guidance throughout the generation process. Our approach introduces a region-assignment mechanism coupled with a tailored optimization strategy, enabling attention maps to better capture and express semantic information of concepts. Additionally, we incorporate mask guidance during the denoising steps to mitigate region leakage. Through extensive comparisons with state-of-the-art methods and detailed visual analyses, we demonstrate that our approach achieves superior performance, offering a more faithful image generation framework with semantically accurate procedure. Furthermore, our framework offers flexibility by supporting both automatic region assignment and user-defined spatial inputs as conditional guidance, enhancing its adaptability for diverse applications.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104492"},"PeriodicalIF":3.5,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145004208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Edge-aware graph reasoning network for image manipulation localization","authors":"Ruyi Bai","doi":"10.1016/j.cviu.2025.104490","DOIUrl":"10.1016/j.cviu.2025.104490","url":null,"abstract":"<div><div>Convolution networks continue to be the dominant approach in current research on image manipulation localization. However, their inherent sensory field limitation results in the network primarily focusing on local feature extraction, which largely ignores the importance of long-range contextual information. To overcome this limitation, researchers have proposed methods such as multi-scale feature fusion, self-attention mechanisms, and pyramid pooling. Nevertheless, these methods frequently encounter difficulties in feature coupling between tampered and non-tampered regions in practice, which constrains the model performance. To address these issues, we propose an innovative edge-aware graph reasoning network (EGRNet). The core advantage of this network is that it can effectively enhance the similarity of features within the tampered region while weakening the information interaction between the tampered region and non-tampered region, thus enabling the precise localization of the tampered region. The network employs dual-stream encoders, comprising an RGB encoder and an SRM encoder, to extract visual and noise features, respectively. It then utilizes spatial pyramid graph reasoning to fuse features at different scales. The graph reasoning architecture comprises two components: the Cross Graph Convolution Feature Fusion Module (CGCFFM) and the Edge-aware Graph Attention Module (EGAM). CGCFFM realizes adaptive fusion of visual and noise features by performing cross spatial graph convolution and channel graph convolution operations. The integration of two-stream features is facilitated by graph convolution, which enables the mapping of the features to a novel low-dimensional space. This space is capable of modeling long-range contextual relationships. EGAM applies binary edge information to the graph attention adjacency matrix. This enables the network to explicitly model the inconsistency between the tampered region and non-tampered region. EGAM effectively suppresses the interference of the non-tampered region to the tampered region, thereby improves the network’s ability to recognize tampered regions. To validate the performance of EGRNet, extensive experiments are conducted on several challenging benchmark datasets. The experimental results indicate that our method outperforms current state-of-the-art image manipulation localization methods in both qualitative and quantitative evaluations. Furthermore, EGRNet demonstrated strong adaptability to different types of tampering and robustness to various attacks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104490"},"PeriodicalIF":3.5,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145018653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}