Adaptive bias learning via gradient-based reweighting and constrained pruning for robust Visual Question Answering
Zukun Wan, Runmin Wang, Xingdong Song, Juan Xu, Xiaofei Cao, Jielei Hei, Shengrong Yuan, Yajun Ding, Changxin Gao
Computer Vision and Image Understanding, Vol. 260, Article 104484. Published 2025-08-28. DOI: 10.1016/j.cviu.2025.104484
Abstract: Visual Question Answering (VQA) presents significant challenges in cross-modal reasoning due to susceptibility to dataset biases, spurious correlations, and shortcut learning, which undermine model robustness. While ensemble methods mitigate bias via joint optimization of a bias model and a target model during training, their efficacy remains limited by suboptimal bias exploitation and model capacity imbalances. To address this, we propose the Adaptive Bias Learning Network (ABLNet), a novel framework that systematically enhances bias capture for improved generalization. Our approach introduces two key innovations: (1) gradient-driven sample reweighting, which quantifies per-sample bias magnitude via training gradients and prioritizes low-bias samples to refine bias model training; (2) constrained network pruning, which deliberately restricts the bias model's capacity to amplify its focus on bias patterns. Extensive evaluations on the VQA-CPv1, VQA-CPv2, and VQA-v2 benchmarks confirm ABLNet's superiority and demonstrate generalizability across diverse question types. The code will be released at https://github.com/runminwang/ABLNet.

Multimodal vs. unimodal approaches to uncertainty in 3D image segmentation under distribution shifts
Masoumeh Javanbakhat, Md Tasnimul Hasan, Christoph Lippert
Computer Vision and Image Understanding, Vol. 260, Article 104473. Published 2025-08-28. DOI: 10.1016/j.cviu.2025.104473
Abstract: Machine learning has been widely adopted across sectors, yet its application in medical imaging remains challenging due to distribution shifts in real-world data. Deployed models often encounter samples that differ from the training dataset, particularly in the health domain, leading to performance degradation. This limitation hinders the expressiveness and reliability of deep learning models in health applications. It is therefore crucial to identify methods that produce reliable uncertainty estimates under distribution shift in the health sector. In this paper, we explore the feasibility of using cutting-edge Bayesian and non-Bayesian methods to detect distributionally shifted samples, aiming to achieve reliable and trustworthy diagnostic predictions in segmentation tasks. Specifically, we compare three distinct uncertainty estimation methods, each designed to capture either unimodal or multimodal aspects of the posterior distribution. Our findings demonstrate that methods capable of capturing multimodal characteristics of the posterior distribution offer more dependable uncertainty estimates. This research contributes to enhancing the utility of deep learning in healthcare, making diagnostic predictions more robust and trustworthy.

GaitBranch: A multi-branch refinement model combined with frame-channel attention mechanism for gait recognition
Huakang Li, Yidan Qiu, Huimin Zhao, Jin Zhan, Rongjun Chen, Jinchang Ren, Ying Gao, Wing W.Y. Ng
Computer Vision and Image Understanding, Vol. 260, Article 104463. Published 2025-08-25. DOI: 10.1016/j.cviu.2025.104463
Abstract: Accurately representing human motion in video-based gait recognition is challenging because it is difficult to obtain an ideal gait silhouette sequence that captures comprehensive information. To address this challenge, we propose GaitBranch, a novel method that emphasizes local key information of human motion in different layers of the neural network. It divides the network into multiple branches using the multi-branch refinement (MBR) module and extracts local key frames from various body parts through the frame-channel attention mechanism (FCAM) to form a comprehensive representation of human motion patterns. GaitBranch achieves high gait recognition accuracy on the CASIA-B (98.6%, 96.1%, and 85.5% under the normal-walking, bag-carrying, and coat-wearing conditions, respectively), OU-MVLP (92.3%), and GREW (79.8%) datasets, demonstrating its robustness across different environments. Ablation experiments confirm the efficacy of our method and show that the performance gains result from the optimized model structure rather than simply increased parameters.

{"title":"Unified learning for image–text alignment via multi-scale feature fusion","authors":"Jing Zhou , Meng Wang","doi":"10.1016/j.cviu.2025.104468","DOIUrl":"10.1016/j.cviu.2025.104468","url":null,"abstract":"<div><div>Cross-modal retrieval, particularly image–text retrieval, aims to achieve efficient matching and retrieval between images and text. With the continuous advancement of deep learning technologies, numerous innovative models and algorithms have emerged. However, existing methods still face some limitations: (1) Most models overly focus on either global or local correspondences, failing to fully integrate global and local information; (2) They typically emphasize cross-modal similarity optimization while neglecting the relationships among samples within the same modality; (3) They struggle to effectively handle noise in image–text pairs, negatively impacting model performance due to noisy negative samples. To address these challenges, this paper proposes a dual-branch structured model that combines global and local matching—Momentum-Augmented Transformer Encoder (MATE). The model aligns closely with human cognitive processes by integrating global and local features and leveraging an External Spatial Attention aggregation (ESA) mechanism and a Multi-modal Fusion Transformer Encoder, significantly enhancing feature representation capabilities. Furthermore, this work introduces a Hard Enhanced Contrastive Triplet Loss (HECT Loss), which effectively optimizes the model’s ability to distinguish positive and negative samples. A self-supervised learning method based on momentum distillation is also employed to further improve image–text matching performance. The experimental results demonstrate that the MATE model outperforms the vast majority of existing state-of-the-art methods on both Flickr30K and MS-COCO datasets. The code is available at <span><span>https://github.com/wangmeng-007/MATE/tree/master</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104468"},"PeriodicalIF":3.5,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144912623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context perturbation: A Consistent alignment approach for Domain Adaptive Semantic Segmentation
Meiqin Liu, Zilin Wang, Chao Yao, Yao Zhao, Wei Wang, Yunchao Wei
Computer Vision and Image Understanding, Vol. 260, Article 104464. Published 2025-08-25. DOI: 10.1016/j.cviu.2025.104464
Abstract: Domain Adaptive Semantic Segmentation (DASS) aims to adapt a pre-trained segmentation model from a labeled source domain to an unlabeled target domain. Previous approaches usually address the domain gap with consistency regularization over augmented data. However, because the augmentations are often simple linear transformations applied at the input level, the resulting feature representations receive only limited perturbation from these augmented views, making them less effective for cross-domain consistency learning. In this work, we propose a new augmentation method, contextual augmentation, and combine it with contrastive learning at both the pixel and class levels to achieve consistency regularization. We term this methodology Context Perturbation for DASS (CoPDASeg). Specifically, contextual augmentation first combines domain information by class mix and then randomly crops two patches with an overlapping region. To achieve consistency regularization with the two augmented patches, we consider both pixel and class perspectives and propose two parallel contrastive learning paradigms: pixel-level contrastive learning and class-level contrastive learning. The former aligns pixel-to-pixel feature representations; the latter aligns class prototypes across domains. Experimental results on representative benchmarks (GTA5 → Cityscapes and SYNTHIA → Cityscapes) demonstrate that CoPDASeg improves segmentation performance over the state of the art by a large margin.

Adaptive illumination and noise-free detail recovery via visual decomposition for low-light image enhancement
Tianqi Li, Pingping Liu, Qiuzhan Zhou, Tongshun Zhang
Computer Vision and Image Understanding, Vol. 260, Article 104466. Published 2025-08-20. DOI: 10.1016/j.cviu.2025.104466
Abstract: Existing low-light image enhancement methods often struggle with precise brightness control and frequently introduce noise during the enhancement process. To address these limitations, we propose BVILLIE, a novel biologically inspired visual model. BVILLIE employs a visual decomposition network that separates low-light images into low-frequency and high-frequency components, with the low-frequency path focused on brightness management and the high-frequency path enhancing details without amplifying noise. In the low-frequency path, inspired by the biological visual system's adaptive response to varying light conditions, BVILLIE incorporates a custom-designed luminance curve based on the Naka–Rushton equation. This equation models the nonlinear response of retinal neurons to light intensity, simulating human perceptual adaptation to different brightness levels. Additionally, a convolutional enhancement module corrects color shifts resulting from luminance adjustments. In the high-frequency path, an innovative fusion module integrates a preliminary denoiser with an adaptive enhancement mechanism to improve detail preservation and texture refinement. Extensive experiments across multiple benchmark datasets demonstrate that BVILLIE significantly outperforms state-of-the-art techniques. For instance, on the LOLv2-Real dataset, BVILLIE achieves a PSNR of 25.335 dB, SSIM of 0.866, LPIPS of 0.106, and LOE of 0.208. These results, consistently observed across various metrics, highlight BVILLIE's superior performance in image quality, perceptual similarity, preservation of lightness order, detail enhancement, and noise suppression.

Physics-guided human interaction generation via motion diffusion model
Dahua Gao, Wenlong Wang, Xinyu Liu, Yuxi Hu, Danhua Liu
Computer Vision and Image Understanding, Vol. 260, Article 104470. Published 2025-08-20. DOI: 10.1016/j.cviu.2025.104470
Abstract: Denoising diffusion models have significantly boosted the generation of two-person interactions conditioned on textual descriptions. However, due to the complexity of interactions and the diversity of textual descriptions, motion generation still faces two critical challenges: self-induced motion and error accumulation that grows with the number of denoising steps. To address these issues, we propose a novel physics-guided human interaction generation framework based on a motion diffusion model, named PhyInter. It can synthesize contextually appropriate motion, automatically learn the dynamic state of the other participant without additional annotation, and optimize generation errors by guiding the next denoising step. Specifically, PhyInter integrates physical principles from two perspectives: (1) defining a stochastic differential equation based on human kinematics to model the physical states of the interaction; (2) employing an interactive attention module to share physical information between intra- and inter-human motions. Additionally, we design a sampling strategy that facilitates motion generation and avoids unnecessary computation, ensuring realistic, physically plausible interactions. Extensive experiments demonstrate that our method surpasses previous approaches on the InterHuman dataset, achieving state-of-the-art performance.

Adaptive DETR: A framework with dynamic sampling points and feature-guided adaptive attention updates
Botao Li, Huguang Yang, Chenglong Xia, Han Zheng, Aziguli Wulamu, Taohong Zhang
Computer Vision and Image Understanding, Vol. 260, Article 104481. Published 2025-08-20. DOI: 10.1016/j.cviu.2025.104481
Abstract: In recent years, DETR-based models have advanced object detection but still face key challenges: the encoder's high complexity and limited adaptability, and the decoder's slow convergence due to query initialization. We propose Adaptive DETR, a framework with dynamic sampling and adaptive feature encoding. First, we design an attention update strategy that computes weights based on image features, enhancing detection accuracy. Second, we enable dynamic adjustment of sampling points in deformable attention, improving adaptability in complex scenes. Finally, we optimize the decoder by performing attention between bounding-box and semantic queries during initialization, effectively injecting semantics, accelerating convergence, and improving localization. Experiments on COCO, UAVDT, VisDrone, and RSOD confirm that Adaptive DETR achieves superior accuracy and generalization with improved efficiency.

{"title":"EGLC: Enhancing Global Localization Capability for medical image segmentation","authors":"Yulong Wan , Dongming Zhou , Ran Yan","doi":"10.1016/j.cviu.2025.104471","DOIUrl":"10.1016/j.cviu.2025.104471","url":null,"abstract":"<div><div>Medical image segmentation plays a vital role in computer-aided diagnosis and treatment planning. Traditional convolutional networks excel at capturing local patterns, while Transformer-based models are effective at modeling global context. We observe that this advantage arises from the global model’s sensitivity to boundary information, whereas local modeling tends to focus on regional consistency. Based on this insight, we propose EGLC, a novel global-local collaborative segmentation framework. During global modeling, we progressively discard inattentive patches and apply wavelet transform to extract multi-frequency boundary features. These boundary features are then used as guidance to enhance local representations. To implement this strategy, we introduce a new encoder, Boundary PVT, which incorporates both global semantics and boundary cues. In the decoding phase, we design a Reverse Progressive Locality Decoder to redirect attention to the peripheral edges of the lesion, thereby improving boundary delineation. Extensive experiments on multiple public medical image datasets demonstrate that our EGLC framework consistently outperforms existing state-of-the-art methods, especially in preserving fine-grained boundary details. The proposed approach offers a promising direction for precise and robust medical image segmentation.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104471"},"PeriodicalIF":3.5,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144887228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PConvSRGAN: Real-world super-resolution reconstruction with pure convolutional networks","authors":"Zuopeng Zhao, Yumeng Gao, Bingbing Min, Xiaoran Miao, Jianfeng Hu, Ying Liu, Kanyaphakphachsorn Pharksuwan","doi":"10.1016/j.cviu.2025.104465","DOIUrl":"10.1016/j.cviu.2025.104465","url":null,"abstract":"<div><div>Image super-resolution (SR) reconstruction technology faces numerous challenges in real-world applications: image degradation types are diverse, complex, and unknown; the diversity of imaging devices increases the complexity of image degradation in the super-resolution reconstruction process; SR requires substantial computational resources, especially with the latest significantly effective Transformer-based SR methods. To address these issues, we improved the ESRGAN model by implementing the following: first, a probabilistic degradation model was added to simulate the degradation process, preventing overfitting to specific degradations; second, BiFPN was introduced in the generator to fuse multi-scale features; lastly, inspired by the ConvNeXt network, the discriminator was redesigned as a pure convolutional network built entirely from standard CNN modules, which matches Transformer performance across various aspects. Experimental results demonstrate that our approach achieves the best PI and LPIPS performance compared to state-of-the-art SR methods, with PSNR,SSIM and NIQE being on par. Visualization results show that our method not only generates natural SR images but also excels in restoring structures.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104465"},"PeriodicalIF":3.5,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144887564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}