{"title":"Local Gaussian ensemble for arbitrary-scale image super-resolution","authors":"Chuan Chen, Weiwei Wang, Xixi Jia, Xiangchu Feng, Hanjia Wei","doi":"10.1016/j.cviu.2025.104372","DOIUrl":"10.1016/j.cviu.2025.104372","url":null,"abstract":"<div><div>In arbitrary-scale image super-resolution (SR), the local coordinate information is pivotal to enhancing performance through local ensemble. The previous method local implicit image function (LIIF) reconstructs pixels by using multi-layer perceptron (MLP), then refines each pixel by a weighted summation of nearby pixels (also called local ensemble), where the weight depends on the distances between the query pixel and the nearby pixels. Since the distances are fixed, so is the weighting mechanism, limiting the effectiveness of local ensemble. Furthermore, the weighted summation involves repeated reconstructions, increasing the computational cost. Orthogonal position encoding SR (OPE-SR) reduces pixel reconstruction complexity using orthogonal position encoding. However, it still relies on LIIF’s local ensemble method. Additionally, lacking scale information, OPE-SR demonstrates unstable performance across various datasets and scale factors. In this paper, we propose to conduct local ensemble in feature domain, and we present a new ensemble method, the local Gaussian ensemble (LGE), to utilize the local coordinate information more flexibly and efficiently. Specifically, we introduce learnable anisotropic 2D Gaussians for each query coordinate in the SR image, transforming normalized coordinates of nearby features into multiple Gaussian weights to effectively ensemble local features. Then a scale-aware deep MLP is applied only once for pixel reconstruction. Extensive experiments demonstrate that our LGE significantly reduces computational costs during both training and inference while delivering performance comparable to the existing local ensemble method. Moreover, our method consistently outperforms the existing parameter-free approach in terms of efficiency and stability across various benchmark datasets and scale factors.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104372"},"PeriodicalIF":4.3,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143903853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RAFNet: Rotation-aware anchor-free framework for geospatial object detection","authors":"Liwei Deng , Yangyang Tan , Songyu Chen","doi":"10.1016/j.cviu.2025.104373","DOIUrl":"10.1016/j.cviu.2025.104373","url":null,"abstract":"<div><div>Object detection in remote sensing images plays a crucial role in applications such as disaster monitoring, and urban planning. However, detecting small and rotated objects in complex backgrounds remains a significant challenge. Traditional anchor-based methods, which rely on preset anchor boxes with fixed sizes and aspect ratios, face three core limitations: geometric mismatch (difficulty adapting to rotated objects and feature confusion caused by dense anchor boxes), missed detection of small objects (feature loss due to the decoupling between anchor boxes and feature map strides), and parameter sensitivity (requiring complex anchor box combinations for multi-scale targets).</div><div>To address these challenges, this paper proposes an anchor-free detection framework, RAFNet, integrating three key innovations: Mona Swin Transformer as the backbone to enhance feature extraction, Rotated Feature Pyramid Network (Rotated FPN) for rotation-aware feature representation, and Local Importance-based Attention (LIA) mechanism to focus on critical regions and improve object feature representation. Extensive experiments on the DOTA1.0 dataset demonstrate that RAFNet achieves a mean Average Precision (mAP) of 74.91, outperforming baseline models by 3.24%, with significant improvements in challenging categories such as helicopters (+32.5% AP) and roundabouts (+4% AP). The model achieves the mAP of 30.29% on the STAR dataset, validating its high adaptability and robustness in generalization tasks. These results highlight the effectiveness of the proposed method in detecting small, rotated objects in complex scenes. RAFNet offers a more flexible, efficient, and generalizable solution for remote sensing object detection, underscoring the great potential of anchor-free approaches in this field.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104373"},"PeriodicalIF":4.3,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143894527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Classroom teacher behavior analysis: The TBU dataset and performance evaluation","authors":"Ting Cai , Yu Xiong , Chengyang He , Chao Wu , Linqin Cai","doi":"10.1016/j.cviu.2025.104376","DOIUrl":"10.1016/j.cviu.2025.104376","url":null,"abstract":"<div><div>Classroom videos are objective records of teaching behaviors, which provide evidence for teachers’ teaching reflection and evaluation. The intelligent identification, tracking and description of teacher teaching behavior based on classroom videos have become a research hotspot in the field of intelligent education to understand the teaching process of teachers. Although the recent attempts propose several promising directions for the analysis of teaching behavior, the existing public datasets are still insufficient to meet the need for these potential solutions due to lack of varied classroom environment, fine-grained teaching scene behavior data. To address this, we analyzed the influencing factors of teacher behavior and related video datasets, and constructed a diverse, scenario-specific, and multi-task dataset named TBU for Teacher Behavior Understanding. The TBU contains 37,026 high-quality teaching behavior clips, 9422 annotated teaching behavior clips with precise time boundaries, and 6098 teacher teaching behavior description clips annotated with multi-level atomic action labels of fine-grained behavior, spatial location, and interactive objects in four education stages. We performed a comprehensive statistical analysis of TBU and summarized the behavioral characteristics of teachers at different educational stages. Additionally, we systematically investigated representative methods for three video understanding tasks on TBU: behavior recognition, behavior detection, and behavior description, providing a benchmark for the research towards a more comprehensive understanding of teaching video data. Considering the specificity of classroom scenarios and the needs of teaching behavior analysis, we put forward new requirements for the existing baseline methods. We believe that TBU can facilitate in-depth research on classroom teacher teaching video analysis. TBU is available at: <span><span>https://github.com/cai-KU/TBU</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104376"},"PeriodicalIF":4.3,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143885973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Convolutional neural network framework for deepfake detection: A diffusion-based approach","authors":"Emmanuel Pintelas , Ioannis E. Livieris","doi":"10.1016/j.cviu.2025.104375","DOIUrl":"10.1016/j.cviu.2025.104375","url":null,"abstract":"<div><div>In the rapidly advancing domain of synthetic media, DeepFakes emerged as a potent tool for misinformation and manipulation. Nevertheless, the engineering challenge lies in detecting such content to ensure information integrity. Recent artificial intelligence contributions in deepfake detection have mainly concentrated around sophisticated convolutional neural network models, which derive insights from facial biometrics, including multi-attentional and multi-view mechanisms, pairwise/siamese, distillation learning technique and facial-geometry approaches. In this work, we consider a new diffusion-based neural network approach, rather than directly analyzing deepfake images for inconsistencies. Motivated by the considerable property of diffusion procedure of unveiling anomalies, we employ diffusion of the inherent structure of deepfake images, seeking for patterns throughout this process. Specifically, the proposed diffusion network, iteratively adds noise to the input image until it almost becomes pure noise. Subsequently, a convolutional neural network extracts features from the final diffused state, as well as from all transient states of the diffusion process. The comprehensive experimental analysis demonstrates the efficacy and adaptability of the proposed model, validating its robustness against a wide range of deepfake detection models, being a promising artificial intelligence tool for DeepFake detection.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104375"},"PeriodicalIF":4.3,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143885974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Few-shot object detection via synthetic features with optimal transport","authors":"Anh-Khoa Nguyen Vu , Thanh-Toan Do , Vinh-Tiep Nguyen , Tam Le , Minh-Triet Tran , Tam V. Nguyen","doi":"10.1016/j.cviu.2025.104350","DOIUrl":"10.1016/j.cviu.2025.104350","url":null,"abstract":"<div><div>Few-shot object detection aims to simultaneously localize and classify the objects in an image with limited training samples. Most existing few-shot object detection methods focus on extracting the features of a few samples of novel classes, which can lack diversity. Consequently, they may not sufficiently capture the data distribution. To address this limitation, we propose a novel approach that trains a generator to produce synthetic data for novel classes. Still, directly training a generator on the novel class is ineffective due to the scarcity of novel data. To overcome this issue, we leverage the large-scale dataset of base classes by training a generator that captures the data variations of the dataset. Specifically, we train the generator with an optimal transport loss that minimizes the distance between the real and synthetic data distributions, which encourages the generator to capture data variations in base classes. We then transfer the captured variations to novel classes by generating synthetic data with the trained generator. Extensive experiments on benchmark datasets demonstrate that the proposed method outperforms the state of the art.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104350"},"PeriodicalIF":4.3,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143868820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A vector quantized masked autoencoder for audiovisual speech emotion recognition","authors":"Samir Sadok , Simon Leglaive, Renaud Séguier","doi":"10.1016/j.cviu.2025.104362","DOIUrl":"10.1016/j.cviu.2025.104362","url":null,"abstract":"<div><div>An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose the VQ-MAE-AV model, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders that compress raw audio and visual speech data into discrete tokens. The audiovisual speech tokens are used to train a multimodal masked autoencoder that consists of an encoder–decoder architecture with attention mechanisms. The model is designed to extract both local (i.e., at the frame level) and global (i.e., at the sequence level) representations of audiovisual speech. During self-supervised pre-training, the VQ-MAE-AV model is trained on a large-scale unlabeled dataset of audiovisual speech, for the task of reconstructing randomly masked audiovisual speech tokens and with a contrastive learning strategy. During this pre-training, the encoder learns to extract a representation of audiovisual speech that can be subsequently leveraged for emotion recognition. During the supervised fine-tuning stage, a small classification model is trained on top of the VQ-MAE-AV encoder for an emotion recognition task. The proposed approach achieves state-of-the-art emotion recognition results across several datasets in both controlled and in-the-wild conditions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104362"},"PeriodicalIF":4.3,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143852177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A deep reinforcement active learning method for multi-label image classification","authors":"Qing Cai , Ran Tao , Xiufen Fang , Xiurui Xie , Guisong Liu","doi":"10.1016/j.cviu.2025.104351","DOIUrl":"10.1016/j.cviu.2025.104351","url":null,"abstract":"<div><div>Active learning is a widely used method for addressing the high cost of sample labeling in deep learning models and has achieved significant success in recent years. However, most existing active learning methods only focus on single-label image classification and have limited application in the context of multi-label images. To address this issue, we propose a novel, multi-label active learning approach based on a reinforcement learning strategy. The proposed approach introduces a reinforcement active learning framework that accounts for the expected error reduction in multi-label images, making it adaptable to multi-label classification models. Additionally, we develop a multi-label reinforcement active learning module (MLRAL), which employs an actor-critic strategy and proximal policy optimization algorithm (PPO). Our state and reward functions consider multi-label correlations to accurately evaluate the potential impact of unlabeled samples on the current model state. We conduct experiments on various multi-label image classification tasks, including the VOC 2007, MS-COCO, NUS-WIDE and ODIR. We also compare our method with multiple classification models, and experimental results show that our method outperforms existing approaches on various tasks, demonstrating the superiority and effectiveness of the proposed method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104351"},"PeriodicalIF":4.3,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143834222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structure perception and edge refinement network for monocular depth estimation","authors":"Shuangquan Zuo , Yun Xiao , Xuanhong Wang , Hao Lv , Hongwei Chen","doi":"10.1016/j.cviu.2025.104348","DOIUrl":"10.1016/j.cviu.2025.104348","url":null,"abstract":"<div><div>Monocular depth estimation is fundamental for scene understanding and visual downstream tasks. In recent years, with the development of deep learning, increasing complex networks and powerful mechanisms have significantly improved the performance of monocular depth estimation. Nevertheless, predicting dense pixel depths from a single RGB image remains challenging due to the ill-posed issues and inherent ambiguity. Two unresolved issues persist: (1) Depth features are limited in perceiving the scene structure accurately, leading to inaccurate region estimation. (2) Low-level features, which are rich in details, are not fully utilized, causing the missing of details and ambiguous edges. The crux to accurate dense depth restoration is to efficiently handle global scene structure as well as local details. To solve these two issues, we propose the Scene perception and Edge refinement network for Monocular Depth Estimation (SE-MDE). Specifically, we carefully design a depth-enhanced encoder (DEE) to effectively perceive the overall structure of the scene while refining the feature responses of different regions. Meanwhile, we introduce a dense edge-guided network (DENet) that maximizes the utilization of low-level features to enhance the depth of details and edges. Extensive experiments validate the effectiveness of our method, with several experimental results on the NYU v2 indoor dataset and KITTI outdoor dataset demonstrate the state-of-the-art performance of the proposed method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"256 ","pages":"Article 104348"},"PeriodicalIF":4.3,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143815364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning temporal-aware representation for controllable interventional radiology imaging","authors":"Wei Si , Zhaolin Zheng , Zhewei Huang , Xi-Ming Xu , Ruijue Wang , Ji-Gang Bao , Qiang Xiong , Xiantong Zhen , Jun Xu","doi":"10.1016/j.cviu.2025.104360","DOIUrl":"10.1016/j.cviu.2025.104360","url":null,"abstract":"<div><div>Interventional Radiology Imaging (IRI) is essential for evaluating cerebral vascular anatomy by providing sequential images of both arterial and venous blood flow. In IRI, the low frame rate (4 fps) during acquisition can lead to discontinuities and flickering, whereas higher frame rates are associated with increased radiation exposure. Nevertheless, under complex blood flow conditions, it becomes necessary to increase the frame rate to 15 fps for the second sampling. Previous methods relied solely on fixed frame interpolation to mitigate discontinuities and flicker. However, owing to frame rate constraints, they were ineffective in addressing the high radiation issues arising from complex blood flow conditions. In this study, we introduce a novel approach called Temporally Controllable Network (TCNet), which innovatively applies controllable frame interpolation techniques to IRI for the first time. Our method effectively tackles the issues of discontinuity and flickering arising from low frame rates and mitigates the radiation concerns linked to higher frame rates during second sampling. Our method emphasizes synthesizing intermediate frame features via a Temporal-Aware Representation Learning (TARL) module and optimizes this process through bilateral optical flow supervision for accurate optical flow estimation. Additionally, to enhance the depiction of blood vessel motion and breathing nuances, we introduce an implicit function module for refining motion cues in videos. Our experiments reveal that TCNet successfully generate videos at clinically appropriate frame rates, significantly improving the reconstruction of blood flow and respiratory patterns. We will publicly release our code and datasets.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104360"},"PeriodicalIF":4.3,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143834164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extensions in channel and class dimensions for attention-based knowledge distillation","authors":"Liangtai Zhou, Weiwei Zhang, Banghui Zhang, Yufeng Guo, Junhuang Wang, Xiaobin Li, Jianqing Zhu","doi":"10.1016/j.cviu.2025.104359","DOIUrl":"10.1016/j.cviu.2025.104359","url":null,"abstract":"<div><div>As knowledge distillation technology evolves, it has bifurcated into three distinct methodologies: logic-based, feature-based, and attention-based knowledge distillation. Although the principle of attention-based knowledge distillation is more intuitive, its performance lags behind the other two methods. To address this, we systematically analyze the advantages and limitations of traditional attention-based methods. In order to optimize these limitations and explore more effective attention information, we expand attention-based knowledge distillation in the channel and class dimensions, proposing Spatial Attention-based Knowledge Distillation with Channel Attention (SAKD-Channel) and Spatial Attention-based Knowledge Distillation with Class Attention (SAKD-Class). On CIFAR-100, with ResNet8<span><math><mo>×</mo></math></span>4 as the student model, SAKD-Channel improves Top-1 validation accuracy by 1.98%, and SAKD-Class improves it by 3.35% compared to traditional distillation methods. On ImageNet, using ResNet18, these two methods improve Top-1 validation accuracy by 0.55% and 0.17%, respectively, over traditional methods. We also conduct extensive experiments to investigate the working mechanisms and application conditions of channel and class dimensions knowledge distillation, providing new theoretical insights for attention-based knowledge transfer.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104359"},"PeriodicalIF":4.3,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143838732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}