Calibrated gradient descent of convolutional neural networks for embodied visual recognition
Zhiming Wang, Sheng Xu, Li’an Zhuo, Baochang Zhang, Yanjing Li, Zhenqian Wang, Guodong Guo
Image and Vision Computing, Vol. 160, Article 105568 (published 2025-05-08). DOI: 10.1016/j.imavis.2025.105568
Abstract: Embodied visual computing seeks to learn from the real world, which requires efficient machine learning methods. In conventional stochastic gradient descent (SGD) and its variants, the gradient estimators are expensive to compute in many scenarios. This paper introduces a calibrated gradient descent (CGD) algorithm for efficient deep neural network optimization. A theorem is developed to prove that an unbiased estimator of the network parameters can be obtained in a probabilistic way under the Lipschitz hypothesis. We implement the CGD algorithm on top of the widely used SGD and ADAM optimizers and obtain a generic gradient calibration layer (GCLayer) that improves the performance of convolutional neural networks (C-CNNs). The GCLayer introduces extra parameters only during training and does not affect the efficiency of inference. The method is generic and effective for optimizing both CNNs and quantized neural networks (C-QNNs). Extensive experiments demonstrate state-of-the-art performance on a variety of tasks; for example, the 1-bit Faster R-CNN obtained with C-QNN reaches 20.5% mAP on COCO, a new state of the art. This work brings new insights for developing more efficient optimizers and for analyzing the back-propagation algorithm.

UncertainBEV: Uncertainty-aware BEV fusion for roadside 3D object detection
Jianqiang Xu, Chunying Song, Chao Shi, Huafeng Liu, Qiong Wang
Image and Vision Computing, Vol. 159, Article 105567 (published 2025-05-07). DOI: 10.1016/j.imavis.2025.105567
Abstract: With the rapid development of autonomous driving technology and intelligent transportation systems, multimodal fusion-based Bird’s-Eye-View (BEV) perception has become a key technique for environmental understanding. However, existing methods suffer from feature misalignment caused by calibration errors between different sensors, ultimately limiting the effectiveness of multimodal fusion. In this paper, we propose a robust roadside BEV perception framework, named UncertainBEV. To address feature misalignment caused by projection inaccuracies between LiDAR and camera sensors, we introduce a novel module called UncertainFuser, which models the uncertainty of both camera and LiDAR features to dynamically adjust fusion weights, thereby mitigating feature misalignment. Additionally, we optimize the sparse voxel pooling module and design a multi-head attention mechanism to enhance the quality of BEV features from both modalities. Built upon the CUDA-V2XFusion and BEVFusion frameworks, our proposed UncertainBEV achieves state-of-the-art performance on the DAIR-V2X-I dataset, with 3D mean Average Precision (mAP) improvements of 2.88%, 7.73%, and 3.68% for vehicles, pedestrians, and cyclists, respectively. Our code has been open-sourced (UncertainBEV).

A landmarks-assisted diffusion model with heatmap-guided denoising loss for high-fidelity and controllable facial image generation
Xing Wang, Wei Wang, Shixiang Su, Mingqi Lu, Lei Zhang, Xiaobo Lu
Image and Vision Computing, Vol. 159, Article 105545 (published 2025-05-07). DOI: 10.1016/j.imavis.2025.105545
Abstract: Diffusion models have significantly advanced image generation, enabling users to create diverse and realistic images from simple prompts. However, generating high-fidelity, controllable facial images remains a challenge due to the intricate details of human faces. In this paper, we present a novel diffusion model for landmarks-assisted text-to-face generation that directly incorporates landmarks as guidance during the diffusion process. To address the issue of global information degradation caused by fine-tuning with local information, we introduce a heatmap-guided denoising loss that selectively focuses on the feature pixels most relevant to the conditioning. This biased learning strategy ensures that the model prioritizes shape and positional information, preventing excessive deterioration of its generalization ability. Unlike existing methods relying on an extra learnable branch for conditional control, our native method eliminates the conflicts inherent in dual-branch architectures when dealing with various conditions. It also enables precise manipulation of facial features, such as shape and position. Extensive experiments on the CelebA-HQ and CelebAText-HQ datasets show that our method demonstrates superior performance in generating realistic and controllable facial images, outperforming existing methods in terms of fidelity, diversity, and alignment with specified landmarks.

Incremental structural adaptation for camouflaged object detection
Qingzheng Wang, Jiazhi Xie, Ning Li, Xingqin Wang, Wenhui Liu, Zengwei Mai
Image and Vision Computing, Vol. 159, Article 105565 (published 2025-05-05). DOI: 10.1016/j.imavis.2025.105565
Abstract: Camouflaged Object Detection (COD) is a challenging task due to the similarity between camouflaged objects and their backgrounds. Recent approaches predominantly utilize structural cues but often struggle with misinterpretations and noise, particularly for small objects. To address these issues, we propose the Structure-Adaptive Network (SANet), which incrementally supplements structural information from points to surfaces. Our method includes the Key Point Structural Information Prompting Module (KSIP) to enhance point-level structural information, Mixed-Resolution Attention (MRA) to incorporate high-resolution details, and the Structural Adaptation Patch (SAP) to selectively integrate high-resolution patches based on the shape of the camouflaged object. Experimental results on three widely used COD datasets demonstrate that SANet significantly outperforms state-of-the-art methods, achieving more accurate localization and finer edge segmentation while minimizing background noise. Our code is available at https://github.com/vstar37/SANet/.

FastTalker: Real-time audio-driven talking face generation with 3D Gaussian
Keliang Chen, Zongze Li, Fang Cui, Mao Ni, Shaoying Wang, Junlin Che, Feng Liu, Yonggang Qi, Fangwei Zhang, Jun Liu, Gan Guo, Rongrong Fu, Yunxia Huang
Image and Vision Computing, Vol. 159, Article 105573 (published 2025-05-04). DOI: 10.1016/j.imavis.2025.105573
Abstract: The performance of 3D talking head generation has shown significant improvement over the past few years. Nevertheless, real-time rendering remains a challenge that needs to be overcome. To address this issue, we present the FastTalker framework, which uses 3D Gaussian Splatting (3DGS) for talking head generation. The method introduces an audio-driven Dynamic Neural Skinning (DNS) approach to facilitate flexible and high-fidelity talking head video generation. It first employs an adaptive FLAME mesh for sampling to obtain the initialized 3DGS. Then, the DNS networks are used to account for the appearance changes of the 3DGS. Finally, a pre-trained Audio Motion Net is utilized to model facial movements as the final dynamic driving facial signal. Experimental results demonstrate that FastTalker offers a rendering speed exceeding 100 FPS, making it the fastest audio-driven talking head generation method in terms of inference efficiency.

Self-distillation guided Semantic Knowledge Feedback network for infrared–visible image fusion
Wei Zhou, Yingyuan Wang, Lina Zuo, Dan Ma, Yugen Yi
Image and Vision Computing, Vol. 159, Article 105566 (published 2025-05-03). DOI: 10.1016/j.imavis.2025.105566
Abstract: Infrared–visible image fusion combines complementary information from both modalities to enhance visual quality and support downstream tasks. However, existing methods typically enhance semantic information by designing fusion functions for the source images and combining them with a downstream network, overlooking the optimization and guidance of the fused image itself. This neglect weakens the semantic knowledge within the fused image, limiting its alignment with task objectives and reducing accuracy in downstream tasks. To overcome these limitations, we propose the self-distillation guided Semantic Knowledge Feedback (SKFFusion) network, which extracts semantic knowledge from the fused image and feeds it back to iteratively optimize the fusion process, addressing the lack of semantic guidance. Specifically, we introduce shallow-to-deep feature fusion modules, including Shallow Texture Fusion (STF) and Deep Semantic Fusion (DSF), to integrate fine-grained details and high-level semantics. The STF uses channel and spatial attention mechanisms to aggregate detailed multi-modal information, while the DSF leverages a Mamba structure to capture long-range dependencies, enabling deeper cross-modal semantic fusion. Additionally, we design a CNN-Transformer-based Knowledge Feedback Network (KFN) to extract local detail features and capture global dependencies. A Semantic Attention Guidance (SAG) module further refines the fused image’s semantic representation, aligning it with task objectives. Finally, a distillation loss provides more robust training and better image quality. Experimental results show that SKFFusion outperforms existing methods in visual quality and vision-task performance, particularly under challenging conditions such as low light and fog. Our code is available at https://github.com/yyzzttkkjj/SKFFusion.

A neighbor-aware feature enhancement network for crowd counting
Lin Wang, Jie Li, Chun Qi, Xuan Wu, Runrun Zou, Fengping Wang, Pan Wang
Image and Vision Computing, Vol. 159, Article 105578 (published 2025-05-03). DOI: 10.1016/j.imavis.2025.105578
Abstract: Deep neural networks have achieved significant progress in the field of crowd counting in recent years. However, many networks still face challenges in effectively representing crowd features due to the insufficient exploitation of inter-channel and inter-pixel relationships. To overcome these limitations, we propose the Neighbor-Aware Feature Enhancement Network (NAFENet), a novel architecture designed to strengthen feature representation by adequately leveraging both channel and pixel dependencies. Specifically, we introduce two modules to model channel dependencies: the Across Channel Attention Module (ACAM) and the Channel Residual Module (CRM). ACAM computes a relevance map to quantify the influence of adjacent channels on the current channel and extracts valuable information to enrich the feature representation. On the other hand, CRM learns the residual maps between adjacent channels to capture their correlations and differences, enabling the network to gain a deeper understanding of the image content. In addition, we embed a Spatial Correlation Module (SCM) in NAFENet to model long-range dependencies between pixels across neighboring rows, analyzing long continuous structures more effectively. Experimental results on six challenging datasets demonstrate that the proposed method achieves impressive performance compared to state-of-the-art models. Complexity analysis further reveals that our model is more efficient, requiring less time and fewer computational resources than other approaches.

{"title":"SinWaveFusion: Learning a single image diffusion model in wavelet domain","authors":"Jisoo Kim , Jiwoo Kang , Taewan Kim , Heeseok Oh","doi":"10.1016/j.imavis.2025.105551","DOIUrl":"10.1016/j.imavis.2025.105551","url":null,"abstract":"<div><div>Although recent advancements in large-scale image generation models have substantially improved visual fidelity and reliability, current diffusion models continue to encounter significant challenges in maintaining stylistic consistency with the original images. These challenges stem primarily from the intrinsic stochastic nature of the diffusion process, leading to noticeable variability and inconsistency in edited outputs. To address these challenges, this paper proposes a novel framework termed <em>single image wavelet diffusion (SinWaveFusion)</em>, explicitly designed to enhance the consistency and fidelity in generating images derived from a single source image while also mitigating information leakage. SinWaveFusion addresses generative artifacts by employing the multi-scale properties inherent in wavelet decomposition, which incorporates a built-in up-down scaling mechanism. This approach enables refined image manipulation while enhancing stylistic coherence. The proposed diffusion model, trained exclusively on a single source image, utilizes the hierarchical structure of wavelet subbands to effectively capture spatial and spectral information in the sampling process, minimizing reconstruction loss and ensuring high-quality, diverse outputs. Moreover, the architecture of the denoiser features a reduced receptive field, strategically preventing the model from memorizing the entire training image and thereby offering additional computational efficiency benefits. Experimental results demonstrate that SinWaveFusion achieves improved performance in both conditional and unconditional generation compared to existing generative models trained on a single image.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"159 ","pages":"Article 105551"},"PeriodicalIF":4.2,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143912344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating end-to-end multimodal deep learning and domain adaptation for robust facial expression recognition","authors":"Mahmoud Hassaballah , Chiara Pero , Ranjeet Kumar Rout , Saiyed Umer","doi":"10.1016/j.imavis.2025.105548","DOIUrl":"10.1016/j.imavis.2025.105548","url":null,"abstract":"<div><div>This paper presents an advanced approach to a facial expression recognition (FER) system designed for robust performance across diverse imaging environments. The proposed method consists of four primary components: image preprocessing, feature representation and classification, cross-domain feature analysis, and domain adaptation. The process begins with facial region extraction from input images, including those captured in unconstrained imaging conditions, where variations in lighting, background, and image quality significantly impact recognition performance. The extracted facial region undergoes feature extraction using an ensemble of multimodal deep learning techniques, including end-to-end CNNs, BilinearCNN, TrilinearCNN, and pretrained CNN models, which capture both local and global facial features with high precision. The ensemble approach enriches feature representation by integrating information from multiple models, enhancing the system’s ability to generalize across different subjects and expressions. These deep features are then passed to a classifier trained to recognize facial expressions effectively in real-time scenarios. Since images captured in real-world conditions often contain noise and artifacts that can compromise accuracy, cross-domain analysis is performed to evaluate the discriminative power and robustness of the extracted deep features. FER systems typically experience performance degradation when applied to domains that differ from the original training environment. To mitigate this issue, domain adaptation techniques are incorporated, enabling the system to effectively adjust to new imaging conditions and improving recognition accuracy even in challenging real-time acquisition environments. The proposed FER system is validated using four well-established benchmark datasets: CK+, KDEF, IMFDB and AffectNet. Experimental results demonstrate that the proposed system achieves high performance within original domains and exhibits superior cross-domain recognition compared to existing state-of-the-art methods. These findings indicate that the system is highly reliable for applications requiring robust and adaptive FER capabilities across varying imaging conditions and domains.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"159 ","pages":"Article 105548"},"PeriodicalIF":4.2,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143899659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A2VIS: Amodal-Aware Approach to Video Instance Segmentation
Minh Tran, Thang Pham, Winston Bounsavy, Tri Nguyen, Ngan Le
Image and Vision Computing, Vol. 159, Article 105543 (published 2025-04-30). DOI: 10.1016/j.imavis.2025.105543
Abstract: Handling occlusion remains a significant challenge for video instance-level tasks such as Multiple Object Tracking (MOT) and Video Instance Segmentation (VIS). In this paper, we propose a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in a video. The key intuition is that awareness of amodal segmentation through the spatiotemporal dimension enables a stable stream of object information. In scenarios where objects are partially or completely hidden from view, amodal segmentation offers more consistency and less dramatic changes along the temporal axis than visible segmentation. Hence, both amodal and visible information from all clips can be integrated into one global instance prototype. To effectively address the challenge of video amodal segmentation, we introduce the spatiotemporal-prior Amodal Mask Head, which leverages visible information within clips while extracting amodal characteristics across clips. Through extensive experiments and ablation studies, we show that A2VIS excels in both MOT and VIS tasks, identifying and tracking object instances with a keen understanding of their full shape.
