{"title":"FFENet: A frequency fusion and enhancement network for camouflaged object detection","authors":"Haishun Du , Wenzhe Zhang , Sen Wang , Zhengyang Zhang , Linbing Cao","doi":"10.1016/j.imavis.2025.105733","DOIUrl":"10.1016/j.imavis.2025.105733","url":null,"abstract":"<div><div>The goal of camouflaged object detection (COD) is to accurately find camouflaged objects hidden in their surroundings. Although most of the existing frequency-domain based COD models can boost the performance of COD to a certain extent by utilizing the frequency domain information, the frequency feature fusion strategies they adopt tend to ignore the complementary effects between high-frequency features and low-frequency features. In addition, most of the existing frequency-domain based COD models also do not consider enhancing camouflaged objects using low-level frequency-domain features. In order to solve these problems, we present a frequency fusion and enhancement network (FFENet) for camouflaged object detection, which mainly includes three stages. In the frequency feature extraction stage, we design a frequency feature learning module (FLM) to extract corresponding high-frequency features and low-frequency features. In the frequency feature fusion stage, we design a frequency feature fusion module (FFM) that can increase the representation ability of the fused features by adaptively assigning weights to the high-frequency features and the low-frequency features using a cross-attention mechanism. In the frequency feature guidance information enhancement stage, we design a frequency feature guidance information enhancement module (FGIEM) to enhance the contextual information and detail information of camouflaged objects in the fused features under the guidance of the low-level frequency features. Extensive experimental results on the COD10K, CHAMELEON, NC4K and CAMO datasets show that our model is superior to most existing COD models.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105733"},"PeriodicalIF":4.2,"publicationDate":"2025-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145108704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning enhanced monocular visual odometry: Advancements in fusion mechanisms and training strategies","authors":"E. Simsek , B. Ozyer","doi":"10.1016/j.imavis.2025.105732","DOIUrl":"10.1016/j.imavis.2025.105732","url":null,"abstract":"<div><div>Recent advances in deep learning have revolutionized robotic applications such as 3D mapping, visual navigation and autonomous control. Monocular Visual Odometry (MVO) represents a critical advancement in autonomous systems, particularly drones, utilizing single-camera setups to navigate complex environments effectively. This review explores MVO’s evolution from traditional methods to its integration with cutting-edge technologies like deep learning and semantic understanding. In this study, we explore the latest training strategies, innovations in model architecture, and advanced fusion techniques used in hybrid models that combine depth and semantic information. A comprehensive literature review traces the evolution of MVO techniques, highlighting key datasets and performance metrics. Section 2 outlines the problem, while Section 3 reviews the studies, charting the evolution of MVO techniques predating the advent of deep learning. Section 4 details the methodology, focusing on cutting-edge training strategies, advancements in architectural designs, and fusion techniques in hybrid models integrating depth and semantic information. Finally, Section 5 summarizes findings, discusses implications, and suggests future research directions.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105732"},"PeriodicalIF":4.2,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Explicit Semantic Alignment Network for RGB-T salient object detection with Hierarchical Cross-Modal Fusion","authors":"Hongkuan Wang, Qingxi Yu, Zhenguang Di, Gang Yang","doi":"10.1016/j.imavis.2025.105730","DOIUrl":"10.1016/j.imavis.2025.105730","url":null,"abstract":"<div><div>Existing RGB-T salient object detection methods primarily rely on the learning mechanism of neural networks to perform implicit cross-modal feature alignment, aiming to achieve complementary fusion of modal features. However, this implicit feature alignment method has two main limitations: first, it is prone to causing loss of the salient object’s structural information; second, it may lead to abnormal activation responses that are not related to the object. To address the above issues, we propose the innovative Explicit Semantic Alignment (ESA) framework and design the Explicit Semantic Alignment Network for RGB-T Salient Object Detection with Hierarchical Cross-Modal Fusion (ESANet). Specifically, we design a Saliency-Aware Refinement Module (SARM), which fuses high-level semantic features with mid-level spatial details through cross-aggregation and the dynamic integration module to achieve bidirectional interaction and adaptive fusion of cross-modal features. It also utilizes a cross-modal multi-head attention mechanism to generate fine-grained shared semantic information. Subsequently, the Cross-Modal Feature Alignment Module (CFAM) introduces a window-based attention propagation mechanism, which enforces consistency in scene understanding between RGB and thermal modalities by using shared semantics as an alignment constraint. Finally, the Semantic-Guided Edge Sharpening Module (SESM) combines shared semantics with a weight enhancement strategy to optimize the consistency of shallow cross-modal feature distributions. Experimental results demonstrate that ESANet significantly outperforms existing state-of-the-art RGB-T salient object detection methods on three public datasets, validating its excellent performance in salient object detection tasks. Our code will be released at <span><span>https://github.com/whklearn/ESANet.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105730"},"PeriodicalIF":4.2,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating explainable AI with synthetic biometric data for enhanced image synthesis and privacy in computer vision systems","authors":"Hamad Aldawsari , Saad Alammar","doi":"10.1016/j.imavis.2025.105726","DOIUrl":"10.1016/j.imavis.2025.105726","url":null,"abstract":"<div><div>Integrating Explainable AI (XAI) with synthetic biometric data improves image synthesis and privacy in computer vision systems by generating high-quality images while ensuring interpretability. This integration enhances trust and transparency in AI-driven biometric applications. However, traditional biometric data collection methods face challenges such as privacy risks, data scarcity, biases, and regulatory constraints, limiting their effectiveness in authentication and identity verification. To address these limitations, we propose a Generative Adversarial Networks with Explainable AI (GAN-EAI) framework for privacy-preserving biometric image synthesis. This framework utilizes GANs to generate high-fidelity synthetic biometric images while incorporating XAI techniques to interpret and validate the generated outputs, ensuring fairness, robustness, and bias mitigation. The proposed method enables secure, privacy-conscious biometric image synthesis, making it suitable for applications in authentication, healthcare, and identity verification. By leveraging explainability, it ensures that the model's decision-making process is interpretable, reducing the risk of biased or adversarial outputs. Experimental results demonstrate that GAN-EAI achieves superior image quality, enhances privacy protection, and reduces bias in synthetic biometric datasets, making it a reliable solution for real-world biometric applications. This research highlights the potential of integrating explainability with generative models to advance privacy-preserving AI in computer vision.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105726"},"PeriodicalIF":4.2,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessing the noise robustness of Class Activation Maps: A framework for reliable model interpretability","authors":"Syamantak Sarkar , Revoti P. Bora , Bhupender Kaushal , Sudhish N. George , Kiran Raja","doi":"10.1016/j.imavis.2025.105717","DOIUrl":"10.1016/j.imavis.2025.105717","url":null,"abstract":"<div><div>Class Activation Maps (CAMs) are one of the important methods for visualizing regions used by deep learning models. Yet their robustness to different noise remains underexplored. In this work, we evaluate and report the resilience of various CAM methods for different noise perturbations across multiple architectures and datasets. By analyzing the influence of different noise types on CAM explanations, we assess the susceptibility to noise and the extent to which dataset characteristics may impact explanation stability. The findings highlight considerable variability in noise sensitivity for various CAMs. We propose a robustness metric for CAMs that captures two key properties: consistency and responsiveness. Consistency reflects the ability of CAMs to remain stable under input perturbations that do not alter the predicted class, while responsiveness measures the sensitivity of CAMs to changes in the prediction caused by such perturbations. The metric is evaluated empirically across models, different perturbations, and datasets along with complementary statistical tests to exemplify the applicability of our proposed approach.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105717"},"PeriodicalIF":4.2,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145108833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Few-sample video captioning using pre-trained language model with gated bidirectional fusion","authors":"Tao Wang , Ping Li , Zeyu Pan , Hao Wang","doi":"10.1016/j.imavis.2025.105723","DOIUrl":"10.1016/j.imavis.2025.105723","url":null,"abstract":"<div><div>Video captioning generates sentences for describing the video content. Previous works mostly rely on a large number of video samples for training the model, but annotating them is very costly, thus limiting the widespread application of video captioning. This motivates us to explore the way of using only a few labeled samples to describe the video, and propose a few-sample video captioning method by adopting the <strong>P</strong>re-trained language model with <strong>G</strong>ated <strong>B</strong>idirectional <strong>F</strong>usion (PGBF). In particular, we design a triple dynamic gating module that dynamically adjusts the contributions of appearance, motion, and text information to leverage the linguistic knowledge from pre-trained language model. Meanwhile, we develop a bidirectional fusion module to fuse appearance-text features and motion-text features to learn better cross-modal features. Moreover, we introduce a semantic contrastive loss to minimize the gap between visual features (i.e., appearance, motion, and the fused one) and text features (i.e., parsed nouns, verbs and whole sentence). Extensive experiments on three popular benchmarks demonstrate that our method achieves promising video captioning performance by using only a few training samples.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105723"},"PeriodicalIF":4.2,"publicationDate":"2025-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145019089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-attention enhanced dynamic semantic multi-scale graph convolutional network for skeleton-based action recognition","authors":"Shihao Liu, Cheng Xu, Songyin Dai, Nuoya Li, Weiguo Pan, Bingxin Xu, Liu Hongzhe","doi":"10.1016/j.imavis.2025.105725","DOIUrl":"10.1016/j.imavis.2025.105725","url":null,"abstract":"<div><div>Skeleton-based action recognition has attracted increasing attention due to its efficiency and robustness in modeling human motion. However, existing graph convolutional approaches often rely on predefined topologies and struggle to capture high-level semantic relations and long-range dependencies. Meanwhile, transformer-based methods, despite their effectiveness in modeling global dependencies, typically overlook local continuity and impose high computational costs. Moreover, current multi-stream fusion strategies commonly ignore low-level complementary cues across modalities. To address these limitations, we propose SAD-MSNet, a Self-Attention enhanced Multi-Scale dynamic semantic graph convolutional network. SAD-MSNet integrates a region-aware multi-scale skeleton simplification strategy to represent actions at different levels of abstraction. It employs a semantic-aware spatial modeling module that constructs dynamic graphs based on node types, edge types, and topological priors, further refined by channel-wise attention and adaptive fusion. For temporal modeling, the network utilizes a six-branch structure that combines standard causal convolution, dilated joint-guided temporal convolutions with varying dilation rates, and a global pooling branch, enabling it to effectively capture both short-term dynamics and long-range temporal semantics. Extensive experiments on NTU RGB+D, NTU RGB+D 120, and N-UCLA demonstrate that SAD-MSNet achieves superior performance compared to state-of-the-art methods, while maintaining a compact and interpretable architecture.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105725"},"PeriodicalIF":4.2,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145019088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DNLN: Image super-resolution with Deformable Non-Local attention and Multi-Branch Weighted Feature Fusion","authors":"Jiaxin Chen , Dong Xing , Mohammad Shabaz , Yongpei Zhu , Yong Wang , Xianxun Zhu","doi":"10.1016/j.imavis.2025.105721","DOIUrl":"10.1016/j.imavis.2025.105721","url":null,"abstract":"<div><div>Single image super-resolution (SISR) aims to recover a high-resolution image from a low-resolution input. Despite recent advancements, existing methods often fail to fully exploit self-similarities across image scales. In this paper, we introduce the Deformable Non-Local (D-NL) attention module, integrated into a recurrent neural network. The D-NL attention mechanism leverages deformable convolutions to better capture pixel-wise correlations and long-range self-similarities. Additionally, we propose a Multi-Scale Channel Attention Module (MS-CAM) and a Multi-Branch Weighted Feature Fusion (MWFF) cell to enhance feature fusion, effectively identifying and combining features with distinct semantics and scales. Experimental results on benchmark datasets demonstrate that our approach, DNLN, significantly outperforms state-of-the-art methods, underscoring the effectiveness of exploiting long-range self-similarities for SISR.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105721"},"PeriodicalIF":4.2,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145060336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Breast tumor detection in ultrasound images with anatomical prior knowledge","authors":"Liangduan Wu , Yan Zhuang , Guoliang Liao , Lin Han , Zhan Hua , Rui Wang , Ke Chen , Jiangli Lin","doi":"10.1016/j.imavis.2025.105724","DOIUrl":"10.1016/j.imavis.2025.105724","url":null,"abstract":"<div><div>Breast tumor detection is an important step in the procedure of computer-aided diagnosis. In clinical practice, computer-aided diagnosis system not only processes lesion images but also processes normal images without lesions. However, normal images are often overlooked. In this study, we additionally collected numerous normal images to evaluate object detection algorithms. We found that similarities between tumors and hypoechoic regions have led to false positive lesions, and the frequency of false positive lesions in normal images is higher than it in lesion images. To address this issue, we incorporate anatomical prior knowledge of breast tumors to propose a novel breast tumor detection method. Our method consists of a preprocessing method and a novel breast tumor detection network. The preprocessing method automatically extracts breast regions as anatomical constraints and utilizes channel fusion to combine images of breast regions with original images. The proposed breast tumor detection network is based on programmable gradient information and large-kernel convolution. The programmable gradient information is applied by an auxiliary branch which provides more comprehensive gradient information for backpropagation, while large-kernel convolution expands the receptive field of neurons. As a result, our method achieves the best false positive lesion rate of 3.30% and gets a reduction by at least 5.67% over other compared algorithms in normal images, with the best precision of 90.91%, sensitivity of 88.57%, f1-score of 89.75%, and mean average precision of 93.07% in lesion images. Experimental results suggest promising application potential.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105724"},"PeriodicalIF":4.2,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145026589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Dual-branch Progressive Network with spatial-frequency constraint for image fusion","authors":"Zenghui Wang , Wenhao Song , Xuening Xing , Lina Liu , Xianxun Zhu , Mingliang Gao","doi":"10.1016/j.imavis.2025.105709","DOIUrl":"10.1016/j.imavis.2025.105709","url":null,"abstract":"<div><div>Image fusion aims to integrate complementary information from source images to enhance the quality of fused representations. Most existing methods primarily impose pixel-level constraints in the spatial domain, which limits their ability to preserve frequency domain information. Furthermore, single-branch networks typically process source image features uniformly, which hinders cross-modal feature consideration. To address these challenges, we propose a Dual-branch Progressive Network (DPNet) for image fusion. First, a global feature fusion branch is constructed to enhance the extraction of long-range dependencies. This branch promotes global feature interaction through a Global Context Awareness (GCA) module. Subsequently, a local feature fusion branch is designed to extract local information from source images, which comprises multiple Local Feature Attention (LFA) modules to capture valuable local features. Additionally, to preserve both frequency and spatial domain information, we integrate two loss functions that jointly optimize feature retention in both domains. Experimental results on five datasets demonstrate that DPNet surpasses state-of-the-art fusion models both qualitatively and quantitatively. These findings validate its effectiveness for practical applications in military surveillance, environmental monitoring and medical imaging. The code is available at <span><span>https://github.com/zenghui11/DPNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105709"},"PeriodicalIF":4.2,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144932115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}