{"title":"ADPNet: Attention-Driven Dual-Path Network for automated polyp segmentation in colonoscopy","authors":"Mukhtiar Khan , Inam Ullah , Nadeem Khan , Sumaira Hussain , Muhammad ILyas Khattak","doi":"10.1016/j.imavis.2025.105648","DOIUrl":"10.1016/j.imavis.2025.105648","url":null,"abstract":"<div><div>Accurate automated polyp segmentation in colonoscopy images is crucial for early colorectal cancer detection and treatment, a major global health concern. Effective segmentation aids clinical decision-making and surgical planning. Leveraging advancements in deep learning, we propose an Attention-Driven Dual-Path Network (ADPNet) for precise polyp segmentation. ADPNet features a novel architecture with a specialized bridge integrating the Atrous Self-Attention Pyramid Module (ASAPM) and Dilated Convolution-Transformer Module (DCTM) between the encoder and decoder, enabling efficient feature extraction, long-range dependency capture, and enriched semantic representation. The decoder employs pixel shuffle, gated attention mechanisms, and residual blocks to enhance contextual and spatial feature refinement, ensuring precise boundary delineation and noise suppression. Comprehensive evaluations on public polyp datasets show ADPNet outperforms state-of-the-art models, demonstrating superior accuracy and robustness, particularly in challenging scenarios such as small or concealed polyps. ADPNet offers a robust solution for automated polyp segmentation, with potential to revolutionize early colorectal cancer detection and improve clinical outcomes. The code and results of this article are publicly available at https://github.com/Mkhan143/ADPNet.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105648"},"PeriodicalIF":4.2,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144885727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Invariant prompting with classifier rectification for continual learning","authors":"Chunsing Lo , Hao Zhang , Andy J. Ma","doi":"10.1016/j.imavis.2025.105641","DOIUrl":"10.1016/j.imavis.2025.105641","url":null,"abstract":"<div><div>Continual learning aims to train a model capable of continuously learning and retaining knowledge from a sequence of tasks. Recently, prompt-based continual learning has been proposed to leverage the generalization ability of a pre-trained model with task-specific prompts for instruction. Prompt component training is a promising approach to enhancing the plasticity for prompt-based continual learning. Nevertheless, this approach changes the instructed features to be noisy for query samples from the old tasks. Additionally, the problem of scale misalignment in classifier logits between different tasks leads to misclassification. To address these issues, we propose an invariant Prompting with Classifier Rectification (iPrompt-CR) method for prompt-based continual learning. In our method, the learnable keys corresponding to each new-task component are constrained to be orthogonal to the query prototype in the old tasks for invariant prompting, which reduces feature representation noise. After prompt learning, instructed features are sampled from Gaussian-distributed prototypes for classifier rectification with unified logit scale for more accurate predictions. Extensive experimental results on four benchmark datasets demonstrate that our method outperforms the state of the arts in both class-incremental learning and more realistic general incremental learning scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105641"},"PeriodicalIF":4.2,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144632927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DFF-Net: Deep Feature Fusion Network for low-light image enhancement","authors":"Hongchang Zhang, Longtao Wang, Qizhan Zou, Juan Zeng","doi":"10.1016/j.imavis.2025.105645","DOIUrl":"10.1016/j.imavis.2025.105645","url":null,"abstract":"<div><div>Low-light image enhancement methods are designed to improve brightness, recover texture details, restore color fidelity and suppress noise in images captured in low-light environments. Although many low-light image enhancement methods have been proposed, existing methods still face two limitations: (1) the inability to achieve all these objectives at the same time; and (2) heavy reliance on supervised methods that limits practical applicability in real-world scenarios. To overcome these challenges, we propose a Deep Feature Fusion Network (DFF-Net) for low-light image enhancement which builds upon Zero-DCE’s light-enhancement curve. The network is trained without requiring any paired datasets through a set of carefully designed non-reference loss functions. Furthermore, we develop a Fast Deep-level Residual Block (FDRB) to strengthen DFF-Net’s performance, which demonstrates superior performance in both feature extraction and computational efficiency. Comprehensive quantitative and qualitative experiments demonstrate that DFF-Net achieves superior performance in both subjective visual quality and downstream computer vision tasks. In low-light image enhancement experiments, DFF-Net achieves either optimal or sub-optimal metrics across all six public datasets compared to other unsupervised methods. And in low-light object detection experiments, DFF-Net achieves maximum scores in four key metrics on the ExDark dataset: P at 83.3%, F1 at 72.8%, mAP50 at 74.9%, and mAP50-95 at 48.9%. Code is available at <span><span>https://github.com/WangL0ngTa0/DFF-Net</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105645"},"PeriodicalIF":4.2,"publicationDate":"2025-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144596351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ACMC: Adaptive cross-modal multi-grained contrastive learning for continuous sign language recognition","authors":"Xu-Hua Yang, Hong-Xiang Hu, XuanYu Lin","doi":"10.1016/j.imavis.2025.105622","DOIUrl":"10.1016/j.imavis.2025.105622","url":null,"abstract":"<div><div>Continuous sign language recognition helps the hearing-impaired community participate in social communication by recognizing the semantics of sign language video. However, the existing CSLR methods usually only implement cross-modal alignment at the sentence level or frame level, and do not fully consider the potential impact of redundant frames and semantically independent gloss identifiers on the recognition results. In order to improve the limitations of the above methods, we propose an adaptive cross-modal multi-grained contrastive learning (ACMC) for continuous sign language recognition, which achieve more accurate cross-modal semantic alignment through a multi-grained contrast mechanism. First, the ACMC uses the frame extractor and the temporal modeling module to obtain the fine-grained and coarse-grained features of the visual modality in turn, and extracts the fine-grained and coarse-grained features of the text modality through the CLIP text encoder. Then, the ACMC adopts coarse-grained contrast and fine-grained contrast methods to effectively align the features of visual and text modalities from global and local perspectives, and alleviate the semantic interference caused by redundant frames and semantically independent gloss identifiers through cross-grained contrast. In addition, in the video frame extraction stage, we design an adaptive learning module to strengthen the features of key regions of video frames through the calculated discrete spatial feature decision matrix, and adaptively fuse the convolution features of key frames with the trajectory information between adjacent frames, thereby reducing the computational cost. Experimental results show that the proposed ACMC model achieves very competitive recognition results on sign language datasets such as PHOENIX14, PHOENIX14-T and CSL-Daily.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105622"},"PeriodicalIF":4.2,"publicationDate":"2025-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144596350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BSMEF: Optimized multi-exposure image fusion using B-splines and Mamba","authors":"Jinyong Cheng , Qinghao Cui , Guohua Lv","doi":"10.1016/j.imavis.2025.105660","DOIUrl":"10.1016/j.imavis.2025.105660","url":null,"abstract":"<div><div>In recent years, multi-exposure image fusion has been widely applied to process overexposed or underexposed images due to its simplicity, effectiveness, and low cost. With the development of deep learning techniques, related fusion methods have been continuously optimized. However, retaining global information from source images while preserving fine local details remains challenging, especially when fusing images with extreme exposure differences, where boundary transitions often exhibit shadows and noise. To address this, we propose a multi-exposure image fusion network model, BSMEF, based on B-Spline basis functions and Mamba. The B-Spline basis function, known for its smoothness, reduces edge artifacts and enables smooth transitions between images with varying exposure levels. In BSMEF, the feature extraction module, combining B-Spline and deformable convolutions, preserves global features while effectively extracting fine-grained local details. Additionally, we design a feature enhancement module based on Mamba blocks, leveraging its powerful global perception ability to capture contextual information. Furthermore, the fusion module integrates three feature enhancement methods: B-Spline basis functions, attention mechanisms, and Fourier transforms, addressing shadow and noise issues at fusion boundaries and enhancing the focus on important features. Experimental results demonstrate that BSMEF outperforms existing methods across multiple public datasets.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105660"},"PeriodicalIF":4.2,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144605700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BCDPose: Diffusion-based 3D Human Pose Estimation with bone-chain prior knowledge","authors":"Xing Liu , Hao Tang","doi":"10.1016/j.imavis.2025.105636","DOIUrl":"10.1016/j.imavis.2025.105636","url":null,"abstract":"<div><div>Recently, diffusion-based methods have emerged as the golden standard in 3D Human Pose Estimation task, largely thanks to their exceptional generative capabilities. In the past, researchers have made concerted efforts to develop spatial and temporal denoisers utilizing transformer blocks in diffusion-based methods. However, existing Transformer-based denoisers in diffusion models often overlook implicit structural and kinematic supervision derived from prior knowledge of human biomechanics, including prior knowledge of human bone-chain structure and joint kinematics, which could otherwise enhance performance. We hold the view that joint movements are intrinsically constrained by neighboring joints within the bone-chain and by kinematic hierarchies. Then, we propose a <strong>B</strong>one-<strong>C</strong>hain enhanced <strong>D</strong>iffusion 3D pose estimation method, or <strong>BCDPose</strong>. In this method, we introduce a novel Bone-Chain prior knowledge enhanced transformer blocks within the denoiser to reconstruct contaminated 3D pose data. Additionally, we propose the Joint-DoF Hierarchical Temporal Embedding framework, which incorporates prior knowledge of joint kinematics. By integrating body hierarchy and temporal dependencies, this framework effectively captures the complexity of human motion, thereby enabling accurate and robust pose estimation. This innovation proposes a comprehensive framework for 3D human pose estimation by explicitly modeling joint kinematics, thereby overcoming the limitations of prior methods that fail to capture the intrinsic dynamics of human motion. We conduct extensive experiments on various open benchmarks to evaluate the effectiveness of BCDPose. The results convincingly demonstrate that BCDPose achieves highly competitive results compared with other state-of-the-art models. This underscores its promising potential and practical applicability in 2D–3D human pose estimation tasks. We plan to release our code publicly for further research.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105636"},"PeriodicalIF":4.2,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144632926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-scale feature fusion with task-specific data synthesis for pneumonia pathogen classification","authors":"Yinzhe Cui , Jing Liu , Ze Teng , Shuangfeng Yang , Hongfeng Li , Pingkang Li , Jiabin Lu , Yajuan Gao , Yun Peng , Hongbin Han , Wanyi Fu","doi":"10.1016/j.imavis.2025.105662","DOIUrl":"10.1016/j.imavis.2025.105662","url":null,"abstract":"<div><div>Pneumonia pathogen diagnosis from chest X-rays (CXR) is essential for timely and effective treatment for pediatric patients. However, the radiographic manifestations of pediatric pneumonia are often less distinct than those in adults, challenging for pathogen diagnosis, even for experienced clinicians. In this work, we propose a novel framework that integrates an adaptive hierarchical fusion network (AHFF) with task-specific diffusion-based data synthesis for pediatric pneumonia pathogen classification in clinical CXR. AHFF consists of dual branches to extract global and local features, and an adaptive feature fusion module that hierarchically integrates semantic information using cross attention mechanisms. Further, we develop a classifier-guided diffusion model that uses the task-specific AHFF classifier to generate class-consistent chest X-ray images for data augmentation. Experiments on one private and two public datasets demonstrate that the proposed classification model achieves superior performance, with accuracies of 78.00%, 84.43%, and 91.73%, respectively. Diffusion-based augmentation further improves accuracy to 84.37% using the private dataset. These results highlight the potential of feature fusion and data synthesis for improving automated pathogen-specific pneumonia diagnosis in clinical settings.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105662"},"PeriodicalIF":4.2,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144896514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MaxSwap-Enhanced Knowledge Consistency Learning for long-tailed recognition","authors":"Shengnan Fan, Zhilei Chai, Zhijun Fang, Yuying Pan, Hui Shen, Xiangyu Cheng, Qin Wu","doi":"10.1016/j.imavis.2025.105643","DOIUrl":"10.1016/j.imavis.2025.105643","url":null,"abstract":"<div><div>Deep learning has made significant progress in image classification. However, real-world datasets often exhibit a long-tailed distribution, where a few head classes dominate while many tail classes have very few samples. This imbalance leads to poor performance on tail classes. To address this issue, we propose MaxSwap-Enhanced Knowledge Consistency Learning which includes two core components: Knowledge Consistency Learning and MaxSwap for Confusion Suppression. Knowledge Consistency Learning leverages the outputs from different augmented views as soft labels to capture inter-class similarities and introduces a consistency constraint to enforce output consistency across different perturbations, which enables tail classes to effectively learn from head classes with similar features. To alleviate the bias towards head classes, we further propose a MaxSwap for Confusion Suppression to adaptively adjust the soft labels when the model makes incorrect predictions which mitigates overconfidence in incorrect predictions. Experimental results demonstrate that our method achieves significant improvements on long-tailed datasets such as CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and Places-LT, which validates the effectiveness of our approach.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105643"},"PeriodicalIF":4.2,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144596349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MG-KG: Unsupervised video anomaly detection based on motion guidance and knowledge graph","authors":"Qiyue Sun , Yang Yang , Haoxuan Xu , Zezhou Li , Yunxia Liu , Hongjun Wang","doi":"10.1016/j.imavis.2025.105644","DOIUrl":"10.1016/j.imavis.2025.105644","url":null,"abstract":"<div><div>Unsupervised Video Anomaly Detection (VAD) is a challenging and research-valuable task that is trained with only normal samples to detect anomalous samples. However, current solutions face two key issues: (1) a lack of spatio-temporal linkage in video data, and (2) limited interpretability of VAD results. To address these, we propose a new method named Motion Guidance-Knowledge Graph (MG-KG), inspired by video saliency detection and video understanding methods. Specifically, MG-KG has two components: the Motion Guidance Network (MGNet) and the Knowledge Graph retrieval for VAD (VAD-KG). MGNet emphasizes motion in the video foreground, crucial for real-time surveillance, while VAD-KG builds a knowledge graph to store structured video information and retrieve it during testing, enhancing interpretability. This combination improves both generalization and understanding in VAD for smart surveillance systems. Additionally, since training data has only normal samples, we propose a training baseline strategy, a tabu search strategy, and a score rectification strategy to enhance MG-KG for video anomaly detection tasks, which can further exploit the potential of MG-KG and significantly improve the performance of VAD. Extensive experiments demonstrate that MG-KG achieves competitive results in VAD for intelligent video surveillance.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105644"},"PeriodicalIF":4.2,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144665469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Composed image retrieval by Multimodal Mixture-of-Expert Synergy","authors":"Wenzhe Zhai , Mingliang Gao , Gwanggil Jeon , Qiang Zhou , David Camacho","doi":"10.1016/j.imavis.2025.105634","DOIUrl":"10.1016/j.imavis.2025.105634","url":null,"abstract":"<div><div>Composed image retrieval (CIR) is essential in security surveillance, e-commerce, and social media analysis. It provides precise information retrieval and intelligent analysis solutions for various industries. The majority of existing CIR models create a pseudo-word token from the reference image, which is subsequently incorporated into the corresponding caption for the image retrieval task. However, these pseudo-word-based prompting approaches are limited when the target image entails complex modifications to the reference image, such as object removal and attribute changes. To address the issue, we propose a Multimodal Mixture-of-Expert Synergy (MMES) model to achieve effective composed image retrieval. The MMES model initially utilizes multiple Mixture of Expert (MoE) modules through the mixture expert unit to process various types of multimodal input data. Subsequently, the outputs from these expert models are fused through the cross-modal integration module. Furthermore, the fused features generate implicit text embedding prompts, which are concatenated with the relative descriptions. Finally, retrieval is conducted using a text encoder and an image encoder. The Experiments demonstrate that the proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105634"},"PeriodicalIF":4.2,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144581183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}