{"title":"D2PCFN: Dual domain progressive cross-fusion network for remote sensing image pansharpening","authors":"Biyun Xu , Yan Zheng , Suleman Mazhar , Zhenghua Huang","doi":"10.1016/j.cviu.2025.104525","DOIUrl":"10.1016/j.cviu.2025.104525","url":null,"abstract":"<div><div>High-resolution multispectral (HRMS) image generation through pansharpening requires effective integration of spatial details from panchromatic (PAN) images and spectral information from low-resolution multispectral (LRMS) images. Existing methods often overlook interactions between deep features across different depths and modalities, resulting in spectral distortion and loss of spatial detail. To address this, we propose a dual domain progressive cross-fusion network (D2PCFN) that progressively integrates features in both spatial and frequency domains. The network consists of a dual-branch feature generation module (DBFGM) for deep feature extraction, a dual domain cross-fusion module (D2CFM) for cross-interaction between spatial and frequency representations, and a deep feature reconstruction module (DFRM) for synthesizing high-quality outputs. Extensive experiments on GaoFen-2, QuickBird, WorldView-3, and WorldView-2 datasets demonstrate that our method achieves state-of-the-art accuracy, with average gains of 1.77% in SAM, 1.70% in ERGAS, 0.89% in PSNR, and 1.37% in HQNR over leading methods. Both quantitative and qualitative results confirm the effectiveness and generalization ability of the proposed D2PCFN. Source code will also be shared on <span><span>https://github.com/MysterYxby/D2PCFN</span><svg><path></path></svg></span>-website link after publication.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"262 ","pages":"Article 104525"},"PeriodicalIF":3.5,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145271415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IP-CAM: Class activation mapping based on importance weights and principal-component weights for better and simpler visual explanations","authors":"Wenyi Zhang, Haoran Zhang, Xisheng Zhang, Xiaohua Shen, Lejun Zou","doi":"10.1016/j.cviu.2025.104523","DOIUrl":"10.1016/j.cviu.2025.104523","url":null,"abstract":"<div><div>Visual explanations of deep neural networks (DNNs) have gained considerable importance in deep learning due to the lack of interpretability, which constrains human trust in DNNs. This paper proposes a new gradient-free class activation map (CAM) architecture called importance principal-component CAM (IP-CAM). The architecture not only improves the prediction accuracy of networks but also provides simpler and more reliable visual explanations. It adds importance weight layers before the classifier and assigns an importance weight to each activation map. After fine-tuning, it selects images with the highest prediction score for each class, performs principal component analysis (PCA) on activation maps of all channels, and regards the eigenvector of the first principal component as principal-component weights for that class. The final saliency map is obtained by linearly combining the activation maps, importance weights and principal-component weights. IP-CAM is evaluated on the ILSVRC 2012 dataset and RSD46-WHU dataset, whose results show that IP-CAM performs better than most previous CAM variants in recognition and localization tasks. Finally, the method is applied as a tool for interpretability, and the results illustrate that IP-CAM effectively unveils the decision-making process of DNNs through saliency maps.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104523"},"PeriodicalIF":3.5,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145268593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gloss-free Sign Language Translation: An unbiased evaluation of progress in the field","authors":"Ozge Mercanoglu Sincan, Jian He Low, Sobhan Asasi, Richard Bowden","doi":"10.1016/j.cviu.2025.104498","DOIUrl":"10.1016/j.cviu.2025.104498","url":null,"abstract":"<div><div>Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa. While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss-free SLT models by re-implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature often diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available here<span><span><sup>1</sup></span></span> to support transparency and reproducibility in SLT research.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104498"},"PeriodicalIF":3.5,"publicationDate":"2025-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145268591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MSBATN: Multi-Stage Boundary-Aware Transformer Network for action segmentation in untrimmed surgical videos","authors":"Rezowan Shuvo, M.S. Mekala, Eyad Elyan","doi":"10.1016/j.cviu.2025.104521","DOIUrl":"10.1016/j.cviu.2025.104521","url":null,"abstract":"<div><div>Understanding actions within surgical workflows is critical for evaluating post-operative outcomes and enhancing surgical training and efficiency. Capturing and analysing long sequences of actions in surgical settings is challenging due to the inherent variability in individual surgeon approaches, which are shaped by their expertise and preferences. This variability complicates the identification and segmentation of distinct actions with ambiguous boundary start and end points. The traditional models, such as MS-TCN, which rely on large receptive fields, cause over-segmentation or under-segmentation, where distinct actions are incorrectly aligned. To address these challenges, we propose the Multi-Stage Boundary-Aware Transformer Network (MSBATN) with hierarchical sliding window attention to improve action segmentation. Our approach effectively manages the complexity of varying action durations and subtle transitions by accurately identifying start and end action boundaries in untrimmed surgical videos. MSBATN introduces a novel unified loss function that optimises action classification and boundary detection as interconnected tasks. Unlike conventional binary boundary detection methods, our innovative boundary weighing mechanism leverages contextual information to precisely identify action boundaries. Extensive experiments on three challenging surgical datasets demonstrate that MSBATN achieves state-of-the-art performance, with superior F1 scores at 25% and 50% thresholds and competitive results across other metrics.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104521"},"PeriodicalIF":3.5,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145268592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SVFFNet: A Scale-Aware Voxel Flow Fusion Network for video prediction","authors":"Yao Zhou , Jinpeng Wei , Xueyong Zhang , Yusong Zhai , Jian Wei","doi":"10.1016/j.cviu.2025.104520","DOIUrl":"10.1016/j.cviu.2025.104520","url":null,"abstract":"<div><div>Video prediction is a challenging task due to the potential for various motion scales in the complex scene. The diversity of motion scales stems from the time-variant and object-dependent motion magnitudes, as well as the multiple image resolutions across datasets. However, the vast majority of frame forecasting networks do not distinguish between treatment of different motion scales. Therefore, their receptive field is normally insufficient to capture larger-scale motions. Those that do, often yield significant local distortions in the predicted images. The reasons lie in their fixed choice of scale factors and lack of cross-scale interaction between motion features. In this work, we propose a Scale-Aware Voxel Flow Fusion Network (SVFFNet) to address the motion scale inconsistency problem and fully integrate multi-scale feature. This network consists of a set of flow estimation blocks, each block containing a selector module and a fusion module. The selector module adaptively selects the appropriate scale-processing branch for the input frames, thus facilitating acquisition of more refined features for large-scale motion. The fusion module then combines these features with the original motion information via an attention mechanism, preserving the actually existing structural details. Experimental results on four widely used benchmark datasets demonstrate that our method outperforms previously published baselines for video prediction. The code is available at: <span><span>https://github.com/zyaojlu/SVFFNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104520"},"PeriodicalIF":3.5,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145268590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalization-preserving adaptation of vision-language models for open-vocabulary segmentation","authors":"Zhen Chen, Hao Tang, Shiliang Zhang","doi":"10.1016/j.cviu.2025.104518","DOIUrl":"10.1016/j.cviu.2025.104518","url":null,"abstract":"<div><div>Recent progress in large-scale Vision-Language Models (VLMs) has significantly advanced open-vocabulary segmentation. Previous works typically either generate class-agnostic masks and classify them with frozen VLMs, or align the mask generator features with VLM text features. These approaches face challenges of weak spatial discrimination ability of frozen VLMs and poor generalization due to unreliable vision-language alignment. This paper introduces a novel Generalization-Preserving Adaptation (GPA) of VLMs for open-vocabulary segmentation. GPA enhances the spatial discrimination capability of pre-trained VLMs through an efficient fine-tuning scheme, which incorporates a spatial adaptation module comprising spatial dependency modeling and low-rank feature modulation for preserving the feature space. Additionally, GPA proposes a context-aware feature aggregation module to extract mask features better aligned with the VLM features for mask classification. It performs decoupled context modeling that generates object-agnostic contextualized feature map and object-specific classification maps for accentuating discriminative and contextual clues. By maintaining the original VLM feature distribution for vision-language alignment, GPA effectively preserves the generalization capabilities of VLMs while enhancing segmentation performance. Extensive experiments on multiple open-vocabulary panoptic and semantic segmentation benchmarks demonstrate both superior effectiveness and generalization capabilities compared to previous works.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104518"},"PeriodicalIF":3.5,"publicationDate":"2025-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145221697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synergistic dual and efficient additive attention network for No-Reference Image Quality Assessment","authors":"Zhou Fang, Baiming Feng, Ning Li","doi":"10.1016/j.cviu.2025.104516","DOIUrl":"10.1016/j.cviu.2025.104516","url":null,"abstract":"<div><div>No-Reference Image Quality Assessment (NR-IQA) aims to evaluate the perceptual quality of images in alignment with human subjective judgments. However, most existing NR-IQA methods, while striving for high accuracy, often neglect computational complexity. To address this challenge, we propose a Synergistic Spatial and Channel and Efficient Additive Attention Network for NR-IQA. In our approach, we first employ a feature extraction module to derive features rich in both distortion and semantic information. Subsequently, we introduce a spatial-channel synergistic attention mechanism to enhance feature representations across spatial and channel dimensions. This attention module focuses on the most salient regions of the image and modulates feature responses accordingly, enabling the network to emphasize critical distortions and semantic features pertinent to perceptual quality assessment. Specifically, the spatial attention mechanism identifies significant regions that substantially contribute to quality perception, while the channel attention mechanism adjusts the importance of each feature channel, ensuring effective utilization of spatial and channel-specific information. Furthermore, to enhance the model’s robustness, we incorporate an Efficient Additive Attention mechanism alongside a Multi-scale Feed-forward Network, designed to reduce computational costs without compromising performance. Finally, a dual-branch structure for patch-weighted quality prediction is employed to derive the final quality score based on the weighted scores of individual patches. Extensive experimental evaluations on four widely used benchmark datasets demonstrate that the proposed method surpasses several state-of-the-art NR-IQA approaches in both performance and computational efficiency.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104516"},"PeriodicalIF":3.5,"publicationDate":"2025-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145220931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond geometry: The power of texture in interpretable 3D person ReID","authors":"Huy Nguyen, Kien Nguyen, Akila Pemasiri, Sridha Sridharan, Clinton Fookes","doi":"10.1016/j.cviu.2025.104517","DOIUrl":"10.1016/j.cviu.2025.104517","url":null,"abstract":"<div><div>This paper presents FusionTexReIDNet, a robust framework for 3D person re-identification that uniquely leverages UVTexture to enhance both performance and explainability. Unlike existing 3D person ReID approaches that simply overlay textures on point clouds, our method exploits the full potential of UVTexture through its high resolution and normalized coordinate properties. The framework consists of two main streams: a UVTexture stream that processes appearance features and a 3D stream that handles geometric information. These streams are fused through an effective combination of KNN, attribute-based, and explainable re-ranking strategies. Our approach introduces explainability to 3D person ReID through the visualization of activation maps on UVTextures, providing insights into the model’s decision-making process by highlighting discriminative regions. By incorporating the Intersection-Alignment Score derived from activation maps and visible clothing masks, we further improve the ReID accuracy. Extensive experiments demonstrate that FusionTexReIDNet achieves state-of-the-art performance across various scenarios, with Rank-1 accuracies of 98.5% and 89.7% Rank-1 on benchmark datasets, while providing interpretable results through its explainable component.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104517"},"PeriodicalIF":3.5,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145221698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Iterative Caption Generation with Heuristic Guidance for enhancing knowledge-based visual question answering","authors":"Fengyuan Liu , Zhongjian Hu , Peng Yang , Xingyu Liu","doi":"10.1016/j.cviu.2025.104515","DOIUrl":"10.1016/j.cviu.2025.104515","url":null,"abstract":"<div><div>The advent of large language models (LLMs) has significantly advanced Knowledge-based Visual Question Answering (KBVQA) by reducing the reliance on external knowledge bases. Traditional methods often generate captions in a single pass, which can struggle with complex questions due to difficulty in precisely identifying key visual components. This challenge undermines the reasoning capabilities of LLMs, which require accurate, semantically aligned captions to answer complex questions effectively. To address this limitation, we propose ICGHG <strong><u>I</u></strong>terative <strong><u>C</u></strong>aption <strong><u>G</u></strong>eneration with <strong><u>H</u></strong>euristic <strong><u>G</u></strong>uidance, a novel framework that refines captions iteratively. Our approach incorporates a dynamic loop where captions are continuously refined based on heuristic feedback from a set of candidate answers and the question itself, ensuring that the final caption provides accurate semantic alignment with both the visual content and the question. By leveraging this iterative process, ICGHG mitigates common issues such as hallucinations and improves the quality of the generated captions. Extensive experiments on OK-VQA, A-OKVQA, and FVQA datasets demonstrate that ICGHG significantly outperforms existing methods, achieving 57.5%, 60.2%, and 69.4% accuracy on their respective test sets, setting new benchmarks in KBVQA accuracy.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104515"},"PeriodicalIF":3.5,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145220932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KD-Mamba: Selective state space models with knowledge distillation for trajectory prediction","authors":"Shaokang Cheng , Sourav Das , Shiru Qu , Lamberto Ballan","doi":"10.1016/j.cviu.2025.104499","DOIUrl":"10.1016/j.cviu.2025.104499","url":null,"abstract":"<div><div>Trajectory prediction is a key component of intelligent mobility systems and human–robot interaction. The inherently stochastic nature of human behavior, coupled with external environmental influences, poses significant challenges for long-term prediction. However, existing approaches struggle to effectively model spatial interactions and accurately predict long-term destinations, while their high computational demands limit real-world applicability. To address these limitations, this paper presents KD-Mamba, the Selective State Space Models with Knowledge Distillation for trajectory prediction. The model incorporates the U-CMamba module, which features a U-shaped encoder–decoder architecture. By integrating convolutional neural networks (CNN) with the Mamba mechanism, this module effectively captures local spatial interactions and global contextual information of human motion patterns. Subsequently, we introduce a Bi-Mamba module, which captures long-term dependencies in human movement, ensuring a more accurate representation of trajectory dynamics. Knowledge distillation strengthens both modules by facilitating knowledge transfer across diverse scenarios. Compared to transformer-based approaches, KD-Mamba reduces computational complexity from quadratic to linear. Extensive experimental results from two real-world trajectory datasets indicate that KD-Mamba outperforms the existing mainstream baselines. The proposed method provides insights into the application of trajectory prediction in human-in-the-loop assistive systems.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104499"},"PeriodicalIF":3.5,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145109497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}