{"title":"Burst image super-resolution via multi-cross attention encoding and multi-scan state-space decoding","authors":"Tengda Huang, Yu Zhang, Tianren Li, Yufu Qu, Fulin Liu, Zhenzhong Wei","doi":"10.1016/j.imavis.2025.105773","DOIUrl":"10.1016/j.imavis.2025.105773","url":null,"abstract":"<div><div>Multi-image super-resolution (MISR) can achieve higher image quality than single-image super-resolution (SISR) by aggregating sub-pixel information from multiple spatially shifted frames. Among MISR tasks, burst super-resolution (BurstSR) has gained significant attention due to its wide range of applications. Most existing methods use fixed and narrow attention windows, limiting feature perception and hindering alignment and aggregation. To address these limitations, we propose a novel feature extractor that incorporates two newly designed attention mechanisms: overlapping cross-window attention and cross-frame attention, enabling more precise and efficient extraction of sub-pixel information across multiple frames. Furthermore, we introduce a Multi-scan State-Space Module with the cross-frame attention mechanism to enhance feature aggregation. Extensive experiments on both synthetic and real-world benchmarks demonstrate the superiority of our approach. Additional evaluations on ISO 12233 resolution test charts further confirm its enhanced super-resolution performance.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105773"},"PeriodicalIF":4.2,"publicationDate":"2025-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145269522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EDFusion: Edge-guided attention and dynamic receptive field with dense residual for multi-focus image fusion","authors":"Hao Zhai, Zhendong Xu, Zhi Zeng, Lei Yu, Bo Lin","doi":"10.1016/j.imavis.2025.105763","DOIUrl":"10.1016/j.imavis.2025.105763","url":null,"abstract":"<div><div>Multi-focus image fusion (MFIF) synthesizes a fully focused image by integrating multiple partially focused images captured at distinct focal planes of the same scene. However, existing methods often fall short in preserving edge and texture details. To address this issue, this paper proposes a network for multi-focus image fusion that incorporates edge-guided attention and dynamic receptive field dense residuals. The network employs a specially designed dynamic receptive field dense residual block (DRF-DRB) to achieve adaptive multi-scale feature extraction, providing rich contextual information for subsequent fine fusion. Building on this, an edge-guided fusion module (EGFM) explicitly leverages the differences in source images as edge priors to generate dedicated weight maps for each feature channel, enabling precise boundary preservation. To efficiently model global dependencies, we introduce a multi-scale token mixing transformer (MSTM-Transformer), designed to reduce computational complexity while enhancing cross-scale semantic interactions. Finally, a refined multi-scale context upsampling module (MSCU) reconstructs high-frequency details. Experiments were conducted on five public datasets, comparing against twelve state-of-the-art methods and evaluated using nine metrics. Both quantitative and qualitative results demonstrate that the proposed method significantly outperforms existing approaches in fusion performance. Notably, on the Lytro dataset, the proposed method ranked first across eight core metrics, achieving high scores of 1.1946 in the information preservation metric (<span><math><msub><mrow><mi>Q</mi></mrow><mrow><mi>N</mi><mi>M</mi><mi>I</mi></mrow></msub></math></span>) and 0.7629 in the edge information fidelity metric (<span><math><msub><mrow><mi>Q</mi></mrow><mrow><mi>A</mi><mi>B</mi><mo>/</mo><mi>F</mi></mrow></msub></math></span>).</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105763"},"PeriodicalIF":4.2,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145269371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging spatial-channel attention in U-Net for enhanced segmentation of martian dust storms","authors":"Daniele Venturini , Marco Raoul Marini , Luigi Cinque , Gian Luca Foresti","doi":"10.1016/j.imavis.2025.105754","DOIUrl":"10.1016/j.imavis.2025.105754","url":null,"abstract":"<div><div>Automated detection of Martian dust storms is critical for analyzing planetary climate dynamics, yet segmentation remains challenging due to diffuse storm boundaries and data artifacts. This study presents a Convolutional Block Attention Module-enhanced (CBAM-enhanced) U-Net architecture for dust storm segmentation using Mars Reconnaissance Orbiter (MRO) MARCI Mars Daily Global Maps (MDGMs) from the Mars Dust Activity Database (MDAD v1.1). The approach combines attention-driven feature refinement with class-imbalance mitigation and a patching strategy to handle missing data in global maps. The model achieves 0.6502 Intersection over Union (IoU) and 0.6883 Dice scores on MDAD data, outperforming baseline U-Net by 3%, while using 8x fewer parameters (1.95M vs 23M) in comparison to state-of-the-art methods, significantly reducing computational costs. Ablation experiments confirm CBAM reduces false positives and preserves fine boundaries; case studies show the model, in some cases, detects sub-visual dust features missed in ground truth annotations, suggesting potential utility for discovering marginal atmospheric phenomena. This work establishes an efficient framework for processing planetary image data while balancing accuracy and computational practicality.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105754"},"PeriodicalIF":4.2,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145269372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GNN-based primitive recombination for compositional zero-shot learning","authors":"Fuqin Deng , Caiyun Tang , Lanhui Fu , Wei Jin , Jiaming Zhong , Hongming Wang , Nannan Li","doi":"10.1016/j.imavis.2025.105762","DOIUrl":"10.1016/j.imavis.2025.105762","url":null,"abstract":"<div><div>Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute–object combinations, with the core challenge being the complex visual manifestations across compositions. We posit that the key to address this challenge lies in enabling models to simulate human recognition processes by decomposing and dynamically recombining primitives (attributes and objects). Existing methods merely concatenate primitives after extraction to form new combinations, without achieving deep integration between attributes and objects to create truly novel compositions. To address this issue, we propose Graph Neural Network-based Primitive Recombination (GPR) framework. This framework innovatively designs a Primitive Recombination Module (PRM) based on the Compositional Matching Module (CMM). Specifically, we first extract primitives, and build independent attribute and object space based on the CLIP model, enabling more precise learning of primitive-level visual features and reducing information residuals. Additionally, we introduce a Virtual Composition Unit (VCU), which inputs optimized primitive features as nodes into GNN and models complex interaction relationships between attributes and objects through message propagation. The module performs mean pooling on the updated node features to obtain a recombined representation and fuses the global visual information from the original image through residual connections, generating semantically rich virtual compositional features while preserving key visual cues. We conduct extensive experiments on three CZSL benchmark datasets to show that GPR achieves state-of-the-art or competitive performance in both closed-world and open-world settings.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105762"},"PeriodicalIF":4.2,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145269369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced crowd counting with weighted attention network and multi-scale feature integration","authors":"Lifang Zhou , Zhen Hu","doi":"10.1016/j.imavis.2025.105750","DOIUrl":"10.1016/j.imavis.2025.105750","url":null,"abstract":"<div><div>Crowd counting plays a crucial role in the field of computer vision, particularly in practical applications such as traffic monitoring. However, current methods that establish mappings between original images and density maps are not only prone to overfitting but also struggle with occlusion and scale variation in crowded scenes. In this paper, we propose a novel Weighted Attention Focusing Network (WAFNet) to enhance crowd counting performance by decoupling the image-density mapping. Our approach first employs a two-stage model to separate the image density map. It then introduces a weight map, generated by the front-end network, to address the issue of scale variation. Additionally, we incorporate a Multi-Layer Feature Compilation Module (MLFCM) to better preserve and fuse features from multiple layers and adopt a Low-Resolution Feature Enhancement Module (LRFEM) to enhance the low-resolution features of the crowd. Experiments conducted on six benchmark crowd counting datasets demonstrate that our method achieves improved performance, particularly in dense and occluded scenes.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105750"},"PeriodicalIF":4.2,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145269370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CDAF: Cross-Modal and Dual-channel Upsample Adaptive Fusion network for Point Cloud Completion","authors":"Ming Lu , Jian Li , Duo Han Zhao, Qin Wang","doi":"10.1016/j.imavis.2025.105735","DOIUrl":"10.1016/j.imavis.2025.105735","url":null,"abstract":"<div><div>In real-world scenarios, point cloud data often suffers from incompleteness due to limitations in sensor viewpoints, resolution constraints, and self-occlusions, which hinders its applications in domains such as autonomous driving and robotics. To address these challenges, this paper proposes a novel Cross-Modal and Dual-channel Upsample Adaptive Fusion network (CDAF), Our framework innovatively integrates depth maps with point clouds through dual-channel attention and gating units, significantly improving completion accuracy and detail recovery. The framework comprises two core modules: Cross-Modal Feature Enhancement (CMFE) and Dual-channel Upsampling Adaptive Fusion (DUAF). CMFE enhances point cloud feature representation by leveraging Spatial-activated Channel Attention to model channel-wise dependencies and Max-Sigmoid Attention to align cross-modal features between depth maps and point clouds, DUAF progressively refines coarse point clouds through a parallel structural analysis and similarity alignment branches, enabling adaptive fusion of local geometric priors and global shape consistency. Experimental results on multiple benchmark datasets demonstrate that CDAF surpasses existing state-of-the-art methods in point cloud completion tasks, showcasing superior global shape understanding and detail recovery.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105735"},"PeriodicalIF":4.2,"publicationDate":"2025-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145269519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modified ResNet model for medical image-based lung cancer detection","authors":"Zeyad Q. Habeeb , Branislav Vuksanovic , Imad Q. Alzaydi","doi":"10.1016/j.imavis.2025.105752","DOIUrl":"10.1016/j.imavis.2025.105752","url":null,"abstract":"<div><div>Lung cancer is still the most common cause of tumor death in the world. Therefore, there is a great demand to develop diagnostic tools for lung cancer. This research proposes a diagnostically tuned modified ResNet 50 model for detecting and diagnosing lung cancer from chest X-ray images. The architecture of ResNet 50 is adapted to be more suitable for the unique challenges presented by medical imaging data. The modifications include adding extra batch normalization layers for stabilizing training, replacing fully connected layers with global average pooling to reduce overfitting, and adding a squeeze-and-excitation (SE) block that enhances the model's focus on key features such as nodules and lesions. Furthermore, transfer learning was performed on the pre-trained ResNet 50 weights, and the model was fine-tuned to the dataset of images of lungs for better sensitivity regarding cancerous patterns. This modified ResNet 50 was evaluated on a publicly available dataset of lung images from the JSRT dataset, which outperforms the original ResNet 50 and state-of-the-art research. The proposed model achieves high sensitivity, specificity, precision, F1-score and accuracy, which are considered the most important factors in clinical settings. Accuracy reached as high as 98.77% in the detection of lung cancer, as shown by the results. The results also show that the modified ResNet model can be a highly reliable and efficient tool for the early detection of lung cancer. As a result, the improved architecture leads to better diagnostic accuracy and reduced computational complexity so it can be used in medical imaging with real-time applications.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105752"},"PeriodicalIF":4.2,"publicationDate":"2025-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145269517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zero-shot object detection based on cross-modal guided clustering","authors":"Deqiang Cheng , Xingchen Xu , Haoxiang Zhang , Tianshu Song , He Jiang , Qiqi Kou","doi":"10.1016/j.imavis.2025.105664","DOIUrl":"10.1016/j.imavis.2025.105664","url":null,"abstract":"<div><div>At present, contrastive learning has been widely used in Zero-Shot Object Detection (ZSD) and proved to be able to reduce inter-class confusion. However, existing ZSD clustering algorithms operate spontaneously, without effective guidance, and may therefore cluster in the wrong places, as they are constrained to a single visual modality. It is difficult to achieve cross-modal alignment, and textual guidance can help achieve ideal visual clustering. In view of the above problems, this paper proposes a novel zero-shot object detection method based on cross-modal guided clustering, which is a new method for ZSD that combines image-to-image contrast with an auxiliary image-to-text contrast during training. Firstly, an instance-level cross-modal contrastive embedding (ICCE) loss is proposed, by which text similarities are used as dynamic weights to guide the modal to focus on the most confusing categories, and ignoring low similarity ones. A cross-level cross-modal contrastive embedding (CCCE) loss based on ICCE is also designed to provide an ideal guided cluster center. Finally, a cross-modal triplet loss (CTL) is introduced to divide anchors into positive and negative anchors to address the problem that negative samples are difficult to cluster effectively. The first two highlight class-level similarities to avoid misclassification in the most confusing categories, while the last focuses on capturing the most challenging cases to ensure it can handle difficult instances effectively. Experimental tests and comparisons are conducted with the current advanced methods on three baseline databases, and the results demonstrate that the proposed method can achieve a better detection effect, especially when the number of training categories is limited.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105664"},"PeriodicalIF":4.2,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145265162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OneN: Guided attention for natively-explainable anomaly detection","authors":"Pasquale Coscia, Angelo Genovese, Vincenzo Piuri, Fabio Scotti","doi":"10.1016/j.imavis.2025.105741","DOIUrl":"10.1016/j.imavis.2025.105741","url":null,"abstract":"<div><div>In industrial computer vision applications, anomaly detection (AD) is a critical task for ensuring product quality and system reliability. However, many existing AD systems follow a modular design that decouples classification from detection and localization tasks. Although this separation simplifies model development, it often limits generalizability and reduces practical effectiveness in real-world scenarios. Deep neural networks offer strong potential for unified solutions. Nonetheless, most current approaches still treat detection, localization and classification as separate components, hindering the development of more integrated and efficient AD pipelines. To bridge this gap, we propose OneN (One Network), a unified architecture that performs detection, localization, and classification within a single framework. Our approach distills knowledge from a high-capacity convolutional neural network (CNN) into an attention-based architecture trained under varying levels of supervision. The resulting attention maps act as interpretable pseudo-segmentation masks, enabling accurate localization of anomalous regions. To further enhance localization quality, we introduce a progressive focal loss that guides attention maps at each layer to focus on critical features. We validate our method through extensive experiments on both standardized and custom-defined industrial benchmarks. Even under weak supervision, it improves performance, reduces annotation effort, and facilitates scalable deployment in industrial environments.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105741"},"PeriodicalIF":4.2,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145269520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PHMG: Prompt-based Human Motion Generation for action recognition","authors":"Kai Lu, Long Liu, Xin Wang, Siying Ren","doi":"10.1016/j.imavis.2025.105748","DOIUrl":"10.1016/j.imavis.2025.105748","url":null,"abstract":"<div><div>Data generation is an effective method to address inefficient and costly data collection in action recognition. Skeleton data is more robust to illumination and background than RGB data. Therefore, the generation of skeleton motions holds greater value. Existing skeleton motion generation methods generate motions that deviate from the real motion data distribution, leading to blurred inter-class boundaries and adversely affecting action recognition accuracy. In this paper, we propose a Prompt-based Human Motion Generation Network (PHMG), which consists of a Prompt-based Generation Module (PGM) and an Active Optimization Module (AOM). The encoder within the PGM integrates spatio-temporal dual-branch self-attention with graph convolution, effectively capturing both local and global motion features while maintaining the independence of spatio-temporal representations. Moreover, the PGM integrates Contrastive Language–Image Pre-Training (CLIP) encoded textual prompts into the generation process adaptively through the proposed Adaptive Weight(AW). The AOM comprises a recognition network and an active optimization layer. The recognition network produces prediction vectors for the motions generated by the PGM, while the active optimization layer evaluates these vectors using an uncertainty metric to optimize the generated motions. The PGM and AOM operate alternately to generate a refined set of motions iteratively. Extensive experiments on public datasets, namely NTU-RGB+D and NTU-RGB+D 120, reveals that our PHMG achieves excellent results in both qualitative and quantitative assessments. Notably, we attain 2.48 FMD, 92.98% accuracy on NTU-RGB+D, and 9.24 FMD, 58.47% accuracy on NTU-RGB+D 120.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105748"},"PeriodicalIF":4.2,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145269523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}