{"title":"Enhancing cross-domain generalization in retinal image segmentation via style randomization and style normalization","authors":"Song Guo","doi":"10.1016/j.imavis.2025.105694","DOIUrl":"10.1016/j.imavis.2025.105694","url":null,"abstract":"<div><div>Retinal image segmentation is a crucial procedure for automatically diagnosing ophthalmic diseases. However, existing deep learning-based segmentation models suffer from the domain shift issue, i.e., the segmentation accuracy decreases significantly when the test and training images are sampled from different distributions. To overcome this issue, we focus on the challenging single-source domain generalization scenario, where we expect to train a well-generalized segmentation model on unseen test domains with only access to one domain during training. In this paper, we present a style randomization method, which performs random scaling transformation to the LAB components of the training image, to enrich the style diversity. Furthermore, we present a style normalization method to effectively normalize style information while preserving content by channel-wise feature standardization and dynamic feature affine transformation. Our approach is evaluated on four types of retinal image segmentation tasks, including retinal vessel, optic cup, optic disc, and hard exudate. Experimental results demonstrate that our method achieves competitive or superior performance compared to state-of-the-art approaches. Specifically, it outperforms the second-best method by 3.9%, 2.6%, and 4.8% on vessel, optic cup, and hard exudate segmentation tasks, respectively. Our code will be released at <span><span>https://github.com/guomugong/SRN</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105694"},"PeriodicalIF":4.2,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144828812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advancements in basketball action recognition: Datasets, methods, explainability, and synthetic data applications","authors":"Marco Caruso , Lucia Cimmino , Fabio Narducci , Chiara Pero , Gianluca Ronga","doi":"10.1016/j.imavis.2025.105689","DOIUrl":"10.1016/j.imavis.2025.105689","url":null,"abstract":"<div><div>Basketball Action Recognition (BAR) has received increasing attention in the fields of computer vision and artificial intelligence, serving as a fundamental component in performance evaluation, automated game annotation, tactical analysis, and referee decision-making support. Despite notable advancements driven by deep learning approaches, BAR remains a challenging task due to the inherent complexity of basketball movements, frequent occlusions, and limited availability of standardized benchmark datasets. This survey provides a comprehensive and structured synthesis of current developments in BAR research, encompassing four principal dimensions: dataset curation, computational methodologies, synthetic data generation, and model explainability. A critical analysis of publicly available basketball-specific datasets is presented, delineating their modalities, annotation strategies, action taxonomies, and representational scope. Furthermore, the survey offers a structured classification of state-of-the-art action recognition methodologies, ranging from video-based and skeleton-based models to sensor-driven and multimodal fusion approaches, emphasizing architectural characteristics, evaluation protocols, and task-specific adaptations. The role of synthetic data is systematically examined as a means to address data scarcity, reduce annotation noise, and enhance model generalization through controlled variability and simulation-based augmentation. In parallel, the integration of explainable artificial intelligence (XAI) techniques is also analyzed, with a focus on post-hoc attribution methods, probabilistic reasoning models, and interpretable neural architectures, aimed at improving the transparency and accountability of decision-making processes. The survey identifies persisting research challenges, including dataset heterogeneity, limitations in cross-domain transferability, and the accuracy-interpretability trade-off in deep models. By delineating current limitations and prospective directions, this work provides a foundational reference to guide the development of robust, generalizable, and explainable BAR systems for deployment in real-world sports intelligence applications.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105689"},"PeriodicalIF":4.2,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144828810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A method for skin lesion detection and localization by means of Deep Learning and reliable prediction explainability","authors":"Marcello Di Giammarco , Antonella Santone , Mario Cesarelli , Fabio Martinelli , Francesco Mercaldo","doi":"10.1016/j.imavis.2025.105675","DOIUrl":"10.1016/j.imavis.2025.105675","url":null,"abstract":"<div><div>Skin lesions are any abnormal growths or appearances on the skin, ranging from benign (i.e., non-cancerous) to malignant (i.e., cancerous). The identification of a skin lesion is a crucial task that is carried out in short periods of time to initiate an eventual therapeutic treatment. In this paper, we propose a method for automatic skin lesion detection, implementing Convolutional Neural Networks. Moreover, with the aim of providing a rationale behind the model prediction, we also consider explainability by adopting two different Class Activation Mapping algorithms, which highlight regions in skin images that most contribute to the network’s classification decision. We also include the indices of similarity for further quantitative analysis. Several Convolutional Neural Networks are considered, by obtaining the best results with the MobileNet model, achieving an accuracy equal to 0.935 in skin lesion detection. Moreover, in the experimental analysis, we discuss the effectiveness of Class Activation Mapping algorithms exploited for skin lesion localization.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105675"},"PeriodicalIF":4.2,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144840691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A deep learning approach for contrast-agent-free breast lesion detection and classification using adversarial synthesis of contrast-enhanced mammograms","authors":"Manar N. Amin , Muhammad A. Rushdi , Rasha Kamal , Amr Farouk , Mohamed Gomaa , Noha M. Fouad , Ahmed M. Mahmoud","doi":"10.1016/j.imavis.2025.105692","DOIUrl":"10.1016/j.imavis.2025.105692","url":null,"abstract":"<div><div>Contrast-enhanced digital mammography (CEDM) has emerged as a promising complementary imaging modality for breast cancer diagnosis, offering enhanced lesion visualization and improved diagnostic accuracy, particularly for patients with dense breast tissues. However, the reliance of CEDM on contrast agents poses challenges to patient safety and accessibility. To overcome those challenges, this paper introduces a deep learning methodology for improved breast lesion detection and classification. In particular, an image-to-image translation model based on cycle-consistent generative adversarial networks (CycleGAN) is utilized to generate synthetic CEDM (SynCEDM) images from full-field digital mammography in order to enhance visual contrast perception without the need for contrast agents. A new dataset of 3958 pairs of low-energy (LE) and CEDM images was collected from 2908 female subjects to train the CycleGAN model to generate SynCEDM images. Thus, we trained different You-Only-Look-Once (YOLO) architectures on CEDM and SynCEDM images for breast lesion detection and classification. SynCEDM images were generated with a structural similarity index (SSIM) of 0.94 ± 0.02. A YOLO lesion detector trained on original CEDM images led to a 91.34% accuracy, a 90.37% sensitivity, and a 92.06% specificity. In comparison, a detector trained on the SynCEDM images exhibited a comparable accuracy of 91.20%, a marginally higher sensitivity of 91.44%, and a slightly lower specificity of 91.30%. This approach not only aims to mitigate contrast agent risks but also to improve breast cancer detection and characterization using mammography.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105692"},"PeriodicalIF":4.2,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144828811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EGU-GS: Efficient Gaussian utilization for real-time 3D Gaussian splatting","authors":"Zhiyu Zheng, Dake Zhou, Yiming Shao, Xin Yang","doi":"10.1016/j.imavis.2025.105687","DOIUrl":"10.1016/j.imavis.2025.105687","url":null,"abstract":"<div><div>In recent years, 3D Gaussian Splatting (3DGS) has garnered significant attention for its superior rendering quality and real-time performance. However, the inefficient utilization of Gaussians in 3DGS necessitates the use of millions of Gaussian primitives to adapt to the geometry and appearance of 3D scenes, leading to significant redundancy. To address this issue, we propose an efficient adaptive density control strategy that incorporates Cross-Section-Oriented splitting and Heterogeneous cloning operations. These modifications prevent the proliferation of redundant Gaussians and improve Gaussian utilization. Furthermore, we introduce opacity adaptive pruning, adaptive thresholds, and Gaussian importance weights to refine the Gaussian selection process. Our post-processing Gaussian refinement pruning further eliminates small-scale and low-opacity Gaussians. Experimental results on various challenging datasets demonstrate that our method achieves state-of-the-art rendering quality while consuming less storage space, reducing the number of Gaussians by up to 42% compared to 3DGS. The code is available at: <span><span>https://github.com/zhiyu-cv/EGU</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105687"},"PeriodicalIF":4.2,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144896516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Empowering cardiovascular diagnostics with SET-MobileNet: A lightweight and accurate deep learning based classification approach","authors":"Zunair Safdar , Jinfang Sheng , Muhammad Usman Saeed , Muhammad Ramzan , A. Al-Zubaidi","doi":"10.1016/j.imavis.2025.105684","DOIUrl":"10.1016/j.imavis.2025.105684","url":null,"abstract":"<div><div>Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, necessitating early detection and accurate diagnosis for improved patient outcomes. This study introduces SET-MobileNet, a lightweight deep learning model designed for automated heart sound classification, integrating transformers to capture long-range dependencies and squeeze-and-excitation (SE) blocks to emphasize relevant acoustic features while suppressing noise artifacts. Unlike traditional methods that rely on handcrafted features, SET-MobileNet employs a multimodal feature extraction approach, incorporating log-mel spectrograms, Mel-Frequency Cepstral Coefficients (MFCCs), chroma features, and zero-crossing rates to enhance classification robustness. The model is evaluated across multiple publicly available heart sound datasets, including CirCor, HSS, GitHub, and Heartbeat Sounds, achieving a state-of-the-art accuracy of 99.95% for 2.0-second heart sound segments in the CirCor dataset. Extensive experiments demonstrate that multimodal feature representations significantly improve classification performance by capturing both time-frequency and spectral characteristics of heart sounds. SET-MobileNet is computationally efficient, with a model size of 8.61 MB and single-sample inference times under 6.5 ms, making it suitable for real-time deployment on mobile and embedded devices. Ablation studies confirm the contributions of transformers and SE blocks, showing incremental improvements in accuracy and noise suppression.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105684"},"PeriodicalIF":4.2,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144757950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MMCFNet: Multi-scale and multi-modal complementary fusion network for light field salient object detection","authors":"Xin Hu, Fen Chen, Zongju Peng, Lian Huang, Jiawei Xu","doi":"10.1016/j.imavis.2025.105680","DOIUrl":"10.1016/j.imavis.2025.105680","url":null,"abstract":"<div><div>Light field salient object detection (LFSOD) has received growing attention in recent years. Light field cameras record the direction and intensity of light in a scene, and they provide focal stacks and all-focus images with different but complementary characteristics. Previous LFSOD models lack effective feature fusion for multi-scale and multi-modal information, which leads to background interference or incomplete salient objects. In this paper, we propose a new multi-scale and multi-modal complementary fusion network (MMCFNet) for LFSOD. For the focal stacks, we design a slice interweaving enhancement module (SIEM) to emphasize the useful features among different slices and reduce inconsistency. In addition, we propose a new multi-scale and multi-modal fusion strategy, which contains high-level feature fusion module (HFFM), cross attention module (CrossA), and compact pyramid refinement (CPR) module. The HFFM fuses high-level multi-scale and multi-modal semantic information to accurately locate salient objects. The CrossA enhances low-level spatial-channel information and refines salient object edges. Finally, we use the CPR module to aggregate the multi-scale information and decode it into high-quality saliency maps. Extensive experiments on public datasets show that our method outperforms 11 state-of-the-art LFSOD methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105680"},"PeriodicalIF":4.2,"publicationDate":"2025-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144770669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph hashing network for image retrieval","authors":"Xudong Zhou , Jun Tang , Ke Wang , Nian Wang , Han Chen","doi":"10.1016/j.imavis.2025.105677","DOIUrl":"10.1016/j.imavis.2025.105677","url":null,"abstract":"<div><div>Deep supervised hashing is more popular among researchers due to its satisfactory computational efficiency and retrieval performance. Most existing models learn hash codes for data by constructing inter-sample pair-wise or triplet losses, allowing for consideration of the topological information from the label space. However, the topological relationships among samples in the feature space are not fully explored, which may result in less discriminative hash codes. To address this issue, we propose a novel graph hashing network (GHash) for image retrieval. Our GHash explores positional relationships among samples under a large receptive field through alternating updates of graph nodes and edges, generating high-quality image descriptors based on optimized positional relationships and neighborhood information. Subsequently, graph-level descriptors are mapped into highly discriminative hash codes. Additionally, we introduce an extra classification loss to enhance the accuracy of the topological relationships among samples in the graph by supervising the learning of edge features. Finally, we conduct extensive comparison and ablation experiments on three benchmark datasets, with results demonstrating that our method achieves superior retrieval performance compared to state-of-the-art deep hashing methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105677"},"PeriodicalIF":4.2,"publicationDate":"2025-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144770667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BTMTrack: Robust RGB-T tracking via dual-template bridging and temporal-modal candidate elimination","authors":"Zhongxuan Zhang, Bi Zeng, Xinyu Ni, Yimin Du","doi":"10.1016/j.imavis.2025.105676","DOIUrl":"10.1016/j.imavis.2025.105676","url":null,"abstract":"<div><div>RGB-T tracking leverages the complementary strengths of RGB and thermal infrared (TIR) modalities to handle challenging scenarios, such as low illumination and adverse weather conditions. However, existing methods often struggle to effectively integrate temporal information and perform efficient cross-modal interactions, limiting their adaptability to dynamic targets. In this paper, we propose BTMTrack, a novel RGB-T tracking framework. At its core lies a dual-template backbone and a Temporal-Modal Candidate Elimination (TMCE) strategy. The dual-template backbone enables the effective integration of temporal information. At the same time, the TMCE strategy guides the model to focus on target-relevant tokens by evaluating temporal and modal correlations through attention correlation maps across different modalities. This not only reduces computational overhead but also mitigates the influence of irrelevant background noise. Building on this foundation, we introduce the Temporal Dual-Template Bridging (TDTB) module, which utilizes a cross-modal attention mechanism to process dynamically filtered tokens, thereby enhancing precise cross-modal fusion. This approach further strengthens the interaction between templates and the search region. Extensive experiments conducted on three benchmark datasets demonstrate the effectiveness of BTMTrack. Our method achieves state-of-the-art performance, with a 72.3% precision rate on the LasHeR test set and competitive results on the RGBT210 and RGBT234 datasets.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105676"},"PeriodicalIF":4.2,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144749145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pyramidal attention with progressive multi-stage iterative feature refinement for salient object segmentation","authors":"Rahim Khan , Nada Alzaben , Yousef Ibrahim Daradkeh , Xianxun Zhu , Inam Ullah","doi":"10.1016/j.imavis.2025.105670","DOIUrl":"10.1016/j.imavis.2025.105670","url":null,"abstract":"<div><div>Accurate detection of salient objects in complex visual scenes remains a fundamental yet challenging task in visual intelligence, often impeded by significant scale variation, background clutter, and indistinct object boundaries. While recent approaches attempt to exploit multi-level features, they frequently encounter limitations such as semantic misalignment across feature hierarchies, spatial detail degradation, and weak cross-dataset generalization. To overcome these challenges, we propose a novel Pyramidal Attention Mechanism (PAM) with Progressive Multi-stage Iterative Feature Refinement Network (PIFRNet) designed for robust and precise Salient Object Detection (SOD). Specifically, our method begins by hierarchically aggregating features from four representative stages of a powerful backbone, ensuring rich multi-scale context and semantic diversity. To bridge semantic gaps and recover fine structures, we introduce a Progressive Bilateral Feature Refinement (PBFR) module, which enhances early-stage features through cascaded convolutions and spatial attention. Furthermore, the novel PAM, equipped with dilated convolutions, is introduced to refine high-level semantics and reinforce object completeness. The network integrates these components through a multi-stage iterative refinement process, enabling gradual enhancement of spatial precision and structural fidelity. Extensive experiments conducted on five public SOD benchmarks demonstrate that our approach achieves superior performance compared to state-of-the-art methods, both quantitatively and qualitatively. Cross-dataset evaluations further validate its strong generalization capability, making it highly applicable to real-world visual intelligence scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105670"},"PeriodicalIF":4.2,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144739299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}