{"title":"Edge-guided semantic-aware network for camouflaged object detection with PVTv2","authors":"Hongbo Bi, Jianing Yu, Disen Mo, Shiyuan Li, Cong Zhang","doi":"10.1016/j.imavis.2025.105720","DOIUrl":"10.1016/j.imavis.2025.105720","url":null,"abstract":"<div><div>Camouflaged object detection (COD) attempts to identify and segment objects visually blended into their surroundings, presenting significant challenges in complex real-world scenarios. Despite growing attention, existing COD methods often yield unsatisfactory performance, primarily due to their inadequate integration of edge information and semantic context—a critical shortcoming when handling intricate scenes. To this end, we propose a novel Edge-guided Semantic-aware Network (ESNet) that explicitly leverages the synergy between edge cues and multi-scale semantics. Our framework incorporates two key components: a Context-Aware Aggregation with Edge Guidance (CAEG) module, which utilizes edge information to refine object boundaries and enhance feature representation across scales, and a Cross-layer Semantic-Refinement Fusion (CSF) module, designed to aggregate and reinforce multi-level semantic context for richer feature characterization. Numerous experiments on three challenging benchmark datasets demonstrate that the proposed ESNet outperforms 17 state-of-the-art algorithms, achieving new standards in detection accuracy and robustness.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105720"},"PeriodicalIF":4.2,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144988351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deepfake detection using optimized VGG16-based framework enhanced with LIME for secure digital content","authors":"Asma Aldrees , Nihal Abuzinadah , Muhammad Umer , Dina Abdulaziz AlHammadi , Shtwai Alsubai , Raed Alharthi","doi":"10.1016/j.imavis.2025.105696","DOIUrl":"10.1016/j.imavis.2025.105696","url":null,"abstract":"<div><div>The rapid evolution of technologies to manipulate facial images, namely Generative Adversarial Networks (GANs) and those based on Stable Diffusion, has increased the need for effective deepfake detection mechanisms to mitigate their misuse. In this paper, the critical challenge of detecting deepfake images is addressed through a new deep learning-based approach that uses the VGG16 model after applying all necessary preprocessing steps. The VGG16 architecture was chosen for its deep structure and strong ability to capture intricate facial patterns when classifying facial images as real or manipulated. A robust preprocessing pipeline — including normalization, augmentation, facial alignment, and noise reduction — was implemented to optimize input data, improving the detection of subtle manipulations. Additionally, Explainable AI (XAI) techniques, such as the Local Interpretable Model-agnostic Explanations (LIME) framework, were integrated to provide transparent, visual explanations of the model’s predictions, enhancing interpretability and user trust. To further assess generalizability, the evaluation was extended beyond the initial dataset by incorporating three additional benchmark datasets: FaceForensics++, Celeb-DF (v2), and the DFDC Preview Set. These datasets contain a range of manipulation techniques, allowing for comprehensive testing of the model’s robustness across different scenarios. The proposed method outperformed baselines with exceptional performance metrics (accuracy, precision, recall, and F1-score up to 0.99), and maintained strong results across different datasets. These findings demonstrate that combining XAI approaches with a VGG16 model and thorough preprocessing effectively counters advanced deepfake generation techniques, such as StyleGAN2. This research contributes to a safer digital landscape by improving the detection and understanding of manipulated content, providing a practical way to confront the growing threat of deepfakes.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105696"},"PeriodicalIF":4.2,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144925811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TPWGAN: Wavelet-aware text prior guided super-resolution for scene text images","authors":"Shengkai Liu , Jun Miao , Yuanhua Qiao , Hainan Wang","doi":"10.1016/j.imavis.2025.105707","DOIUrl":"10.1016/j.imavis.2025.105707","url":null,"abstract":"<div><div>Scene text image super-resolution (STISR) is crucial for improving the readability and recognition accuracy of low-resolution text images. Many previous methods have incorporated text prior information, such as character sequences or recognition features, into super-resolution frameworks. However, existing methods struggle to recover fine-grained text structures, often introducing artifacts or blurry edges due to insufficient high-frequency (HF) modeling and suboptimal use of text priors. Although some recent approaches incorporate wavelet-domain losses into the generator, they typically retain RGB-domain losses during adversarial training, limiting their ability to distinguish authentic text details from artifacts. To address this, we propose TPWGAN, a GAN-based STISR framework that introduces wavelet-domain losses in both the generator and discriminator. The generator is trained with fidelity losses on the HF wavelet subbands to enhance sensitivity to stroke-level variations, while the discriminator processes HF wavelet subbands fused with binary text region masks via a spatial attention mechanism, enabling semantically guided frequency-aware discrimination. Experiments on the TextZoom dataset and several real-world benchmarks show that TPWGAN achieves consistent improvements in visual quality and text recognition, particularly for challenging text instances with distortions or low resolution.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105707"},"PeriodicalIF":4.2,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144925810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LDH-Net: Luminance-based Deep Hybrid Network for Document Image De-shadowing","authors":"Fan Yang , Kunchi Li , Nanfeng Jiang, Yun Wu, Ziyu Li, Da-Han Wang","doi":"10.1016/j.imavis.2025.105705","DOIUrl":"10.1016/j.imavis.2025.105705","url":null,"abstract":"<div><div>Existing deep learning-based Document Image De-shadowing (DID) methods face two key challenges. First, they struggle with complex shadows due to insufficient use of auxiliary information, such as shadow locations and illumination details. Second, they fail to effectively balance global relationships across the entire image with local feature learning to restore texture details in shadowed regions. To address these limitations, we propose a dual-branch de-shadowing network, called LDH-Net, which integrates luminance information as an auxiliary information for de-shadowing. The first branch extracts shadow-distorted features by estimating a shadow luminance map, while the second branch uses them to locate shadow regions and guide the de-shadowing. Both branches employ a hybrid feature learning mechanism to capture local and global information efficiently with lower complexity. This mechanism includes two key modules: Horizon-Vertical Attention (HVA) and Dilated Convolution Mamba (DCM). HVA models long-range pixel dependencies to propagate contextual information across the entire image to ensure global coherence and consistency. DCM utilizes dilated convolution within the State Space Model (SSM) to capture extensive contextual information and preserve local image details. Additionally, we introduce a luminance map loss to provide accurate optimization for reconstruction. Experiments on RDD, Kligler’s, Jung’s, and OSR demonstrate that LDH-Net outperforms previous state-of-the-art methods. Specifically, LDH-Net achieves the best PSNR/SSIM/LPIPS scores across all datasets, with up to 37.76 PSNR/0.981 SSIM/0.005 LPIPS on RDD datasets and consistent improvements on other benchmarks, confirming its superior performance on both visual quality and structural preservation.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105705"},"PeriodicalIF":4.2,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144913059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating blood pressure using video-based PPG and deep learning","authors":"Gianluca Zaza, Gabriella Casalino, Sergio Caputo, Giovanna Castellano","doi":"10.1016/j.imavis.2025.105683","DOIUrl":"10.1016/j.imavis.2025.105683","url":null,"abstract":"<div><div>This paper introduces a novel pipeline for estimating systolic and diastolic blood pressure using remote photoplethysmographic (rPPG) signals derived from video recordings of subjects’ faces. The pipeline consists of three main stages: rPPG signal extraction, denoising to transform the rPPG signal into a PPG-like waveform, and blood pressure estimation. This approach directly addresses the current lack of datasets that simultaneously include video, rPPG, and blood pressure data. To overcome this, the proposed pipeline leverages the extensive availability of PPG-based blood pressure estimation techniques, in combination with state-of-the-art algorithms for rPPG extraction, enabling the generation of reliable PPG-like signals from video input.</div><div>To validate the pipeline, we conducted comparative analyses with state-of-the-art methods at each stage and collected a dedicated dataset through controlled laboratory experimentation. The results demonstrate that the proposed solution effectively captures blood pressure information, achieving a mean error of 9.2 ± 11.3 mmHg for systolic and 8.6 ± 9.1 mmHg for diastolic blood pressure. Moreover, the denoised rPPG signals show a strong correlation with conventional PPG signals, supporting the reliability of the transformation process. This non-invasive and contactless method offers considerable potential for long-term blood pressure monitoring, particularly in Ambient Assisted Living (AAL) systems, where unobtrusive and continuous health monitoring is essential.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105683"},"PeriodicalIF":4.2,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144913061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"One-step diffusion for real-world image super-resolution via degradation removal and text prompts","authors":"Yaohui Guo , Luanyuan Dai , Xinwei Gan , Yuting Huang , Miaohua Ruan , Detian Huang","doi":"10.1016/j.imavis.2025.105699","DOIUrl":"10.1016/j.imavis.2025.105699","url":null,"abstract":"<div><div>Pre-trained Text-to-Image (T2I) diffusion models have shown remarkable progress in Real-world Image Super-Resolution (Real-ISR) by leveraging powerful latent space priors. However, these models typically require tens or even hundreds of diffusion steps for high-quality reconstruction, posing two critical challenges: (1) excessive computational overhead, hindering practical deployment; and (2) inherent stochasticity, leading to output uncertainty. To overcome these limitations, we propose a One-Step Diffusion framework for Real-ISR via Degradation Removal and Text Prompts (OSD-DRTP). Specifically, the proposed OSD-DRTP comprises two principal components: (1) a Degradation Removal Module (DRM), which eliminates complex real-world image degradations to restore fidelity; and (2) a Detail Enhancement Module (DEM), which integrates a fine-tuned diffusion model with text prompts from a large language model to enhance perceptual quality. In addition, we introduce Variational Score Distillation (VSD) in the latent space to ensure high-fidelity reconstruction across diverse degradation patterns. To further exploit the latent capacity of the VAE decoder, we employ a hybrid loss combining mean squared error (MSE) and perceptual loss (LPIPS), enabling accurate texture restoration without auxiliary modules. Extensive experiments demonstrate that the proposed OSD-DRTP outperforms state-of-the-art methods in both perceptual quality and computational efficiency.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105699"},"PeriodicalIF":4.2,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144913060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synergistic-aware cascaded association and trajectory refinement for multi-object tracking","authors":"Hui Li, Su Qin, Saiyu Li, Ying Gao, Yanli Wu","doi":"10.1016/j.imavis.2025.105695","DOIUrl":"10.1016/j.imavis.2025.105695","url":null,"abstract":"<div><div>Multi-object tracking (MOT) is a pivotal research area in computer vision. Effectively tracking objects in scenarios with frequent occlusions and crowded scenes has become a key challenge in MOT tasks. Existing tracking-by-detection (TbD) methods often rely on simple two-frame association techniques. However, in situations involving scale transformation or requiring long-term association, frequent occlusion between objects can lead to ID switches, especially in scenes with dense or highly intersecting objects. Therefore, we propose a synergistic-aware cascaded association and trajectory refinement method (SCTrack) for multi-object tracking. In the data association stage, we propose a synergistic-aware cascaded association method to construct a multi-perception affinity matrix for object association, and introduce the multi-frame collaborative distance calculation to enhance the robustness. To address the problem of trajectory fragmentation, we propose a dynamic confidence-driven trajectory refinement post-processing method. This method integrates confidence and feature information to calculate trajectory association, repair fragmented trajectories, and improve the overall robustness of the tracking algorithm. Extensive experiments on the MOT17, MOT20, and DanceTrack datasets validate SCTrack’s competitive performance.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105695"},"PeriodicalIF":4.2,"publicationDate":"2025-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144906956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SAMUNet: Enhancing pillar-based 3D object detection in autonomous driving with Shape-aware Mini-Unet","authors":"Liping Zhu, Xuan Li, Bohui Li, Chengyang Li, Bingyao Wang, XianXiang Chang","doi":"10.1016/j.imavis.2025.105703","DOIUrl":"10.1016/j.imavis.2025.105703","url":null,"abstract":"<div><div>Pillar-based 3D object detection methods outperform traditional point-based and voxel-based methods in terms of speed. However, existing methods struggle with accurately detecting large objects in complex environments due to the limitations in capturing global spatial dependencies. To address these issues, this paper proposes Shape-aware Mini-Unet Network (SAMUNet), a simple yet effective hierarchical 3D object detection network. SAMUNet incorporates multiple Sparse Mini-Unet blocks and a Shape-aware Center Head. Concretely, after converting the original point cloud into pillars, we first progressively reduce the spatial distance between distant features through downsampling in the Sparse Mini-Unet block. Then, we recover lost details through multi-scale feature fusion, enhancing the model’s ability to detect various objects. Unlike other methods, the upsampling operation in the Sparse Mini-Unet block only processes the effective feature coverage area of the intermediate feature map, significantly reducing computational costs. Finally, to further improve the accuracy of bounding box regression, we introduce Shape-aware Center Head, which models the geometric information of the bounding box’s offset direction and scale using 3D Shape-aware IoU. Extensive experiments on the nuScenes and Waymo datasets demonstrate that SAMUNet excels in detecting large objects and overall outperforms current state-of-the-art detectors, achieving 72.0% NDS and 67.7% mAP.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105703"},"PeriodicalIF":4.2,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144932161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DoseNet: Dose-adaptive prediction of the parotid glands deformation for radiotherapy planning","authors":"Bohan Yang , Yong Luo , Bo Du , Dongjing Shan , Chuan Cheng , Gang Liu , Jun Zhang , Jingnan Liu","doi":"10.1016/j.imavis.2025.105701","DOIUrl":"10.1016/j.imavis.2025.105701","url":null,"abstract":"<div><div>Parotid glands (PGs) toxicity caused by radiation-induced anatomy deformation occurs among a significant amount of patients with nasopharyngeal carcinoma treated with radiotherapy. Early prediction of PGs deformation is critical, as it can facilitate the design of treatment plans to reduce radiation-induced anatomical change in an adaptive radiotherapy workflow. Previous studies used CT images to model anatomical variation in radiotherapy. However, they did not consider the radiation dose received by the PGs which is correlated to the PGs volumetric change and can influence the anatomical variation. To address this issue, we propose DoseNet, a dose-adaptive PGs deformation prediction deep neural network, which utilizes the radiation dose and CT images to generate different anatomy predictions accommodating to the changing dose. Specifically, we use parted dose input and multi-scale cross attention to reinforce the integration of PGs anatomy and the dose received by PGs, and present a novel data augmentation method to remedy the shortcoming of the skewed data distribution of the radiation dose. Besides, to help design improved treatment plans, a novel metric termed dose volume variation (DVV) curve is developed to visualize the predicted volumetric change in respect to the dose variation of the PGs. We verify the effectiveness of our method on a dataset collected from a collaborative hospital. The experiment results show the proposed DoseNet outperforms the state-of-the-arts on the dataset and attains a Dice coefficient of 82.2% and a relative volume difference of 12.2%. The code is available at <span><span>https://github.com/mkdermo/DoseNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105701"},"PeriodicalIF":4.2,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144896515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A review of breast cancer histopathology image analysis with deep learning: Challenges, innovations, and clinical integration","authors":"Inayatul Haq , Zheng Gong , Haomin Liang , Wei Zhang , Rashid Khan , Lei Gu , Roland Eils , Yan Kang , Bingding Huang","doi":"10.1016/j.imavis.2025.105708","DOIUrl":"10.1016/j.imavis.2025.105708","url":null,"abstract":"<div><div>Breast cancer (BC) is the most frequently diagnosed cancer among women and a leading cause of cancer-related mortality globally. Accurate and timely diagnosis is essential for improving patient outcomes. However, traditional histopathological assessments are labor-intensive and subjective, leading to inter-observer variability and diagnostic inconsistencies, especially in resource-limited settings. Furthermore, variability in tissue staining, limited availability of standardized annotated datasets, and subtle morphological patterns complicate the consistent characterization of tumors. Deep learning (DL) has recently emerged as a transformative technology in breast cancer pathology, providing automated and objective solutions for cancer detection, classification, and segmentation from histopathological images. This review systematically evaluates advanced deep learning (DL) architectures, including convolutional neural networks (CNNs), generative adversarial networks (GANs), autoencoders, deep belief networks (DBNs), extreme learning machines (ELMs), and transformer-based models such as Vision Transformers (ViTs) as well as transfer learning, attention-based explainable AI techniques, and multimodal integration to address these diagnostic challenges. Analyzing 199 references, including 182 peer-reviewed studies published between 2014 and 2025 and 17 reputable online sources (websites, databases, etc.), we identify key innovations, limitations, and opportunities for future research. Furthermore, we explore the critical roles of synthetic data augmentation, explainable AI (XAI), and multimodal integration to enhance clinical trust, model interpretability, and diagnostic precision, ultimately facilitating personalized and efficient patient care.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105708"},"PeriodicalIF":4.2,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144892839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}