{"title":"Progressive Reverse Attention Network for image inpainting detection and localization","authors":"Shuai Liu , Jiyou Chen , Xiangling Ding , Gaobo Yang","doi":"10.1016/j.cviu.2025.104407","DOIUrl":"10.1016/j.cviu.2025.104407","url":null,"abstract":"<div><div>Image inpainting is originally presented to restore damaged image areas, but it might be maliciously used for object removal that change image semantic content. This easily leads to serious public confidence crises. Up to present, image inpainting forensics works have achieved remarkable results, but they usually ignore or fail to capture subtle artifacts near object boundary, resulting in inaccurate object mask localization. To address this issue, we propose a Progressive Reverse Attention Network (PRA-Net) for image inpainting detection and localization. Different from the traditional convolutional neural networks (CNN) structure, PRA-Net follows an encoder and decoder architecture. The encoder leverages features at different scales with dense cross-connections to locate inpainted regions and generates global map with our designed multi-scale extraction module. A reverse attention module is used as the backbone of the decoder to progressively refine the details of predictions. Experimental results show that PRA-Net achieves accurate image inpainting localization and desirable robustness.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104407"},"PeriodicalIF":4.3,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144291075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SASFNet: Soft-edge awareness and spatial-attention feedback deep network for blind image deblurring","authors":"Jing Cheng , Kaibing Zhang , Jiahui Hou , Yuhong Zhang , Guang Shi","doi":"10.1016/j.cviu.2025.104408","DOIUrl":"10.1016/j.cviu.2025.104408","url":null,"abstract":"<div><div>When a camera is used to capture moving objects in natural scenes, the obtained images will be degraded to varying degrees due to camera shaking and object displacement, which is called motion blurring. Moreover, the complexity of natural scenes makes the image motion deblurring more challenging. Now, there are two crucial problems in Deep Learning-based methods for blind motion deblurring: (1) how to restore sharp images with fine textures, and (2) how to improve the generalization of the model. In this paper, we propose Soft-edge Awareness and Spatial-attention Feedback deep Network (SASFNet) to restore sharp images. First, we restore images with fine textures using a soft-edge assist mechanism. This mechanism uses the soft edge extraction network to map the fine edge information from the blurred image to assist the model to restore the high-quality clear image. Second, for the generalization of the model, we propose feedback mechanism with attention. Similar to course learning, feedback mechanism imitates the human learning process, learning from easy to difficult to restore sharp images, which not only refines the restored features, but also brings better generalization. To evaluate the model, we use the GoPro dataset for model training and validity testing, and the Realblur dataset to test the generalization of the model. Experiments show that our proposed SASFNet can not only restore sharp images that are more in line with human perception, but also has good generalization.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104408"},"PeriodicalIF":4.3,"publicationDate":"2025-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144240484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast self-supervised 3D mesh object retrieval for geometric similarity","authors":"Kajal Sanklecha, Prayushi Mathur, P.J. Narayanan","doi":"10.1016/j.cviu.2025.104405","DOIUrl":"10.1016/j.cviu.2025.104405","url":null,"abstract":"<div><div>Digital 3D models play a pivotal role in engineering, entertainment, education, and various domains. However, the search and retrieval of these models have not received adequate attention compared to other digital assets like documents and images. Traditional supervised methods face challenges in scalability due to the impracticality of creating large, labeled collections of 3D objects. In response, this paper introduces a self-supervised approach to generate efficient embeddings for 3D mesh objects, facilitating ranked retrieval of similar objects. The proposed method employs a straightforward representation of mesh objects and utilizes an encoder–decoder architecture to learn the embedding. Extensive experiments demonstrate the competitiveness of our approach compared to supervised methods, showcasing its scalability across diverse object collections. Notably, the method exhibits transferability across datasets, implying its potential for broader applicability beyond the training dataset. The robustness and generalization capabilities of the proposed method are substantiated through experiments conducted on varied datasets. These findings underscore the efficacy of the approach in capturing underlying patterns and features, independent of dataset-specific nuances. This self-supervised framework offers a promising solution for enhancing the search and retrieval of 3D models, addressing key challenges in scalability and dataset transferability.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104405"},"PeriodicalIF":4.3,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144263062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cascading attention enhancement network for RGB-D indoor scene segmentation","authors":"Xu Tang , Songyang Cen , Zhanhao Deng , Zejun Zhang , Yan Meng , Jianxiao Xie , Changbing Tang , Weichuan Zhang , Guanghui Zhao","doi":"10.1016/j.cviu.2025.104411","DOIUrl":"10.1016/j.cviu.2025.104411","url":null,"abstract":"<div><div>Convolutional neural network based Red, Green, Blue, and Depth (RGB-D) image semantic segmentation for indoor scenes has attracted increasing attention, because of its great potentiality of extracting semantic information from RGB-D images. However, the challenge it brings lies in how to effectively fuse features from RGB and depth images within the neural network architecture. The technical approach of feature aggregation has evolved from the early integration of RGB color images and depth images to the current cross-attention fusion, which enables the features of different RGB channels to be fully integrated with ones of the depth image. However, noises and useless feature for segmentation are inevitably propagated between feature layers during the period of feature aggregation, thereby affecting the accuracy of segmentation results. In this paper, for indoor scenes, a cascading attention enhancement network (CAENet) is proposed with the aim of progressively refining the semantic features of RGB and depth images layer by layer, consisting of four modules: a channel enhancement module (CEM), an adaptive aggregation of spatial attention (AASA), an adaptive aggregation of channel attention (AACA), and a triple-path fusion module (TFM). In encoding stage, CEM complements RGB features with depth features at the end of each layer, in order to effectively revise RGB features for the next layer. At the end of encoding stage, AASA module combines low-level and high-level RGB semantic features by their spatial attention, and AACA module fuses low-level and high-level depth semantic features by their channel attention. The combined RGB and depth semantic features are fused into one and fed into the decoding stage, which consists of triple-path fusion modules (TFMs) combining low-level RGB and depth semantic features and decoded high-level semantic features. The TFM outputs multi-scale feature maps that encapsulate both rich semantic information and fine-grained details, thereby augmenting the model’s capacity for accurate per-pixel semantic label prediction. The proposed CAENet achieves mIoU of 52.0% on NYUDv2 and 48.3% on SUNRGB-D datasets, outperforming recent RGB-D segmentation methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104411"},"PeriodicalIF":4.3,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144270155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HCTD: A CNN-transformer hybrid for precise object detection in UAV aerial imagery","authors":"Hongcheng Xue , Zhan Tang , Yuantian Xia , Longhe Wang , Lin Li","doi":"10.1016/j.cviu.2025.104409","DOIUrl":"10.1016/j.cviu.2025.104409","url":null,"abstract":"<div><div>Object detection in UAV imagery poses substantial challenges due to severe object scale variation, dense distributions of small objects, complex backgrounds, and arbitrary orientations. These factors, compounded by high inter-class similarity and large intra-class variation caused by multi-scale targets, occlusion, and environmental interference, make aerial object detection fundamentally different from conventional scenes. Existing methods often struggle to capture global semantic information effectively and tend to overlook critical issues such as feature loss during downsampling, information redundancy, and inconsistency in cross-level feature interactions. To address these limitations, this paper proposes a hybrid CNN-Transformer-based detector, termed HCTD, specifically designed for UAV image analysis. The proposed framework integrates three novel modules: (1) a Feature Filtering Module (FFM) that enhances discriminative responses and suppresses background noise through dual global pooling (max and average) strategies; (2) a Convolutional Additive Self-attention Feature Interaction (CASFI) module that replaces dot-product attention with a lightweight additive fusion of spatial and channel interactions, enabling efficient global context modeling at reduced computational cost; and (3) a Global Context Flow Feature Pyramid Network (GC2FPN) that facilitates multi-scale semantic propagation and alignment to improve small-object detection robustness. Extensive experiments on the VisDrone2019 dataset demonstrate that HCTD-R18 and HCTD-R50 achieve 38.2%/43.7% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mn>50</mn></mrow></msub></math></span>, 23.1%/24.6% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mn>75</mn></mrow></msub></math></span>, and 13.9%/14.7% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mi>S</mi></mrow></msub></math></span> respectively. Additionally, the TIDE toolkit is employed to analyze the absolute and relative contributions of six error types, providing deeper insight into the effectiveness of each module and offering valuable guidance for future improvements. The code is available at: <span><span>https://github.com/Mundane-X/HCTD</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104409"},"PeriodicalIF":4.3,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144263064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CSGN:CLIP-driven semantic guidance network for Clothes-Changing Person Re-Identification","authors":"Yang Lu , Bin Ge , Chenxing Xia , Junming Guan","doi":"10.1016/j.cviu.2025.104406","DOIUrl":"10.1016/j.cviu.2025.104406","url":null,"abstract":"<div><div>Clothes-Changing Person Re-identification (CCReID) aims to match identities across images of individuals in different attires. Due to the significant appearance variations caused by clothing changes, distinguishing the same identity becomes challenging, while the differences between distinct individuals are often subtle. To address this, we reduce the impact of clothing information on identity judgment by introducing linguistic modalities. Considering CLIP’s (Contrastive Language-Image Pre-training) ability to align high-level semantic information with visual features, we propose a CLIP-driven Semantic Guidance Network (CSGN), which consists of a Multi-Description Generator (MDG), a Visual Semantic Steering module (VSS), and a Heterogeneous Semantic Fusion loss (HSF). Specifically, to mitigate the color sensitivity of CLIP’s text encoder, we design the MDG to generate pseudo-text in both RGB and grayscale modalities, incorporating a combined loss function for text-image mutuality. This helps reduce the encoder’s bias towards color. Additionally, to improve the CLIP visual encoder’s ability to extract identity-independent features, we construct the VSS, which combines ResNet and ViT feature extractors to enhance visual feature extraction. Finally, recognizing the complementary nature of semantics in heterogeneous descriptions, we use HSF, which constrains visual features by focusing not only on pseudo-text derived from RGB but also on pseudo-text derived from grayscale, thereby mitigating the influence of clothing information. Experimental results show that our method outperforms existing state-of-the-art approaches.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104406"},"PeriodicalIF":4.3,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144253992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continuous conditional video synthesis by neural processes","authors":"Xi Ye, Guillaume-Alexandre Bilodeau","doi":"10.1016/j.cviu.2025.104387","DOIUrl":"10.1016/j.cviu.2025.104387","url":null,"abstract":"<div><div>Different conditional video synthesis tasks, such as frame interpolation and future frame prediction, are typically addressed individually by task-specific models, despite their shared underlying characteristics. Additionally, most conditional video synthesis models are limited to discrete frame generation at specific integer time steps. This paper presents a unified model that tackles both challenges simultaneously. We demonstrate that conditional video synthesis can be formulated as a neural process, where input spatio-temporal coordinates are mapped to target pixel values by conditioning on context spatio-temporal coordinates and pixel values. Our approach leverages a Transformer-based non-autoregressive conditional video synthesis model that takes the implicit neural representation of coordinates and context pixel features as input. Our task-specific models outperform previous methods for future frame prediction and frame interpolation across multiple datasets. Importantly, our model enables temporal continuous video synthesis at arbitrary high frame rates, outperforming the previous state-of-the-art. The source code and video demos for our model are available at <span><span>https://npvp.github.io</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104387"},"PeriodicalIF":4.3,"publicationDate":"2025-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient human–object-interaction (EHOI) detection via interaction label coding and Conditional Decision","authors":"Tsung-Shan Yang , Yun-Cheng Wang , Chengwei Wei , Suya You , C.-C. Jay Kuo","doi":"10.1016/j.cviu.2025.104390","DOIUrl":"10.1016/j.cviu.2025.104390","url":null,"abstract":"<div><div>Human–Object Interaction (HOI) detection is a fundamental task in image understanding. While deep-learning-based HOI methods provide high performance in terms of mean Average Precision (mAP), they are computationally expensive and opaque in training and inference processes. An Efficient HOI (EHOI) detector is proposed in this work to strike a good balance between detection performance, inference complexity, and mathematical transparency. EHOI is a two-stage method. In the first stage, it leverages a frozen object detector to localize the objects and extract various features as intermediate outputs. In the second stage, the first-stage outputs predict the interaction type using the XGBoost classifier. Our contributions include the application of error correction codes (ECCs) to encode rare interaction cases, which reduces the model size and the complexity of the XGBoost classifier in the second stage. Additionally, we provide a mathematical formulation of the relabeling and decision-making process. Apart from the architecture, we present qualitative results to explain the functionalities of the feedforward modules. Experimental results demonstrate the advantages of ECC-coded interaction labels and the excellent balance of detection performance and complexity of the proposed EHOI method. The codes are available: <span><span>https://github.com/keevin60907/EHOI---Efficient-Human-Object-Interaction-Detector</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104390"},"PeriodicalIF":4.3,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144115541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HVQ-VAE: Variational auto-encoder with hyperbolic vector quantization","authors":"Shangyu Chen , Pengfei Fang , Mehrtash Harandi , Trung Le , Jianfei Cai , Dinh Phung","doi":"10.1016/j.cviu.2025.104392","DOIUrl":"10.1016/j.cviu.2025.104392","url":null,"abstract":"<div><div>Vector quantized-variational autoencoder (VQ-VAE) and its variants have made significant progress in creating discrete latent space via learning a codebook. Previous works on VQ-VAE have focused on discrete latent spaces in Euclidean or in spherical spaces. This paper studies the geometric prior of hyperbolic spaces as a way to improve the learning capacity of VQ-VAE. That being said, working with the VQ-VAE in the hyperbolic space is not without difficulties, and the benefits of using hyperbolic space as the geometric prior for the latent space have never been studied in VQ-VAE. We bridge this gap by developing the VQ-VAE with hyperbolic vector quantization. To this end, we propose the hyperbolic VQ-VAE (HVQ-VAE), which learns the latent embedding of data and the codebook in the hyperbolic space. Specifically, we endow the discrete latent space in the Poincaré ball, such that the clustering algorithm can be formulated and optimized in the Poincaré ball. Thorough experiments against various baselines are conducted to evaluate the superiority of the proposed HVQ-VAE empirically. We show that HVQ-VAE enjoys better image reconstruction, effective codebook usage, and fast convergence than baselines. We also present evidence that HVQ-VAE outperforms VQ-VAE in low-dimensional latent space.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104392"},"PeriodicalIF":4.3,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144134661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uniss-MDF: A Multidimensional Face dataset for assessing face analysis on the move","authors":"Pietro Ruiu , Marinella Iole Cadoni , Andrea Lagorio , Seth Nixon , Filippo Casu , Massimo Farina , Mauro Fadda , Giuseppe A. Trunfio , Massimo Tistarelli , Enrico Grosso","doi":"10.1016/j.cviu.2025.104384","DOIUrl":"10.1016/j.cviu.2025.104384","url":null,"abstract":"<div><div>Multidimensional 2D–3D face analysis has demonstrated a strong potential for human identification in several application domains. The combined, synergic use of 2D and 3D data from human faces can counteract typical limitations in 2D face recognition, while improving both accuracy and robustness in identification. On the other hand, current mobile devices, often equipped with depth cameras and high performance computing resources, offer a powerful and practical tool to better investigate new models to jointly process real 2D and 3D face data. However, recent concerns related to privacy of individuals and the collection, storage and processing of personally identifiable biometric information have diminished the availability of public face recognition datasets.</div><div>Uniss-MDF (Uniss-MultiDimensional Face) represents the first collection of combined 2D–3D data of human faces captured with a mobile device. Over 76,000 depth images and videos are captured from over 100 subjects, in both controlled and uncontrolled conditions, over two sessions. The features of Uniss-MDF are extensively compared with existing 2D–3D face datasets. The reported statistics underscore the value of the dataset as a versatile resource for researchers in face recognition on the move and for a wide range of applications. Notably, it is the sole 2D–3D facial dataset using data from a mobile device that includes both 2D and 3D synchronized sequences acquired in controlled and uncontrolled conditions. The Uniss-MDF dataset and the proposed experimental protocols with baseline results provide a new platform to compare processing models for novel research avenues in advanced face analysis on the move.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104384"},"PeriodicalIF":4.3,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}