Title: STDepth: Leveraging semantic-textural information in transformers for self-supervised monocular depth estimation
Authors: Xuanang Gao, Bingchao Wang, Zhiwei Ning, Jie Yang, Wei Liu
Computer Vision and Image Understanding, vol. 259, Article 104422. Published 2025-06-18. DOI: 10.1016/j.cviu.2025.104422
Abstract: Self-supervised monocular depth estimation, which relies solely on monocular or stereo video for supervision, plays an important role in computer vision. The encoder backbone generates features at various stages, and each stage exhibits distinct properties. However, conventional methods fail to take full advantage of these distinctions and apply the same processing to features from every stage, lacking the adaptability required to aggregate the unique information each stage carries. In this work, we replace convolutional neural networks (CNNs) with a Transformer as the encoder backbone to enhance the model's ability to encode long-range spatial dependencies. Furthermore, we introduce a semantic-textural decoder (STDec) to emphasize local critical regions and process intricate details more effectively. The STDec incorporates two principal modules: (1) the global feature recalibration (GFR) module, which analyzes the scene structure using high-level features and recalibrates features along the spatial dimension with semantic information, and (2) the detail focus (DF) module, which is applied to low-level features to capture texture details precisely. Additionally, we propose a multi-arbitrary-scale reconstruction loss (MAS Loss) to fully exploit the depth estimation network's capabilities. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the KITTI dataset, and our models generalize remarkably well to the Make3D and NYUv2 datasets. The code is publicly available at: https://github.com/xagao/STDepth.
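The GFR module's internals are not given in the abstract; purely as an illustration of the general idea of spatially recalibrating low-level features with high-level semantics, a minimal sketch (PyTorch; the class name, shapes, and gating scheme are assumptions, not the authors' design) could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialRecalibration(nn.Module):
    """Generic spatial recalibration: high-level semantics gate low-level features."""
    def __init__(self, high_ch, low_ch):
        super().__init__()
        # 1x1 conv squeezes high-level features into a single-channel attention map
        self.to_attn = nn.Conv2d(high_ch, 1, kernel_size=1)

    def forward(self, low_feat, high_feat):
        # Upsample high-level features to the low-level spatial resolution
        high_up = F.interpolate(high_feat, size=low_feat.shape[-2:],
                                mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.to_attn(high_up))   # (B, 1, H, W) in [0, 1]
        return low_feat * attn + low_feat             # residual recalibration

# Toy example: recalibrate 64-channel low-level features with 256-channel semantics
low = torch.randn(2, 64, 64, 208)
high = torch.randn(2, 256, 16, 52)
print(SpatialRecalibration(256, 64)(low, high).shape)  # torch.Size([2, 64, 64, 208])
```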
Title: Deep semantic segmentation for drivable area detection on unstructured roads
Authors: Xiangjun Mo, Yonghui Feng, Yihe Liu
Computer Vision and Image Understanding, vol. 259, Article 104420. Published 2025-06-17. DOI: 10.1016/j.cviu.2025.104420
Abstract: Drivable area detection on unstructured roads is crucial for autonomous driving, as it provides path planning constraints for end-to-end models and enhances driving safety. This paper proposes a deep learning approach for drivable area detection on unstructured roads using semantic segmentation. The approach builds on the DeepLabv3+ network and inserts a Unit Attention Module after the Atrous Spatial Pyramid Pooling module in the encoder. The Unit Attention Module combines a dual attention module and a spatial attention module, adaptively weighting semantic information in key channels and spatial locations and thereby improving the overall segmentation accuracy of drivable areas on unstructured roads. Evaluations on the India Driving Dataset demonstrate that the proposed network consistently surpasses most comparative methods, achieving a mean IoU of 85.99% and a mean pixel accuracy of 92.01%.
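The paper's Unit Attention Module is not specified beyond the abstract; as a rough illustration of combining channel and spatial attention after an encoder stage, here is a minimal CBAM-style block (PyTorch; all names and details below are assumptions, not the proposed module):

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style block: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from global average- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-wise mean and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

# e.g. attach after an ASPP output with 256 channels: ChannelSpatialAttention(256)
```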
Title: DiffMatter: Different frequency fusion for trimap-free image matting via edge detection
Authors: Anming Sun, Junjie Chang, Guilin Yao
Computer Vision and Image Understanding, vol. 259, Article 104424. Published 2025-06-16. DOI: 10.1016/j.cviu.2025.104424
Abstract: Image matting extracts the foreground from a target image by predicting the alpha transparency of the foreground. Existing methods rely on constraints such as trimaps to distinguish the foreground from the background; while this improves accuracy, it inevitably incurs significant cost. This paper proposes a trimap-free automatic matting method that highlights the foreground area through edge detection. To address the domain adaptation issues of edge information and the fine-grained features required by the matting task, we design a plug-and-play Different Frequency Fusion module, following the paradigm of characteristic enhancement, feature fusion, and information integration, to effectively combine high-frequency components with their low-frequency counterparts, and we propose a matting model, DiffMatter. Specifically, we design texture highlighting and semantic enhancement modules for high-frequency and low-frequency information in the characteristic enhancement phase; for feature fusion, we employ cross-fusion operations; and in the information integration phase, we integrate information across spatial and channel dimensions. Additionally, to compensate for the Transformer's weakness in capturing local information, we construct an attention embedding module and propose a cross-aware module, which exploit channel and spatial information, respectively, to enhance representational capability. Experimental results on the Composition-1k, Distinctions-646, and real-world AIM-500 datasets demonstrate that our model outperforms competing methods, achieving a balance between performance and computational efficiency. Furthermore, our Different Frequency Fusion module enhances several state-of-the-art matting models. The code will be publicly released.
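A minimal sketch of the general split-enhance-fuse idea behind different-frequency fusion, assuming a simple blur-based low-/high-frequency decomposition (PyTorch; a generic illustration, not the DiffMatter module itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencySplitFusion(nn.Module):
    """Split features into low/high-frequency parts, enhance each, then fuse."""
    def __init__(self, channels):
        super().__init__()
        self.enhance_low = nn.Conv2d(channels, channels, 3, padding=1)   # semantic branch
        self.enhance_high = nn.Conv2d(channels, channels, 3, padding=1)  # texture branch
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        # Low-pass by blurring (average pooling + upsampling); high-pass is the residual
        low = F.interpolate(F.avg_pool2d(x, 4), size=x.shape[-2:],
                            mode="bilinear", align_corners=False)
        high = x - low
        low, high = self.enhance_low(low), self.enhance_high(high)
        return self.fuse(torch.cat([low, high], dim=1))
```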
Title: MSCA: A few-shot segmentation framework driven by multi-scale cross-attention and information extraction
Authors: Zhihao Ren, Shengning Lu, Xinhua Wang, Yaoming Liu, Yong Liang
Computer Vision and Image Understanding, vol. 259, Article 104419. Published 2025-06-10. DOI: 10.1016/j.cviu.2025.104419
Abstract: Few-Shot Semantic Segmentation (FSS) aims to achieve precise pixel-level segmentation of target objects in query images using only a small number of annotated support images. The main challenge lies in effectively capturing and transferring critical information from support samples while establishing fine-grained semantic associations between query and support images to improve segmentation accuracy. However, existing methods struggle with spatial alignment issues caused by intra-class variations and inter-class visual similarities, and they fail to fully integrate high-level and low-level decoder features. To address these limitations, we propose a novel framework based on cross-scale interactive attention. The framework employs a hybrid mask-guided multi-scale feature fusion strategy, constructing a cross-scale attention network that spans from local details to global context; it dynamically enhances target region representation and alleviates spatial misalignment. Furthermore, we design a hierarchical multi-axis decoding architecture that progressively integrates multi-resolution feature pathways, enabling the model to focus on semantic associations within foreground regions. Our Multi-Scale Cross-Attention (MSCA) model achieves highly competitive results on the PASCAL-5i and COCO-20i benchmarks while containing only 1.86 million learnable parameters, demonstrating its efficiency and practical applicability.
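For context, here is a minimal mask-guided cross-attention step between query and support features, as commonly used in FSS pipelines (PyTorch; a generic sketch under assumed tensor shapes, not the MSCA implementation):

```python
import torch

def cross_attention(query_feat, support_feat, support_mask):
    """Scaled dot-product cross-attention: query pixels attend to masked support pixels.

    query_feat:   (B, C, Hq, Wq) query-image features
    support_feat: (B, C, Hs, Ws) support-image features
    support_mask: (B, 1, Hs, Ws) binary foreground mask for the support image
    """
    B, C, Hq, Wq = query_feat.shape
    q = query_feat.flatten(2).transpose(1, 2)                     # (B, Hq*Wq, C)
    k = (support_feat * support_mask).flatten(2).transpose(1, 2)  # (B, Hs*Ws, C)
    attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
    out = attn @ k                                                # aggregated support cues
    return out.transpose(1, 2).reshape(B, C, Hq, Wq)
```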
Title: Exploring black-box adversarial attacks on Interpretable Deep Learning Systems
Authors: Yike Zhan, Baolin Zheng, Dongxin Liu, Boren Deng, Xu Yang
Computer Vision and Image Understanding, vol. 259, Article 104423. Published 2025-06-10. DOI: 10.1016/j.cviu.2025.104423
Abstract: Recent studies have empirically demonstrated that neural network interpretability is susceptible to malicious manipulation. However, existing attacks on Interpretable Deep Learning Systems (IDLSes) predominantly target the white-box setting, which is impractical for real-world applications. In this paper, we present the first attempt to attack IDLSes in the more challenging and realistic black-box setting. We introduce a novel framework called Dual Black-box Adversarial Attack (DBAA), which generates adversarial examples that are misclassified as the target class while maintaining interpretations similar to their benign counterparts. In our method, adversarial examples are generated via black-box adversarial attacks and then refined using ADV-Plugin, a novel approach proposed in this paper that employs single-pixel perturbation and an adaptive step-size algorithm to enhance explanation similarity with benign samples while preserving adversarial properties. We conduct extensive experiments on multiple datasets (CIFAR-10, ImageNet, and Caltech-101) and various combinations of classifiers and interpreters, comparing our approach against five baseline methods. Empirical results indicate that DBAA is comparable to regular adversarial attacks in compromising classifiers and significantly enhances interpretability deception. Specifically, DBAA achieves Intersection over Union (IoU) scores exceeding 0.5 across all interpreters, approximately doubling the performance of regular attacks, while reducing the average ℓ2 distance between its attribution maps and those of benign samples by about 50%.
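The paper's exact evaluation protocol is not given in the abstract; as a generic sketch of the two reported metrics, the IoU of binarized attribution maps and their ℓ2 distance can be computed as follows (NumPy; the 80th-percentile binarization threshold is an assumption):

```python
import numpy as np

def attribution_iou(map_a, map_b, q=0.8):
    """IoU of the top-(1-q) salient regions of two attribution maps."""
    a = map_a >= np.quantile(map_a, q)
    b = map_b >= np.quantile(map_b, q)
    inter, union = np.logical_and(a, b).sum(), np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def attribution_l2(map_a, map_b):
    """Plain L2 distance between two (flattened) attribution maps."""
    return float(np.linalg.norm(map_a - map_b))

# Toy example with random 224x224 attribution maps
rng = np.random.default_rng(0)
m1, m2 = rng.random((224, 224)), rng.random((224, 224))
print(attribution_iou(m1, m2), attribution_l2(m1, m2))
```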
Title: Enhancing vision–language contrastive representation learning using domain knowledge
Authors: Xiaoyang Wei, Camille Kurtz, Florence Cloppet
Computer Vision and Image Understanding, vol. 259, Article 104403. Published 2025-06-10. DOI: 10.1016/j.cviu.2025.104403
Abstract: Visual representation learning plays a key role in solving medical computer vision tasks. Recent advances in the literature often rely on vision–language models that learn representations of medical images from the supervision of paired captions in a label-free manner. Training such models is, however, very data- and time-intensive, and the alignment strategies used in the contrastive loss functions may not capture the full richness of information carried by inter-data relationships. We assume that incorporating expert knowledge from the medical domain during model optimization can address these problems. To this end, we propose a novel knowledge-augmented vision–language contrastive representation learning framework consisting of the following steps: (1) modeling the hierarchical relationships between medical concepts using expert knowledge and the medical images in a dataset through a knowledge graph, then translating each node into a knowledge embedding; and (2) integrating the knowledge embeddings into a vision–language contrastive learning framework, either by introducing an additional alignment loss between visual and knowledge embeddings or by relaxing the binary constraints of vision–language alignment using the knowledge embeddings. Our results demonstrate that the proposed solution achieves competitive performance against state-of-the-art approaches on downstream tasks while requiring significantly less training data. Our code is available at https://github.com/Wxy-24/KL-CVR.
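A minimal sketch of the two ingredients described above: a CLIP-style contrastive loss plus an extra visual-knowledge alignment term (PyTorch; the weighting factor lambda_k and the cosine form of the alignment loss are assumptions, not the paper's exact losses):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def knowledge_alignment_loss(img_emb, know_emb):
    """Extra alignment term pulling visual embeddings toward knowledge-graph embeddings."""
    return 1.0 - F.cosine_similarity(img_emb, know_emb, dim=-1).mean()

# Hypothetical combined objective: total = clip_style_loss(v, t) + lambda_k * knowledge_alignment_loss(v, k)
```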
Title: Maximum redundancy pruning for network compression
Authors: Chang Gao, Jiaqi Wang, Liping Jing
Computer Vision and Image Understanding, vol. 259, Article 104404. Published 2025-06-10. DOI: 10.1016/j.cviu.2025.104404
Abstract: Filter pruning has become one of the most powerful methods for model compression in recent years. However, existing pruning methods often rely on predefined layer-wise pruning ratios or computationally expensive search processes, leading to suboptimal architectures and high computational overhead. To address these limitations, we propose a novel pruning method, termed Maximum Redundancy Pruning (MRP), which consists of Redundancy Measurement by Community Detection (RMCD) and Structural Redundancy Pruning (SRP). We first establish a Role-Information (RI) hypothesis, based on the link between social networks and convolutional neural networks, through an empirical study. Building on this, RMCD estimates the level of redundancy of each layer, enabling adaptive pruning without predefined layer-wise ratios. In addition, we introduce SRP to obtain a sub-network with an optimal architecture according to the per-layer redundancy given by RMCD. Specifically, we recalculate the redundancy of each layer at every iteration and remove the most replaceable filters in the most redundant layer until a target compression ratio is reached. This approach automatically determines the layer-wise pruning ratios, avoiding the limitations of uniform pruning and expensive architecture search. Our MRP method reduces the model size of ResNet-110 by up to 52.4% and FLOPs by up to 50.3% on CIFAR-10 while improving the original accuracy by 1.04% after retraining.
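The paper measures redundancy via community detection (RMCD); the sketch below swaps in a simple cosine-similarity redundancy score as a stand-in, purely to illustrate the greedy loop of repeatedly pruning the most replaceable filter from the most redundant layer (PyTorch; illustrative only, not the authors' algorithm):

```python
import torch
import torch.nn.functional as F

def layer_redundancy(weight):
    """Mean pairwise cosine similarity between a layer's filters (higher = more redundant)."""
    flat = F.normalize(weight.flatten(1), dim=1)   # (out_ch, in_ch*k*k)
    sim = flat @ flat.t()
    n = sim.size(0)
    return float((sim.sum() - n) / (n * (n - 1)))  # exclude self-similarity

def most_replaceable_filter(weight):
    """Index of the filter most similar, on average, to the others."""
    flat = F.normalize(weight.flatten(1), dim=1)
    sim = flat @ flat.t()
    return int((sim.sum(dim=1) - 1).argmax())

def prune_step(layer_weights):
    """One greedy step: drop one filter from the currently most redundant layer."""
    scores = {name: layer_redundancy(w) for name, w in layer_weights.items()}
    target = max(scores, key=scores.get)
    idx = most_replaceable_filter(layer_weights[target])
    keep = [i for i in range(layer_weights[target].size(0)) if i != idx]
    layer_weights[target] = layer_weights[target][keep]
    return target, idx  # repeat until the target compression ratio is reached
```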
Title: Progressive Reverse Attention Network for image inpainting detection and localization
Authors: Shuai Liu, Jiyou Chen, Xiangling Ding, Gaobo Yang
Computer Vision and Image Understanding, vol. 259, Article 104407. Published 2025-06-09. DOI: 10.1016/j.cviu.2025.104407
Abstract: Image inpainting was originally developed to restore damaged image areas, but it can be maliciously used for object removal that changes the semantic content of an image, which can easily lead to serious crises of public confidence. Existing image inpainting forensics methods have achieved remarkable results, but they usually ignore or fail to capture the subtle artifacts near object boundaries, resulting in inaccurate object mask localization. To address this issue, we propose a Progressive Reverse Attention Network (PRA-Net) for image inpainting detection and localization. Unlike traditional convolutional neural network (CNN) structures, PRA-Net follows an encoder–decoder architecture. The encoder leverages features at different scales with dense cross-connections to locate inpainted regions and generates a global map with our multi-scale extraction module. A reverse attention module serves as the backbone of the decoder to progressively refine the details of the predictions. Experimental results show that PRA-Net achieves accurate inpainting localization and desirable robustness.
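A minimal, generic reverse-attention refinement step of the kind the decoder description suggests (PyTorch; the exact PRA-Net block is not specified in the abstract, so all details here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReverseAttention(nn.Module):
    """Reverse attention: refine a coarse mask by focusing on regions it has not yet covered."""
    def __init__(self, channels):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1))

    def forward(self, feat, coarse_pred):
        # coarse_pred: (B, 1, h, w) logits of the inpainted-region mask from a deeper stage
        pred = F.interpolate(coarse_pred, size=feat.shape[-2:], mode="bilinear",
                             align_corners=False)
        reverse = 1.0 - torch.sigmoid(pred)       # highlight what is NOT yet detected
        residual = self.refine(feat * reverse)    # predict a correction from those regions
        return pred + residual                    # refined prediction at this scale
```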
Title: SASFNet: Soft-edge awareness and spatial-attention feedback deep network for blind image deblurring
Authors: Jing Cheng, Kaibing Zhang, Jiahui Hou, Yuhong Zhang, Guang Shi
Computer Vision and Image Understanding, vol. 259, Article 104408. Published 2025-06-07. DOI: 10.1016/j.cviu.2025.104408
Abstract: When a camera captures moving objects in natural scenes, the resulting images are degraded to varying degrees by camera shake and object displacement, a phenomenon known as motion blur. The complexity of natural scenes makes motion deblurring even more challenging. Deep learning-based blind motion deblurring currently faces two crucial problems: (1) how to restore sharp images with fine textures, and (2) how to improve the generalization of the model. In this paper, we propose the Soft-edge Awareness and Spatial-attention Feedback deep Network (SASFNet) to restore sharp images. First, we restore fine textures using a soft-edge assist mechanism: a soft-edge extraction network maps fine edge information from the blurred image to help the model restore a high-quality sharp image. Second, to improve generalization, we propose a feedback mechanism with attention. Similar to curriculum learning, the feedback mechanism imitates the human learning process, learning from easy to difficult examples when restoring sharp images, which both refines the restored features and yields better generalization. We train and validate the model on the GoPro dataset and test its generalization on the RealBlur dataset. Experiments show that SASFNet not only restores sharp images that better match human perception but also generalizes well.
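As an illustration of edge-assisted restoration, here is a minimal fusion block that injects a predicted soft-edge map into deblurring features (PyTorch; a generic sketch, not the SASFNet design):

```python
import torch
import torch.nn as nn

class EdgeGuidedFusion(nn.Module):
    """Inject a predicted soft-edge map into restoration features as extra guidance."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(channels + 1, channels, 3, padding=1)

    def forward(self, feat, soft_edge):
        # soft_edge: (B, 1, H, W) edge probabilities predicted from the blurred input
        return self.fuse(torch.cat([feat, soft_edge], dim=1)) + feat
```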
Title: Fast self-supervised 3D mesh object retrieval for geometric similarity
Authors: Kajal Sanklecha, Prayushi Mathur, P.J. Narayanan
Computer Vision and Image Understanding, vol. 259, Article 104405. Published 2025-06-06. DOI: 10.1016/j.cviu.2025.104405
Abstract: Digital 3D models play a pivotal role in engineering, entertainment, education, and other domains. However, the search and retrieval of these models have not received adequate attention compared to other digital assets such as documents and images. Traditional supervised methods face scalability challenges because creating large, labeled collections of 3D objects is impractical. In response, this paper introduces a self-supervised approach to generate efficient embeddings for 3D mesh objects, facilitating ranked retrieval of similar objects. The proposed method employs a straightforward representation of mesh objects and an encoder–decoder architecture to learn the embedding. Extensive experiments demonstrate that our approach is competitive with supervised methods and scales across diverse object collections. Notably, the method transfers across datasets, implying broader applicability beyond the training data. The robustness and generalization of the proposed method are substantiated through experiments on varied datasets, underscoring its ability to capture underlying patterns and features independent of dataset-specific nuances. This self-supervised framework offers a promising solution for enhancing the search and retrieval of 3D models, addressing key challenges in scalability and dataset transferability.
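Ranked retrieval over learned embeddings reduces to nearest-neighbor search; a minimal cosine-similarity retrieval sketch follows (NumPy; the embedding dimension and gallery size are arbitrary assumptions):

```python
import numpy as np

def retrieve(query_emb, gallery_embs, top_k=5):
    """Rank gallery meshes by cosine similarity of their embeddings to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy example: 1000 gallery embeddings of dimension 256
rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 256))
idx, sims = retrieve(rng.standard_normal(256), gallery)
print(idx, sims)
```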