Multilevel spatial–temporal feature analysis for generic event boundary detection in videos
Van Thong Huynh, Seungwon Kim, Hyung-Jeong Yang, Soo-Hyung Kim
Computer Vision and Image Understanding, Volume 259, Article 104429. Published 2025-06-20. DOI: 10.1016/j.cviu.2025.104429

Abstract: Generic event boundary detection (GEBD) aims to split a video into chunks at a broad and diverse set of action changes, mirroring how humans naturally perceive event boundaries. In this study, we propose an approach that leverages multilevel spatial–temporal features to build a framework for localizing generic events in videos. Our method capitalizes on the correlation between neighboring frames, employing a hierarchy of spatial and temporal features to create a comprehensive representation. Specifically, features from multiple spatial dimensions of a pre-trained ResNet-50 are combined with diverse temporal views, generating a multilevel spatial–temporal feature map. This map facilitates the calculation of similarities between neighboring frames, which are then projected to build a multilevel spatial–temporal similarity feature vector. Subsequently, a decoder employing 1D convolution operations deciphers these similarities, incorporating their temporal relationships to estimate boundary scores effectively. Extensive experiments on the GEBD benchmark dataset demonstrate the superior performance of our system and its variants, which outperform state-of-the-art approaches. Additional experiments on the TAPOS dataset, which comprises long-form videos of Olympic sport actions, further confirm the efficacy of the proposed method compared to existing techniques.
Light-YOLO: A lightweight and high-performance network for detecting small obstacles on roads at night
Dan Huang, Guangyin Zhang, Zixu Li, Keying Liu, Wenguang Luo
Computer Vision and Image Understanding, Volume 259, Article 104428. Published 2025-06-20. DOI: 10.1016/j.cviu.2025.104428

Abstract: To address the challenges of detecting small obstacles and of model portability, this study proposes Light-YOLO, a lightweight, high-precision, and high-speed network for small obstacle detection in nighttime road environments. First, the SPDConvMobileNetV3 feature extraction network is introduced, which significantly reduces the total number of parameters while enhancing the ability to capture small obstacle details. Next, to make the network focus more on small obstacles under nighttime conditions, the Wise-IoU loss function is incorporated, which is better suited to low-quality images. Finally, to improve overall model performance without increasing the number of parameters, the parameter-free attention mechanism SimAM is integrated. Experiments on both publicly available data and a self-built dataset show that Light-YOLO achieves a mean average precision (mAP50) of 97.1% while maintaining a high image processing speed. Compared to other advanced models in the same series, Light-YOLO also has fewer parameters, a smaller computational load (GFLOPs), and a reduced model weight (Best.pt). Overall, Light-YOLO strikes a balance between lightweight design, accuracy, and speed, making it well suited for hardware-constrained devices.
GraPLUS: Graph-based Placement Using Semantics for image composition
Mir Mohammad Khaleghi, Mehran Safayani, Abdolreza Mirzaei
Computer Vision and Image Understanding, Volume 259, Article 104427. Published 2025-06-20. DOI: 10.1016/j.cviu.2025.104427

Abstract: We present GraPLUS (Graph-based Placement Using Semantics), a novel framework for plausible object placement in images that leverages scene graphs and large language models. Our approach uniquely combines graph-structured scene representation with semantic understanding to determine contextually appropriate object positions. The framework employs GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture both definitional characteristics and typical spatial contexts, enabling a nuanced understanding of object relationships and placement patterns. GraPLUS achieves a placement accuracy of 92.1% and an FID score of 28.83 on the OPA dataset, outperforming state-of-the-art methods by 8.3% while maintaining competitive visual quality. In human evaluation studies involving 964 samples assessed by 38 participants, our method was preferred in 51.8% of cases, significantly outperforming previous approaches (25.8% and 22.4% for the next best methods). The framework's key innovations include: (i) leveraging pre-trained scene graph models that transfer knowledge from other domains, eliminating the need to train feature extraction parameters from scratch; (ii) edge-aware graph neural networks that process scene semantics through structured relationships; (iii) a cross-modal attention mechanism that aligns categorical embeddings with enhanced scene features; and (iv) a multiobjective training strategy incorporating semantic consistency constraints. Extensive experiments demonstrate GraPLUS's superior performance in both placement plausibility and spatial precision, with particular strengths in maintaining object proportions and contextual relationships across diverse scene types.
STDepth: Leveraging semantic-textural information in transformers for self-supervised monocular depth estimation
Xuanang Gao, Bingchao Wang, Zhiwei Ning, Jie Yang, Wei Liu
Computer Vision and Image Understanding, Volume 259, Article 104422. Published 2025-06-18. DOI: 10.1016/j.cviu.2025.104422

Abstract: Self-supervised monocular depth estimation, which relies solely on monocular or stereo video for supervision, plays an important role in computer vision. The encoder backbone generates features at various stages, and each stage exhibits distinct properties. However, conventional methods fail to take full advantage of these distinctions and apply the same processing to features from different stages, lacking the adaptability required to aggregate the unique information in each feature. In this work, we replace convolutional neural networks (CNNs) with a Transformer as the encoder backbone to enhance the model's ability to encode long-range spatial dependencies. Furthermore, we introduce a semantic-textural decoder (STDec) to emphasize local critical regions and process intricate details more effectively. The STDec incorporates two principal modules: (1) the global feature recalibration (GFR) module, which performs a comprehensive analysis of the scene structure using high-level features and recalibrates features in the spatial dimension through semantic information, and (2) the detail focus (DF) module, which is applied to low-level features to capture texture details precisely. Additionally, we propose a multi-arbitrary-scale reconstruction loss (MAS Loss) to fully exploit the depth estimation network's capabilities. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on the KITTI dataset. Moreover, our models show remarkable generalization ability when applied to the Make3D and NYUv2 datasets. The code is publicly available at https://github.com/xagao/STDepth.
{"title":"Deep semantic segmentation for drivable area detection on unstructured roads","authors":"Xiangjun Mo, Yonghui Feng, Yihe Liu","doi":"10.1016/j.cviu.2025.104420","DOIUrl":"10.1016/j.cviu.2025.104420","url":null,"abstract":"<div><div>Drivable area detection on unstructured roads is crucial for autonomous driving, as it provides path planning constraints for end-to-end models and enhances driving safety. This paper proposes a deep learning approach for drivable area detection on unstructured roads using semantic segmentation. The deep learning approach is based on the DeepLabv3+ network and incorporates a Unit Attention Module following the Atrous Spatial Pyramid Pooling Module in the encoder. The Unit Attention Module combines a dual attention module and a spatial attention module. It enhances the adaptive weighting of semantic information in key channels and spatial locations, thereby improving the overall segmentation accuracy of drivable areas on unstructured roads. Evaluations on the India Driving Dataset demonstrate that the proposed network consistently surpasses most comparative methods, achieving a mean IoU of 85.99% and a mean pixel accuracy of 92.01%.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104420"},"PeriodicalIF":4.3,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144331255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MSCA: A few-shot segmentation framework driven by multi-scale cross-attention and information extraction
Zhihao Ren, Shengning Lu, Xinhua Wang, Yaoming Liu, Yong Liang
Computer Vision and Image Understanding, Volume 259, Article 104419. Published 2025-06-10. DOI: 10.1016/j.cviu.2025.104419

Abstract: Few-shot semantic segmentation (FSS) aims to achieve precise pixel-level segmentation of target objects in query images using only a small number of annotated support images. The main challenge lies in effectively capturing and transferring critical information from support samples while establishing fine-grained semantic associations between query and support images to improve segmentation accuracy. However, existing methods struggle with spatial alignment issues caused by intra-class variations and inter-class visual similarities, and they fail to fully integrate high-level and low-level decoder features. To address these limitations, we propose a novel framework based on cross-scale interactive attention mechanisms. This framework employs a hybrid mask-guided multi-scale feature fusion strategy, constructing a cross-scale attention network that spans from local details to global context. It dynamically enhances target region representation and alleviates spatial misalignment issues. Furthermore, we design a hierarchical multi-axis decoding architecture that progressively integrates multi-resolution feature pathways, enabling the model to focus on semantic associations within foreground regions. Experimental results show that our Multi-Scale Cross-Attention (MSCA) model performs exceptionally well on the PASCAL-5i and COCO-20i benchmark datasets, achieving highly competitive results. Notably, the model contains only 1.86 million learnable parameters, demonstrating its efficiency and practical applicability.
Exploring black-box adversarial attacks on Interpretable Deep Learning Systems
Yike Zhan, Baolin Zheng, Dongxin Liu, Boren Deng, Xu Yang
Computer Vision and Image Understanding, Volume 259, Article 104423. Published 2025-06-10. DOI: 10.1016/j.cviu.2025.104423

Abstract: Recent studies have empirically demonstrated that neural network interpretability is susceptible to malicious manipulation. However, existing attacks on Interpretable Deep Learning Systems (IDLSes) predominantly focus on the white-box setting, which is impractical for real-world applications. In this paper, we present the first attempt to attack IDLSes in the more challenging and realistic black-box setting. We introduce a novel framework called Dual Black-box Adversarial Attack (DBAA), which generates adversarial examples that are misclassified as the target class while maintaining interpretations similar to their benign counterparts. In our method, adversarial examples are generated via black-box adversarial attacks and then refined using ADV-Plugin, a novel approach proposed in this paper that employs single-pixel perturbation and an adaptive step-size algorithm to enhance explanation similarity with benign samples while preserving adversarial properties. We conduct extensive experiments on multiple datasets (CIFAR-10, ImageNet, and Caltech-101) and various combinations of classifiers and interpreters, comparing our approach against five baseline methods. Empirical results indicate that DBAA is comparable to regular adversarial attacks in compromising classifiers and significantly enhances interpretability deception. Specifically, DBAA achieves Intersection over Union (IoU) scores exceeding 0.5 across all interpreters, approximately doubling the performance of regular attacks, while reducing the average ℓ2 distance between its attribution maps and those of benign samples by about 50%.
{"title":"Enhancing vision–language contrastive representation learning using domain knowledge","authors":"Xiaoyang Wei, Camille Kurtz, Florence Cloppet","doi":"10.1016/j.cviu.2025.104403","DOIUrl":"10.1016/j.cviu.2025.104403","url":null,"abstract":"<div><div>Visual representation learning plays a key role in solving medical computer vision tasks. Recent advances in the literature often rely on vision–language models aiming to learn the representation of medical images from the supervision of paired captions in a label-free manner. The training of such models is however very data/time intensive and the alignment strategies involved in the contrastive loss functions may not capture the full richness of information carried by inter-data relationships. We assume here that considering expert knowledge from the medical domain can provide solutions to these problems during model optimization. To this end, we propose a novel knowledge-augmented vision–language contrastive representation learning framework consisting of the following steps: (1) Modeling the hierarchical relationships between various medical concepts using expert knowledge and medical images in a dataset through a knowledge graph, followed by translating each node into a knowledge embedding; And (2) integrating knowledge embeddings into a vision–language contrastive learning framework, either by introducing an additional alignment loss between visual and knowledge embeddings or by relaxing binary constraints of vision–language alignment using knowledge embeddings. Our results demonstrate that the proposed solution achieves competitive performances against state-of-the-art approaches for downstream tasks while requiring significantly less training data. Our code is available at <span><span>https://github.com/Wxy-24/KL-CVR</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104403"},"PeriodicalIF":4.3,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144313917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Maximum redundancy pruning for network compression","authors":"Chang Gao, Jiaqi Wang, Liping Jing","doi":"10.1016/j.cviu.2025.104404","DOIUrl":"10.1016/j.cviu.2025.104404","url":null,"abstract":"<div><div>Filter pruning has become one of the most powerful methods for model compression in recent years. However, existing pruning methods often rely on predefined layer-wise pruning ratios or computationally expensive search processes, leading to suboptimal architectures and high computational overhead. To address these limitations, we propose a novel pruning method, termed Maximum Redundancy Pruning (MRP), which consists of Redundancy Measurement by Community Detection (RMCD) and Structural Redundancy Pruning (SRP). We first demonstrate a Role-Information (RI) hypothesis based on the link between social networks and convolutional neural networks through empirical study. Based on that, RMCD is proposed to obtain the level of redundancy for each layer, enabling adaptive pruning without predefined layer-wise ratios. In addition, we introduce SRP to obtain a sub-network with the optimal architecture according to the redundancy of each layer obtained by RMCD. Specifically, we recalculate the redundancy of each layer at each iteration and then remove the most replaceable filters in the most redundant layer until a target compression ratio is achieved. This approach automatically determines the optimal layer-wise pruning ratios, avoiding the limitations of uniform pruning or expensive architecture search. We show that our proposed MRP method can reduce the model size for ResNet-110 by up to 52.4% and FLOPs by up to 50.3% on CIFAR-10 while actually improving the original accuracy by 1.04% after retraining the networks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104404"},"PeriodicalIF":4.3,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144306363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Progressive Reverse Attention Network for image inpainting detection and localization
Shuai Liu, Jiyou Chen, Xiangling Ding, Gaobo Yang
Computer Vision and Image Understanding, Volume 259, Article 104407. Published 2025-06-09. DOI: 10.1016/j.cviu.2025.104407

Abstract: Image inpainting was originally introduced to restore damaged image areas, but it can be maliciously used for object removal that changes the semantic content of an image, which easily leads to serious crises of public confidence. Existing image inpainting forensics methods have achieved remarkable results, but they usually ignore or fail to capture the subtle artifacts near object boundaries, resulting in inaccurate object mask localization. To address this issue, we propose a Progressive Reverse Attention Network (PRA-Net) for image inpainting detection and localization. Unlike traditional convolutional neural network (CNN) structures, PRA-Net follows an encoder–decoder architecture. The encoder leverages features at different scales with dense cross-connections to locate inpainted regions and generates a global map with our multi-scale extraction module. A reverse attention module serves as the backbone of the decoder to progressively refine the details of the predictions. Experimental results show that PRA-Net achieves accurate image inpainting localization and desirable robustness.