EntroFormer: An entropy-based sparse vision transformer for real-time semantic segmentation
Zhiyan Wang, Song Wang, Lin Yuanbo Wu, Deyin Liu, Lei Gao, Lin Qi, Guanghui Wang
Computer Vision and Image Understanding, Volume 260, Article 104482 (published 2025-08-20). DOI: 10.1016/j.cviu.2025.104482
Abstract: Image semantic segmentation plays a fundamental role in a wide range of pixel-level scene understanding tasks. State-of-the-art segmentation methods often leverage sparse attention mechanisms to identify informative patches for modeling long-range dependencies, significantly reducing the computational complexity of Vision Transformers. Most of these methods focus on selecting regions that are highly relevant to the queries, achieving strong performance in tasks like classification and object detection. In semantic segmentation, however, current sparse attention methods are limited by their query-based focus and overlook the importance of interactions between different objects. In this paper, we propose Sparse Entropy Attention (SEA) to select regions with higher informational content for long-range dependency capture. Specifically, the information entropy of each region is computed to assess its uncertainty in semantic prediction. Regions with high information entropy are considered informative and selected to explore sparse global semantic dependencies. Based on SEA, we present an entropy-based sparse Vision Transformer (EntroFormer) network for real-time semantic segmentation. EntroFormer integrates sparse global semantic features with dense local ones, enhancing the network’s ability to capture both the interaction of image contents and specific semantics. Experimental results show that the proposed real-time network outperforms state-of-the-art methods with similar parameters and computational costs on the Cityscapes, COCO-Stuff, and BDD100K datasets. Ablation studies further demonstrate that SEA outperforms other sparse attention mechanisms in semantic segmentation.
{"title":"CCANet: A Cross-scale Context Aggregation Network for UAV object detection","authors":"Lei Shang , Qihan He , Huan Lei , Wenyuan Yang","doi":"10.1016/j.cviu.2025.104472","DOIUrl":"10.1016/j.cviu.2025.104472","url":null,"abstract":"<div><div>With the rapid advancement of deep learning technology, Unmanned Aerial Vehicle (UAV) object detection demonstrates significant potential across various fields. However, multi-scale object variations and complex environmental interference in UAV images present considerable challenges. This paper proposes a new UAV object detection network named Cross-scale Context Aggregation Network (CCANet), which contains Multi-scale Convolution Aggregation Darknet (MCADarknet) and Cross-scale Context Aggregation Feature Pyramid Network (CCA-FPN). First, MCADarknet serves as a multi-scale feature extraction network. It employs parallel multi-scale convolutional kernels and depth-wise strip convolution techniques to expand the network’s receptive field, extracting feature maps at four different scales layer by layer. Second, to address interference in complex scenes, a Context Enhanced Fusion method enhances the interaction between adjacent features extracted by MCADarknet and higher-level features to form intermediate features. Finally, CCA-FPN employs a cross-scale fusion strategy to deeply integrate shallow, intermediate, and deep feature information, thereby enhancing object representation in complex scenarios. Experimental results indicate that CCANet performs well on three public datasets. In particular, <span><math><msub><mrow><mo>mAP</mo></mrow><mrow><mn>50</mn></mrow></msub></math></span> and <span><math><msub><mrow><mo>mAP</mo></mrow><mrow><mn>50</mn><mo>−</mo><mn>95</mn></mrow></msub></math></span> can reach 47.4% and 29.4% respectively on the VisDrone dataset. Compared to the baseline model, it achieves improvements of 6.2% and 4.3%.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104472"},"PeriodicalIF":3.5,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144908636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing ambiguous speech emotion recognition through spatial–temporal parallel network with label correction strategy
Chenquan Gan, Daitao Zhou, Kexin Wang, Qingyi Zhu, Deepak Kumar Jain, Vitomir Štruc
Computer Vision and Image Understanding, Volume 260, Article 104483 (published 2025-08-20). DOI: 10.1016/j.cviu.2025.104483
Abstract: Speech emotion recognition is of great significance for improving the human–computer interaction experience. However, traditional methods based on hard labels have difficulty dealing with the ambiguity of emotional expression. Existing studies alleviate this problem by redefining labels, but they still rely on the subjective emotional expression of annotators and fail to fully account for truly ambiguous speech samples that lack a dominant label. To address the insufficient expressiveness of emotional labels and the neglect of ambiguous samples without dominant labels, we propose a label correction strategy that uses a model equipped with knowledge of exact (unambiguous) samples to revise inappropriate labels of ambiguous speech samples, integrating model training with emotion cognition and explicitly accounting for samples without a dominant label. The strategy is implemented on a spatial–temporal parallel network that adopts temporal pyramid pooling (TPP) to process variable-length speech features and improve the efficiency of speech emotion recognition. Experiments show that ambiguous speech with corrected labels further improves speech emotion recognition performance.
{"title":"Bi-granularity balance learning for long-tailed image classification","authors":"Ning Ren , Xiaosong Li , Yanxia Wu , Yan Fu","doi":"10.1016/j.cviu.2025.104469","DOIUrl":"10.1016/j.cviu.2025.104469","url":null,"abstract":"<div><div>In long-tailed datasets, the training of deep neural network-based models faces challenges, where the model may become biased towards the head classes with abundant training data, resulting in poor performance on tail classes with limited samples. Most current methods employ contrastive learning to learn more balanced representations by finding the class center. However, these methods use class centers to address local imbalance within a mini-batch, they overlook the global imbalance between batches throughout an epoch, caused by the long-tailed distribution of the dataset. In this paper, we propose <strong>bi-granularity balance</strong> learning to address the two-layer imbalance. We decouple the attraction–repulsion term in contrastive loss into two independent components: global and local balance. The global balance component focuses on capturing semantic information from different perspectives of the image and shifting learning attention from the head classes to the tail classes in the global perspective. The local balance component aims to learn inter-class separability from the local perspective. The proposed method efficiently learns the intra-class compactness and inter-class separability in long-tailed model training and improves the performance of the long-tailed model. Experimental results show that the proposed method achieves competitive performance on long-tailed benchmarks such as CIFAR-10/100-LT, TinyImageNet-LT, and iNaturalist 2018.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104469"},"PeriodicalIF":3.5,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145097947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Representation learning of point cloud upsampling in global and local inputs","authors":"Tongxu Zhang , Bei Wang","doi":"10.1016/j.cviu.2025.104467","DOIUrl":"10.1016/j.cviu.2025.104467","url":null,"abstract":"<div><div>In recent years, point cloud upsampling has been widely applied in tasks such as 3D reconstruction and object recognition. This study proposed a novel framework, ReLPU, which enhances upsampling performance by explicitly learning from both global and local structural features of point clouds. Specifically, we extracted global features from uniformly segmented inputs (Average Segments) and local features from patch-based inputs of the same point cloud. These two types of features were processed through parallel autoencoders, fused, and then fed into a shared decoder for upsampling. This dual-input design improved feature completeness and cross-scale consistency, especially in sparse and noisy regions. Our framework was applied to several state-of-the-art autoencoder-based networks and validated on standard datasets. Experimental results demonstrated consistent improvements in geometric fidelity and robustness. In addition, saliency maps confirmed that parallel global-local learning significantly enhanced the interpretability and performance of point cloud upsampling.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104467"},"PeriodicalIF":3.5,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144887565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SAVE: Segment Audio-Visual Easy way using the Segment Anything Model","authors":"Khanh-Binh Nguyen, Chae Jung Park","doi":"10.1016/j.cviu.2025.104460","DOIUrl":"10.1016/j.cviu.2025.104460","url":null,"abstract":"<div><div>Audio-visual segmentation (AVS) primarily aims to accurately detect and pinpoint sound elements in visual contexts by predicting pixel-level segmentation masks. To address this task effectively, it is essential to thoroughly consider both the data and model aspects. This study introduces a streamlined approach, SAVE, which directly modifies the pretrained segment anything model (SAM) for the AVS task. By integrating an image encoder adapter within the transformer blocks for improved dataset-specific information capture and introducing a residual audio encoder adapter to encode audio features as a sparse prompt, our model achieves robust audio-visual fusion and interaction during encoding. Our method enhances the training and inference speeds by reducing the input resolution from 1024 to 256 pixels while still surpassing the previous state-of-the-art (SOTA) in performance. Extensive experiments validated our approach, indicating that our model significantly outperforms other SOTA methods. Additionally, utilizing the pretrained model on synthetic data enhances performance on real AVSBench data, attaining mean intersection over union (mIoU) of 84.59 on the S4 (V1S) subset and 70.28 on the MS3 (V1M) set with image inputs of 256 pixels. This performance increases to 86.16 mIoU on the S4 (V1S) and 70.83 mIoU on the MS3 (V1M) with 1024-pixel inputs. These findings show that simple adaptations of pretrained models can enhance AVS and support real-world applications.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104460"},"PeriodicalIF":3.5,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144864594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DBiSeNet: Dual bilateral segmentation network for real-time semantic segmentation","authors":"Xiaobo Hu , Hongbo Zhu , Ning Su , Taosheng Xu","doi":"10.1016/j.cviu.2025.104461","DOIUrl":"10.1016/j.cviu.2025.104461","url":null,"abstract":"<div><div>Bilateral networks have shown effectiveness and efficiency for real-time semantic segmentation. However, the single bilateral architecture exhibits limitations in capturing multi-scale feature representations and addressing misalignment issues during spatial and contextual feature fusion, thereby constraining segmentation accuracy. To address these challenges, we propose a novel dual bilateral segmentation network (DBiSeNet) that incorporates an additional bilateral branch into the original architecture. The additional (high-scale) bilateral operating at high resolution to preserve fine-grained details and responsible for thin object prediction, while the original (low-scale) bilateral maintains an enlarged receptive field to capture global context for large object segmentation. Furthermore, we introduce an aligned and refined feature fusion module to mitigate feature misalignment within each bilateral branch. To optimize the final prediction, we design a dual prediction fusion module that utilizes the low-scale segmentation results as a baseline and adaptively incorporates complementary information from high-scale predictions. Extensive experiments on the Cityscapes and CamVid datasets validate the effectiveness of DBiSeNet in achieving an optimal balance between accuracy and inference speed. In particular, on a single RTX3090 GPU, DBiSeNet2 yields 75.6% mIoU at 225.9 FPS on Cityscapes test set and 75.7% mIoU at 203.4 FPS on CamVid test set.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104461"},"PeriodicalIF":3.5,"publicationDate":"2025-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144864593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust cross-image adversarial watermark with JPEG resistance for defending against Deepfake models
Zhiyu Lin, Hanbin Lin, Liqiang Lin, Shuwu Chen, Xiaolong Liu
Computer Vision and Image Understanding, Volume 260, Article 104459 (published 2025-08-11). DOI: 10.1016/j.cviu.2025.104459
Abstract: The widespread convenience of generative models has exacerbated the misuse of attribute-editing-based Deepfake technologies, leading to the proliferation of illegally generated content that severely threatens personal privacy and security. Existing proactive defense strategies mitigate Deepfake attacks by embedding imperceptible adversarial watermarks into the spatial domain of protected images. However, spatial-domain adversarial watermarks are inherently sensitive to lossy compression, which significantly degrades their defense efficacy. To address this limitation, we propose a frequency-domain cross-image adversarial watermark generation scheme that enhances robustness to JPEG compression. In the proposed method, the adversarial watermark training process is migrated to the frequency domain using a differentiable JPEG module, which explicitly simulates the impact of quantization and compression on perturbation distributions. Furthermore, a fusion module is incorporated to coordinate watermark distributions across images, thereby enhancing the generalization of the defense. Experimental results demonstrate that the generated adversarial watermarks exhibit strong robustness against JPEG compression and effectively disrupt the outputs of Deepfake models. Moreover, the proposed scheme can be directly applied to diverse facial images without retraining, providing reliable protection in real-world image application scenarios.
Training-free diffusion for controlling illumination conditions in images
Xiaoyan Xing, Tao Hu, Jan Hendrik Metzen, Konrad Groh, Sezer Karaoglu, Theo Gevers
Computer Vision and Image Understanding, Volume 260, Article 104450 (published 2025-08-07). DOI: 10.1016/j.cviu.2025.104450
Abstract: This paper introduces a novel approach to illumination manipulation in diffusion models, addressing the gap in conditional image generation with a focus on lighting conditions. While most existing methods employ ControlNet and its variants to provide illumination-aware guidance in diffusion models, we instead conceptualize the diffusion model as a black-box image renderer and strategically decompose its energy function in alignment with the image formation model. Our method effectively separates and controls illumination-related properties during the generative process. It generates images with realistic illumination effects, including cast shadows, soft shadows, and inter-reflections. Remarkably, it achieves this without learning an intrinsic decomposition, finding directions in latent space, or additional training on new datasets.
Multiscale Spatio-Temporal Fusion Network for video dehazing
Qingru Zhang, Guorong Chen, Yixuan Zhang, Jinmei Zhang, Shaofeng Liu, Jian Wang
Computer Vision and Image Understanding, Volume 260, Article 104462 (published 2025-08-07). DOI: 10.1016/j.cviu.2025.104462
Abstract: Video dehazing aims to restore high-resolution and high-contrast haze-free frames, which is crucial in engineering applications such as intelligent traffic monitoring systems. These monitoring systems heavily rely on clear visual information to ensure accurate decision-making and reliable operation. However, despite significant advances achieved by deep learning methods, they still face challenges when dealing with diverse real-world scenarios. To address these issues, we propose a Multi-Scale Spatio-Temporal Fusion Network (MSTF-Net), a novel framework designed to enhance video dehazing performance in complex engineering environments. Specifically, the MainAux Encoder integrates multi-source information through a progressively enhanced feature fusion mechanism, improving the representation of both global dynamics and local details. Furthermore, the Spatio-Temporal Adaptive Fusion (STAF) module ensures robust temporal consistency and spatial clarity by leveraging multi-level spatio-temporal information fusion. To evaluate our framework, we constructed a challenging dataset named “DarkRoad”, which includes low-light, uneven lighting, and dynamic outdoor scenarios, addressing the key limitations of existing datasets in video dehazing tasks. Extensive experiments demonstrate that MSTF-Net achieves state-of-the-art performance, excelling particularly in applications requiring high clarity, strong contrast, and detailed preservation, providing a reliable solution to video dehazing problems in practical engineering scenarios.