Computer Vision and Image Understanding最新文献_第4页

EntroFormer: An entropy-based sparse vision transformer for real-time semantic segmentation EntroFormer：一个基于熵的稀疏视觉转换器，用于实时语义分割

IF 3.5 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-08-20 DOI: 10.1016/j.cviu.2025.104482

Zhiyan Wang , Song Wang , Lin Yuanbo Wu , Deyin Liu , Lei Gao , Lin Qi , Guanghui Wang

{"title":"EntroFormer: An entropy-based sparse vision transformer for real-time semantic segmentation","authors":"Zhiyan Wang , Song Wang , Lin Yuanbo Wu , Deyin Liu , Lei Gao , Lin Qi , Guanghui Wang","doi":"10.1016/j.cviu.2025.104482","DOIUrl":"10.1016/j.cviu.2025.104482","url":null,"abstract":"<div><div>Image semantic segmentation plays a fundamental role in a wide range of pixel-level scene understanding tasks. State-of-the-art segmentation methods often leverage sparse attention mechanisms to identify informative patches for modeling long-range dependencies, significantly reducing the computational complexity of Vision Transformers. Most of these methods focus on selecting regions that are highly relevant to the queries, achieving strong performance in tasks like classification and object detection. However, in the semantic segmentation task, current sparse attention methods are limited by their query-based focus, overlooking the importance of interactions between different objects. In this paper, we propose Sparse Entropy Attention (SEA) to select regions with higher informational content for long-range dependency capture. Specifically, the information entropy of each region is computed to assess its uncertainty in semantic prediction. Regions with high information entropy are considered informative and selected to explore sparse global semantic dependencies. Based on SEA, we present an entropy-based sparse Vision Transformer (EntroFormer) network for real-time semantic segmentation. EntroFormer integrates sparse global semantic features with dense local ones, enhancing the network’s ability to capture both the interaction of image contents and specific semantics. Experimental results show that the proposed real-time network outperforms state-of-the-art methods with similar parameters and computational costs on the Cityscapes, COCO-Stuff, and Bdd100K datasets. Ablation studies further demonstrate that SEA outperforms other sparse attention mechanisms in semantic segmentation.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104482"},"PeriodicalIF":3.5,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144902260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

CCANet: A Cross-scale Context Aggregation Network for UAV object detection CCANet：用于无人机目标检测的跨尺度上下文聚合网络

IF 3.5 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-08-20 DOI: 10.1016/j.cviu.2025.104472

Lei Shang , Qihan He , Huan Lei , Wenyuan Yang

{"title":"CCANet: A Cross-scale Context Aggregation Network for UAV object detection","authors":"Lei Shang , Qihan He , Huan Lei , Wenyuan Yang","doi":"10.1016/j.cviu.2025.104472","DOIUrl":"10.1016/j.cviu.2025.104472","url":null,"abstract":"<div><div>With the rapid advancement of deep learning technology, Unmanned Aerial Vehicle (UAV) object detection demonstrates significant potential across various fields. However, multi-scale object variations and complex environmental interference in UAV images present considerable challenges. This paper proposes a new UAV object detection network named Cross-scale Context Aggregation Network (CCANet), which contains Multi-scale Convolution Aggregation Darknet (MCADarknet) and Cross-scale Context Aggregation Feature Pyramid Network (CCA-FPN). First, MCADarknet serves as a multi-scale feature extraction network. It employs parallel multi-scale convolutional kernels and depth-wise strip convolution techniques to expand the network’s receptive field, extracting feature maps at four different scales layer by layer. Second, to address interference in complex scenes, a Context Enhanced Fusion method enhances the interaction between adjacent features extracted by MCADarknet and higher-level features to form intermediate features. Finally, CCA-FPN employs a cross-scale fusion strategy to deeply integrate shallow, intermediate, and deep feature information, thereby enhancing object representation in complex scenarios. Experimental results indicate that CCANet performs well on three public datasets. In particular, <span><math><msub><mrow><mo>mAP</mo></mrow><mrow><mn>50</mn></mrow></msub></math></span> and <span><math><msub><mrow><mo>mAP</mo></mrow><mrow><mn>50</mn><mo>−</mo><mn>95</mn></mrow></msub></math></span> can reach 47.4% and 29.4% respectively on the VisDrone dataset. Compared to the baseline model, it achieves improvements of 6.2% and 4.3%.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104472"},"PeriodicalIF":3.5,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144908636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Optimizing ambiguous speech emotion recognition through spatial–temporal parallel network with label correction strategy 带标签校正策略的时空并行网络优化模糊语音情感识别

IF 3.5 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-08-20 DOI: 10.1016/j.cviu.2025.104483

Chenquan Gan , Daitao Zhou , Kexin Wang , Qingyi Zhu , Deepak Kumar Jain , Vitomir Štruc

{"title":"Optimizing ambiguous speech emotion recognition through spatial–temporal parallel network with label correction strategy","authors":"Chenquan Gan , Daitao Zhou , Kexin Wang , Qingyi Zhu , Deepak Kumar Jain , Vitomir Štruc","doi":"10.1016/j.cviu.2025.104483","DOIUrl":"10.1016/j.cviu.2025.104483","url":null,"abstract":"<div><div>Speech emotion recognition is of great significance for improving the human–computer interaction experience. However, traditional methods based on hard labels have difficulty dealing with the ambiguity of emotional expression. Existing studies alleviate this problem by redefining labels, but still rely on the subjective emotional expression of annotators and fail to consider the truly ambiguous speech samples without dominant labels fully. To solve the problems of insufficient expression of emotional labels and ignoring ambiguous undominantly labeled speech samples, we propose a label correction strategy that uses a model with exact sample knowledge to modify inappropriate labels for ambiguous speech samples, integrating model training with emotion cognition, and considering the ambiguity without dominant label samples. It is implemented on a spatial–temporal parallel network, which adopts a temporal pyramid pooling (TPP) to process the variable-length features of speech to improve the recognition efficiency of speech emotion. Through experiments, it has been shown that ambiguous speech after label correction has a more promoting effect on the recognition performance of speech emotions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104483"},"PeriodicalIF":3.5,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144902128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Bi-granularity balance learning for long-tailed image classification 基于双粒度平衡学习的长尾图像分类

IF 3.5 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-08-20 DOI: 10.1016/j.cviu.2025.104469

Ning Ren , Xiaosong Li , Yanxia Wu , Yan Fu

{"title":"Bi-granularity balance learning for long-tailed image classification","authors":"Ning Ren , Xiaosong Li , Yanxia Wu , Yan Fu","doi":"10.1016/j.cviu.2025.104469","DOIUrl":"10.1016/j.cviu.2025.104469","url":null,"abstract":"<div><div>In long-tailed datasets, the training of deep neural network-based models faces challenges, where the model may become biased towards the head classes with abundant training data, resulting in poor performance on tail classes with limited samples. Most current methods employ contrastive learning to learn more balanced representations by finding the class center. However, these methods use class centers to address local imbalance within a mini-batch, they overlook the global imbalance between batches throughout an epoch, caused by the long-tailed distribution of the dataset. In this paper, we propose <strong>bi-granularity balance</strong> learning to address the two-layer imbalance. We decouple the attraction–repulsion term in contrastive loss into two independent components: global and local balance. The global balance component focuses on capturing semantic information from different perspectives of the image and shifting learning attention from the head classes to the tail classes in the global perspective. The local balance component aims to learn inter-class separability from the local perspective. The proposed method efficiently learns the intra-class compactness and inter-class separability in long-tailed model training and improves the performance of the long-tailed model. Experimental results show that the proposed method achieves competitive performance on long-tailed benchmarks such as CIFAR-10/100-LT, TinyImageNet-LT, and iNaturalist 2018.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"261 ","pages":"Article 104469"},"PeriodicalIF":3.5,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145097947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Representation learning of point cloud upsampling in global and local inputs 全局和局部输入点云上采样的表示学习

IF 3.5 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-08-19 DOI: 10.1016/j.cviu.2025.104467

Tongxu Zhang , Bei Wang

引用次数: 0

SAVE: Segment Audio-Visual Easy way using the Segment Anything Model 保存：分割视听使用分割任何模型的简单方法

IF 3.5 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-08-14 DOI: 10.1016/j.cviu.2025.104460

Khanh-Binh Nguyen, Chae Jung Park

{"title":"SAVE: Segment Audio-Visual Easy way using the Segment Anything Model","authors":"Khanh-Binh Nguyen, Chae Jung Park","doi":"10.1016/j.cviu.2025.104460","DOIUrl":"10.1016/j.cviu.2025.104460","url":null,"abstract":"<div><div>Audio-visual segmentation (AVS) primarily aims to accurately detect and pinpoint sound elements in visual contexts by predicting pixel-level segmentation masks. To address this task effectively, it is essential to thoroughly consider both the data and model aspects. This study introduces a streamlined approach, SAVE, which directly modifies the pretrained segment anything model (SAM) for the AVS task. By integrating an image encoder adapter within the transformer blocks for improved dataset-specific information capture and introducing a residual audio encoder adapter to encode audio features as a sparse prompt, our model achieves robust audio-visual fusion and interaction during encoding. Our method enhances the training and inference speeds by reducing the input resolution from 1024 to 256 pixels while still surpassing the previous state-of-the-art (SOTA) in performance. Extensive experiments validated our approach, indicating that our model significantly outperforms other SOTA methods. Additionally, utilizing the pretrained model on synthetic data enhances performance on real AVSBench data, attaining mean intersection over union (mIoU) of 84.59 on the S4 (V1S) subset and 70.28 on the MS3 (V1M) set with image inputs of 256 pixels. This performance increases to 86.16 mIoU on the S4 (V1S) and 70.83 mIoU on the MS3 (V1M) with 1024-pixel inputs. These findings show that simple adaptations of pretrained models can enhance AVS and support real-world applications.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104460"},"PeriodicalIF":3.5,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144864594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

DBiSeNet: Dual bilateral segmentation network for real-time semantic segmentation DBiSeNet：用于实时语义分割的双双边分割网络

IF 3.5 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-08-12 DOI: 10.1016/j.cviu.2025.104461

Xiaobo Hu , Hongbo Zhu , Ning Su , Taosheng Xu

{"title":"DBiSeNet: Dual bilateral segmentation network for real-time semantic segmentation","authors":"Xiaobo Hu , Hongbo Zhu , Ning Su , Taosheng Xu","doi":"10.1016/j.cviu.2025.104461","DOIUrl":"10.1016/j.cviu.2025.104461","url":null,"abstract":"<div><div>Bilateral networks have shown effectiveness and efficiency for real-time semantic segmentation. However, the single bilateral architecture exhibits limitations in capturing multi-scale feature representations and addressing misalignment issues during spatial and contextual feature fusion, thereby constraining segmentation accuracy. To address these challenges, we propose a novel dual bilateral segmentation network (DBiSeNet) that incorporates an additional bilateral branch into the original architecture. The additional (high-scale) bilateral operating at high resolution to preserve fine-grained details and responsible for thin object prediction, while the original (low-scale) bilateral maintains an enlarged receptive field to capture global context for large object segmentation. Furthermore, we introduce an aligned and refined feature fusion module to mitigate feature misalignment within each bilateral branch. To optimize the final prediction, we design a dual prediction fusion module that utilizes the low-scale segmentation results as a baseline and adaptively incorporates complementary information from high-scale predictions. Extensive experiments on the Cityscapes and CamVid datasets validate the effectiveness of DBiSeNet in achieving an optimal balance between accuracy and inference speed. In particular, on a single RTX3090 GPU, DBiSeNet2 yields 75.6% mIoU at 225.9 FPS on Cityscapes test set and 75.7% mIoU at 203.4 FPS on CamVid test set.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104461"},"PeriodicalIF":3.5,"publicationDate":"2025-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144864593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Robust cross-image adversarial watermark with JPEG resistance for defending against Deepfake models 抗JPEG的鲁棒跨图像对抗水印，用于防御Deepfake模型

IF 3.5 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-08-11 DOI: 10.1016/j.cviu.2025.104459

Zhiyu Lin , Hanbin Lin , Liqiang Lin , Shuwu Chen , Xiaolong Liu

{"title":"Robust cross-image adversarial watermark with JPEG resistance for defending against Deepfake models","authors":"Zhiyu Lin , Hanbin Lin , Liqiang Lin , Shuwu Chen , Xiaolong Liu","doi":"10.1016/j.cviu.2025.104459","DOIUrl":"10.1016/j.cviu.2025.104459","url":null,"abstract":"<div><div>The widespread convenience of generative models has exacerbated the misuse of attribute-editing-based Deepfake technologies, leading to the proliferation of illegally generated content that severely threatens personal privacy and security. Existing proactive defense strategies mitigate Deepfake attacks by embedding imperceptible adversarial watermarks into the spatial-domain of protected images. However, spatial-domain adversarial watermarks are inherently sensitive to lossy compression operations, which significantly degrades their defense efficacy. To address this limitation, we propose a frequency-domain cross-image adversarial watermark generation scheme to enhance robustness toward JPEG compression. In the proposed method, the adversarial watermark training process is migrated to the frequency domain using a differentiable JPEG module, which explicitly simulates the impact of quantization and compression on perturbation distributions. Furthermore, a fusion module is incorporated to coordinate watermark distributions across images, thereby enhancing the generalization of the defense. Experimental results demonstrate that the generated adversarial watermarks exhibit strong robustness against JPEG compression and effectively disrupt the outputs of Deepfake models. Moreover, the proposed scheme can be directly applied to diverse facial images without retraining, thereby providing reliable protection for real-world image application scenarios.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104459"},"PeriodicalIF":3.5,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144828954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Training-free diffusion for controlling illumination conditions in images 用于控制图像中照明条件的无训练扩散

IF 3.5 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-08-07 DOI: 10.1016/j.cviu.2025.104450

Xiaoyan Xing , Tao Hu , Jan Hendrik Metzen , Konrad Groh , Sezer Karaoglu , Theo Gevers

引用次数: 0

Multiscale Spatio-Temporal Fusion Network for video dehazing 用于视频去雾的多尺度时空融合网络

IF 3.5 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-08-07 DOI: 10.1016/j.cviu.2025.104462

Qingru Zhang , Guorong Chen , Yixuan Zhang , Jinmei Zhang , Shaofeng Liu , Jian Wang

{"title":"Multiscale Spatio-Temporal Fusion Network for video dehazing","authors":"Qingru Zhang , Guorong Chen , Yixuan Zhang , Jinmei Zhang , Shaofeng Liu , Jian Wang","doi":"10.1016/j.cviu.2025.104462","DOIUrl":"10.1016/j.cviu.2025.104462","url":null,"abstract":"<div><div>Video dehazing aims to restore high-resolution and high-contrast haze-free frames, which is crucial in engineering applications such as intelligent traffic monitoring systems. These monitoring systems heavily rely on clear visual information to ensure accurate decision-making and reliable operation. However, despite significant advances achieved by deep learning methods, they still face challenges when dealing with diverse real-world scenarios. To address these issues, we propose a Multi-Scale Spatio-Temporal Fusion Network (MSTF-Net), a novel framework designed to enhance video dehazing performance in complex engineering environments. Specifically, the MainAux Encoder integrates multi-source information through a progressively enhanced feature fusion mechanism, improving the representation of both global dynamics and local details. Furthermore, the Spatio-Temporal Adaptive Fusion (STAF) module ensures robust temporal consistency and spatial clarity by leveraging multi-level spatio-temporal information fusion. To evaluate our framework, we constructed a challenging dataset named “DarkRoad”, which includes low-light, uneven lighting, and dynamic outdoor scenarios, addressing the key limitations of existing datasets in video dehazing tasks. Extensive experiments demonstrate that MSTF-Net achieves state-of-the-art performance, excelling particularly in applications requiring high clarity, strong contrast, and detailed preservation, providing a reliable solution to video dehazing problems in practical engineering scenarios.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104462"},"PeriodicalIF":3.5,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144864592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0