{"title":"Deep learning model for simultaneous recognition of quantitative and qualitative emotion using visual and bio-sensing data","authors":"Iman Hosseini , Md Zakir Hossain , Yuhao Zhang , Shafin Rahman","doi":"10.1016/j.cviu.2024.104121","DOIUrl":"10.1016/j.cviu.2024.104121","url":null,"abstract":"<div><p>The recognition of emotions heavily relies on important factors such as human facial expressions and physiological signals, including electroencephalogram and electrocardiogram. In literature, emotion recognition is investigated quantitatively (while estimating valance, arousal, and dominance) and qualitatively (while predicting discrete emotions like happiness, sadness, anger, surprise, and so on). Current methods utilize a combination of visual data and bio-sensing information to create recognition systems that incorporate multiple modes (quantitative/qualitative). Nevertheless, these methods necessitate extensive expertise in specific domains and intricate preprocessing procedures, and consequently, they are unable to fully leverage the inherent advantages of end-to-end deep learning techniques. Moreover, methods usually aim to recognize either qualitative or quantitative emotions. Although both kinds of emotions are significantly co-related, previous methods do not simultaneously recognize qualitative and quantitative emotions. In this paper, a novel deep end-to-end framework named DeepVADNet is introduced, specifically designed for the purpose of multi-modal emotion recognition. The proposed framework leverages deep learning techniques to effectively extract crucial face appearance features as well as bio-sensing features, predicting both qualitative and quantitative emotions in a single forward pass. In this study, we employ the CRNN architecture to extract face appearance features, while the ConvLSTM model is utilized to extract spatio-temporal information from visual data (videos). Additionally, we utilize the Conv1D model for processing physiological signals (EEG, EOG, ECG, and GSR) as this approach deviates from conventional manual techniques that involve traditional manual methods for extracting features based on time and frequency domains. After enhancing the feature quality by fusing both modalities, we use a novel method employing quantitative emotion to predict qualitative emotions accurately. We perform extensive experiments on the DEAP and MAHNOB-HCI datasets, achieving state-of-the-art quantitative emotion recognition results of 98.93%/6e-4 and 89.08%/0.97 (mean classification accuracy/MSE) in both datasets, respectively. Also, for the qualitative emotion recognition task, we achieve 82.71% mean classification accuracy on the MAHNOB-HCI dataset. The code and evaluation can be accessed at: <span><span>https://github.com/I-Man-H/DeepVADNet.git</span><svg><path></path></svg></span></p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104121"},"PeriodicalIF":4.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142089182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio–visual deepfake detection using articulatory representation learning","authors":"Yujia Wang, Hua Huang","doi":"10.1016/j.cviu.2024.104133","DOIUrl":"10.1016/j.cviu.2024.104133","url":null,"abstract":"<div><p>Advancements in generative artificial intelligence have made it easier to manipulate auditory and visual elements, highlighting the critical need for robust audio–visual deepfake detection methods. In this paper, we propose an articulatory representation-based audio–visual deepfake detection approach, <em>ART-AVDF</em>. First, we devise an audio encoder to extract articulatory features that capture the physical significance of articulation movement, integrating with a lip encoder to explore audio–visual articulatory correspondences in a self-supervised learning manner. Then, we design a multimodal joint fusion module to further explore inherent audio–visual consistency using the articulatory embeddings. Extensive experiments on the DFDC, FakeAVCeleb, and DefakeAVMiT datasets demonstrate that <em>ART-AVDF</em> obtains a significant performance improvement compared to many deepfake detection models.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104133"},"PeriodicalIF":4.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RSTC: Residual Swin Transformer Cascade to approximate Taylor expansion for image denoising","authors":"Jin Liu , Yang Yang , Biyun Xu , Hao Yu , Yaozong Zhang , Qian Li , Zhenghua Huang","doi":"10.1016/j.cviu.2024.104132","DOIUrl":"10.1016/j.cviu.2024.104132","url":null,"abstract":"<div><p>Traditional denoising methods establish mathematical models by employing different priors, which can achieve preferable results but they are usually time-consuming and their outputs are not adaptive on regularization parameters. While the success of end-to-end deep learning denoising strategies depends on a large amount of data and lacks a theoretical interpretability. In order to address the above problems, this paper proposes a novel image denoising method, namely Residual Swin Transformer Cascade (RSTC), based on Taylor expansion. The key procedures of our RSTC are specified as follows: Firstly, we discuss the relationship between image denoising model and Taylor expansion, as well as its adjacent derivative parts. Secondly, we use a lightweight deformable convolutional neural network to estimate the basic layer of Taylor expansion and a residual network where swin transformer block is selected as a backbone for pursuing the solution of the derivative layer. Finally, the results of the two networks contribute to the approximation solution of Taylor expansion. In the experiments, we firstly test and discuss the selection of network parameters to verify its effectiveness. Then, we compare it with existing advanced methods in terms of visualization and quantification, and the results show that our method has a powerful generalization ability and performs better than state-of-the-art denoising methods on performance improvement and structure preservation.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104132"},"PeriodicalIF":4.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142048067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep video compression based on Long-range Temporal Context Learning","authors":"Kejun Wu , Zhenxing Li , You Yang, Qiong Liu","doi":"10.1016/j.cviu.2024.104127","DOIUrl":"10.1016/j.cviu.2024.104127","url":null,"abstract":"<div><p>Video compression allows for efficient storage and transmission of data, benefiting imaging and vision applications, e.g. computational imaging, photography, and displays by delivering high-quality videos. To exploit more informative contexts of video, we propose DVCL, a novel <strong>D</strong>eep <strong>V</strong>ideo <strong>C</strong>ompression based on <strong>L</strong>ong-range Temporal Context Learning. Aiming at high coding performance, this new compression paradigm makes full use of long-range temporal correlations derived from multiple reference frames to learn richer contexts. Motion vectors (MVs) are estimated to represent the motion relations of videos. By employing MVs, a long-range temporal context learning (LTCL) module is presented to extract context information from multiple reference frames, such that a more accurate and informative temporal contexts can be learned and constructed. The long-range temporal contexts serve as conditions and generate the predicted frames by contextual encoder and decoder. To address the challenge of imbalanced training, we develop a multi-stage training strategy to ensure the whole DVCL framework is trained progressively and stably. Extensive experiments demonstrate the proposed DVCL achieves the highest objective and subjective quality, while maintaining relatively low complexity. Specifically, 25.30% and 45.75% bitrate savings on average can be obtained than x265 codec at the same PSNR and MS-SSIM, respectively.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104127"},"PeriodicalIF":4.3,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142129056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep unsupervised shadow detection with curriculum learning and self-training","authors":"Qiang Zhang, Hongyuan Guo, Guanghe Li, Tianlu Zhang, Qiang Jiao","doi":"10.1016/j.cviu.2024.104124","DOIUrl":"10.1016/j.cviu.2024.104124","url":null,"abstract":"<div><p>Shadow detection is undergoing a rapid and remarkable development along with the wide use of deep neural networks. Benefiting from a large number of training images annotated with strong pixel-level ground-truth masks, current deep shadow detectors have achieved state-of-the-art performance. However, it is expensive and time-consuming to provide the pixel-level ground-truth mask for each training image. Considering that, this paper proposes the first unsupervised deep shadow detection framework, which consists of an initial pseudo label generation (IPG) module, a curriculum learning (CL) module and a self-training (ST) module. The supervision signals used in our learning framework are generated from several existing traditional unsupervised shadow detectors, which usually contain a lot of noisy information. Therefore, each module in our unsupervised framework is dedicated to reduce the adverse influence of noisy information on model training. Specifically, the IPG module combines different traditional unsupervised shadow maps to obtain their complementary shadow information. After obtaining the initial pseudo labels, the CL module and the ST module will be used in conjunction to gradually learn new shadow patterns and update the qualities of pseudo labels simultaneously. Extensive experimental results on various benchmark datasets demonstrate that our deep shadow detector not only outperforms the traditional unsupervised shadow detection methods by a large margin but also achieves comparable results with some recent state-of-the-art fully-supervised deep shadow detection methods.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104124"},"PeriodicalIF":4.3,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for detecting fighting behavior based on key points of human skeletal posture","authors":"Peng Zhang , Xinlei Zhao , Lijia Dong , Weimin Lei , Wei Zhang , Zhaonan Lin","doi":"10.1016/j.cviu.2024.104123","DOIUrl":"10.1016/j.cviu.2024.104123","url":null,"abstract":"<div><p>Detecting fights from videos and images in public surveillance places is an important task to limit violent criminal behavior. Real-time detection of violent behavior can effectively ensure the personal safety of pedestrians and further maintain public social stability. Therefore, in this paper, we aim to detect real-time violent behavior in videos. We propose a novel neural network model framework based on human pose key points, called Real-Time Pose Net (RTPNet). Utilize the pose extractor (YOLO-Pose) to extract human skeleton features, and classify video level violent behavior based on the 2DCNN model (ACTION-Net). Utilize appearance features and inter frame correlation to accurately detect fighting behavior. We have also proposed a new image dataset called VIMD (Violence Image Dataset), which includes images of fighting behavior collected online and captured independently. After training on the dataset, the network can effectively identify skeletal features from videos and locate fighting movements. The dataset is available on GitHub (<span><span>https://github.com/ChinaZhangPeng/Violence-Image-Dataset</span><svg><path></path></svg></span>). We also conducted experiments on four datasets, including Hockey-Fight, RWF-2000, Surveillance Camera Fight, and AVD dataset. These experimental results showed that RTPNet outperformed the most advanced methods in the past, achieving an accuracy of 99.4% on the Hockey-Fight dataset, 93.3% on the RWF-2000 dataset, and 93.4% on the Surveillance Camera Fight dataset, 99.3% on the AVD dataset. And with speeds capable of reaching 33fps, state-of-the-art results are achieved with faster speeds. In addition, RTPNet can also have good detection performance in violent behavior in complex backgrounds.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104123"},"PeriodicalIF":4.3,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deconfounded hierarchical multi-granularity classification","authors":"Ziyu Zhao, Leilei Gan, Tao Shen, Kun Kuang, Fei Wu","doi":"10.1016/j.cviu.2024.104108","DOIUrl":"10.1016/j.cviu.2024.104108","url":null,"abstract":"<div><p>Hierarchical multi-granularity classification (HMC) assigns labels at varying levels of detail to images using a structured hierarchy that categorizes labels from coarse to fine, such as [“Suliformes”, “Fregatidae”, “Frigatebird”]. Traditional HMC methods typically integrate hierarchical label information into either the model’s architecture or its loss function. However, these approaches often overlook the spurious correlations between coarse-level semantic information and fine-grained labels, which can lead models to rely on these non-causal relationships for making predictions. In this paper, we adopt a causal perspective to address the challenges in HMC, demonstrating how coarse-grained semantics can serve as confounders in fine-grained classification. To comprehensively mitigate confounding bias in HMC, we introduce a novel framework, Deconf-HMC, which consists of three main components: (1) a causal-inspired label prediction module that combines fine-level features with coarse-level prediction outcomes to determine the appropriate labels at each hierarchical level; (2) a representation disentanglement module that minimizes the mutual information between representations of different granularities; and (3) an adversarial training module that restricts the predictive influence of coarse-level representations on fine-level labels, thereby aiming to eliminate confounding bias. Extensive experiments on three widely used datasets demonstrate the superiority of our approach over existing state-of-the-art HMC methods.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104108"},"PeriodicalIF":4.3,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142041105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial attention inference model for cascaded siamese tracking with dynamic residual update strategy","authors":"Huanlong Zhang , Mengdan Liu , Xiaohui Song , Yong Wang , Guanglu Yang , Rui Qi","doi":"10.1016/j.cviu.2024.104125","DOIUrl":"10.1016/j.cviu.2024.104125","url":null,"abstract":"<div><p>Target representation is crucial for visual tracking. Most Siamese-based trackers try their best to establish target models by using various deep networks. However, they neglect the exploration of correlation among features, which leads to the inability to learn more representative features. In this paper, we propose a spatial attention inference model for cascaded Siamese tracking with dynamic residual update strategy. First, a spatial attention inference model is constructed. The model fuses interlayer multi-scale features generated by dilation convolution to enhance the spatial representation ability of features. On this basis, we use self-attention to capture interaction between target and context, and use cross-attention to aggregate interdependencies between target and background. The model infers potential feature information by exploiting the correlations among features for building better appearance models. Second, a cascaded localization-aware network is introduced to bridge a gap between classification and regression. We propose an alignment-aware branch to resample and learn object-aware features from the predicted bounding boxes for obtaining localization confidence, which is used to correct the classification confidence by weighted integration. This cascaded strategy alleviates the misalignment problem between classification and regression. Finally, a dynamic residual update strategy is proposed. This strategy utilizes the Context Fusion Network (CFNet) to fuse the templates of historical and current frames to generate the optimal templates. Meanwhile, we use a dynamic threshold function to determine when to update by judging the tracking results. The strategy uses temporal context to fully explore the intrinsic properties of the target, which enhances the adaptability to changes in the target’s appearance. We conducted extensive experiments on seven tracking benchmarks, including OTB100, UAV123, TC128, VOT2016, VOT2018, GOT10k and LaSOT, to validate the effectiveness of our proposed algorithm.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104125"},"PeriodicalIF":4.3,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142041106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coarse-to-fine mechanisms mitigate diffusion limitations on image restoration","authors":"Liyan Wang , Qinyu Yang , Cong Wang , Wei Wang , Zhixun Su","doi":"10.1016/j.cviu.2024.104118","DOIUrl":"10.1016/j.cviu.2024.104118","url":null,"abstract":"<div><p>Recent years have witnessed the remarkable performance of diffusion models in various vision tasks. However, for image restoration that aims to recover clear images with sharper details from given degraded observations, diffusion-based methods may fail to recover promising results due to inaccurate noise estimation. Moreover, simple constraining noises cannot effectively learn complex degradation information, which subsequently hinders the model capacity. To solve the above problems, we propose a coarse-to-fine diffusion Transformer (C2F-DFT) to mitigate diffusion limitations mentioned before on image restoration. Specifically, the proposed C2F-DFT contains diffusion self-attention (DFSA) and diffusion feed-forward network (DFN) within a new coarse-to-fine training mechanism. The DFSA and DFN with embedded diffusion steps respectively capture the long-range diffusion dependencies and learn hierarchy diffusion representation to guide the restoration process in different time steps. In the coarse training stage, our C2F-DFT estimates noises and then generates the final clean image by a sampling algorithm. To further improve the restoration quality, we propose a simple yet effective fine training pipeline. It first exploits the coarse-trained diffusion model with fixed steps to generate restoration results, which then would be constrained with corresponding ground-truth ones to optimize the models to remedy the unsatisfactory results affected by inaccurate noise estimation. Extensive experiments show that C2F-DFT significantly outperforms diffusion-based restoration method IR-SDE and achieves competitive performance compared with Transformer-based state-of-the-art methods on 3 tasks, including image deraining, image deblurring, and real image denoising. The source codes and visual results are available at <span><span>https://github.com/wlydlut/C2F-DFT</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104118"},"PeriodicalIF":4.3,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MKP-Net: Memory knowledge propagation network for point-supervised temporal action localization in livestreaming","authors":"Lin Chen , Jing Zhang , Yian Zhang , Junpeng Kang , Li Zhuo","doi":"10.1016/j.cviu.2024.104109","DOIUrl":"10.1016/j.cviu.2024.104109","url":null,"abstract":"<div><p>Standardized regulation of livestreaming is an important element of cyberspace governance. Temporal action localization (TAL) can localize the occurrence of specific actions to better understand human activities. Due to the short duration and inconspicuous boundaries of human-specific actions, it is very cumbersome to obtain sufficient labeled data for training in untrimmed livestreaming. The point-supervised approach requires only a single-frame annotation for each action instance and can effectively balance cost and performance. Therefore, we propose a memory knowledge propagation network (MKP-Net) for point-supervised temporal action localization in livestreaming, including (1) a plug-and-play memory module is introduced to model prototype features of foreground actions and background knowledge using point-level annotations, (2) the memory knowledge propagation mechanism is used to generate discriminative feature representation in a multi-instance learning pipeline, and (3) localization completeness learning is performed by designing a dual optimization loss for refining and localizing temporal actions. Experimental results show that our method achieves 61.4% and 49.1% SOTAs on THUMOS14 and self-built BJUT-PTAL datasets, respectively, with an inference speed of 711 FPS.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"248 ","pages":"Article 104109"},"PeriodicalIF":4.3,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142048068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}