{"title":"TEMSA:Text enhanced modal representation learning for multimodal sentiment analysis","authors":"Jingwen Chen , Shuxiang Song , Yumei Tan , Haiying Xia","doi":"10.1016/j.cviu.2025.104391","DOIUrl":"10.1016/j.cviu.2025.104391","url":null,"abstract":"<div><div>Multimodal sentiment analysis aims to identify human emotions by leveraging multimodal information, including language, visual, and audio data. Most existing models focus on extracting common features across modalities or simply integrating heterogeneous multimodal data. However, such approaches often overlook the unique representation advantages of individual modalities, as they treat all modalities equally and use bidirectional information transfer mechanisms. This can lead to information redundancy and feature conflicts. To address this challenge, we propose a Text-Enhanced Modal Representation Learning Model (TEMSA), which builds robust and unified multimodal representations through the design of text-guided pairwise cross-modal mapping modules. Specifically, TEMSA employs a text-guided multi-head cross-attention mechanism to embed linguistic information into the emotion-related representation learning of non-linguistic modalities, thereby enhancing the representations of visual and audio modalities. In addition to preserving consistent information through cross-modal mapping, TEMSA also incorporates text-guided reconstruction modules, which leverage text-enhanced non-linguistic modal features to decouple modality-specific representations from non-linguistic modalities. This dual representation learning framework captures inter-modal consistent information through cross-modal mapping, and extracts modal difference information through intra-modal decoupling, thus improving the understanding of cross-modal affective associations. The experimental results on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TEMSA achieves superior performance, highlighting the critical role of text-guided cross-modal and intra-modal representation learning in multimodal sentiment analysis.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104391"},"PeriodicalIF":4.3,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CSPPNet: Cascade space pyramid pooling network for object detection","authors":"Yafeng Liu, Yongsheng Dong","doi":"10.1016/j.cviu.2025.104377","DOIUrl":"10.1016/j.cviu.2025.104377","url":null,"abstract":"<div><div>Real-time object detection, as an important research direction in the field of computer vision, aims to achieve fast and accurate object detection. However, many current methods fail to achieve a balance between speed, parameters, and accuracy. To alleviate this problem, in this paper, we construct a novel cascade spatial pyramid pooling network (CSPPNet) for object detection. In particular, we first propose a cascade feature fusion (CFF) module, which combines the novel cascade cross-layer structure and GSConv convolution to lighten the existing necking structure and improve the detection accuracy of the model without adding a large number of parameters. In addition, in order to alleviate the loss of feature detail information due to max pooling, we further propose the nest space pooling (NSP) module, which combines nest feature fusion with max pooling operations to improve the fusion performance of local feature information with global feature information. Experimental results show that our CSPPNet is competitive, achieving 43.1% AP on the MS-COCO 2017 test-dev dataset.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104377"},"PeriodicalIF":4.3,"publicationDate":"2025-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EUN: Enhanced unlearnable examples generation approach for privacy protection","authors":"Xiaotian Chen , Yang Xu , Sicong Zhang , Jiale Yan , Weida Xu , Xinlong He","doi":"10.1016/j.cviu.2025.104388","DOIUrl":"10.1016/j.cviu.2025.104388","url":null,"abstract":"<div><div>In the era of artificial intelligence, the importance of protecting user privacy has become increasingly prominent. Unlearnable examples prevent deep learning models from learning semantic features in images by adding perturbations or noise that are imperceptible to the human eye. Existing perturbation generation methods are not robust to defense methods or are only robust to one defense method. To address this problem, we propose an enhanced perturbation generation method for unlearnable examples. This method generates the perturbation by performing a class-wise convolution on the image and changing a pixel in the local position of the image. This method is robust to multiple defense methods. In addition, by adjusting the order of global position convolution and local position pixel change of the image, variants of the method were generated and analyzed. We have tested our method on a variety of datasets with a variety of models, and compared with 6 perturbation generation methods. The results demonstrate that the clean test accuracy of the enhanced perturbation generation method for unlearnable examples is still less than 35% when facing defense methods such as image shortcut squeezing, adversarial training, and adversarial augmentation. It outperforms existing perturbation generation methods in many aspects, and is 20% lower than CUDA and OPS, two excellent perturbation generation methods, under several parameter settings.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104388"},"PeriodicalIF":4.3,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AODGCN: Adaptive object detection with attention-guided dynamic graph convolutional network","authors":"Meng Zhang, Yina Guo, Haidong Wang, Hong Shangguan","doi":"10.1016/j.cviu.2025.104386","DOIUrl":"10.1016/j.cviu.2025.104386","url":null,"abstract":"<div><div>Various classifiers based on convolutional neural networks have been successfully applied to image classification in object detection. However, object detection is much more sophisticated and most classifiers used in this context exhibit limitations in capturing contextual information, particularly in scenarios with complex backgrounds or occlusions. Additionally, they lack spatial awareness, resulting in the loss of spatial structure and inadequate modeling of object details and context. In this paper, we propose an adaptive object detection approach using an attention-guided dynamic graph convolutional network (AODGCN). AODGCN represents images as graphs, enabling the capture of object properties such as connectivity, proximity, and hierarchical relationships. Attention mechanisms guide the model to focus on informative regions, highlighting relevant features while suppressing background information. This attention-guided approach enhances the model’s ability to capture discriminative features. Furthermore, the dynamic graph convolutional network (D-GCN) adjusts the receptive field size and weight coefficients based on object characteristics, enabling adaptive detection of objects with varying sizes. The achieved results demonstrate the effectiveness of AODGCN on the MS-COCO 2017 dataset, with a significant improvement of 1.6% in terms of mean average precision (mAP) compared to state-of-the-art algorithms.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104386"},"PeriodicalIF":4.3,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143942239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Equipping sketch patches with context-aware positional encoding for graphic sketch representation","authors":"Sicong Zang, Zhijun Fang","doi":"10.1016/j.cviu.2025.104385","DOIUrl":"10.1016/j.cviu.2025.104385","url":null,"abstract":"<div><div>When benefiting graphic sketch representation with sketch drawing orders, recent studies have linked sketch patches as graph edges by drawing orders in accordance to a temporal-based nearest neighboring strategy. However, such constructed graph edges may be unreliable, since the contextual relationships between patches may be inconsistent with the sequential positions in drawing orders, due to variants of sketch drawings. In this paper, we propose a variant-drawing-protected method by equipping sketch patches with context-aware positional encoding (PE) to make better use of drawing orders for sketch learning. We introduce a sinusoidal absolute PE to embed the sequential positions in drawing orders, and a learnable relative PE to encode the unseen contextual relationships between patches. Both types of PEs never attend the construction of graph edges, but are injected into graph nodes to cooperate with the visual patterns captured from patches. After linking nodes by semantic proximity, during message aggregation via graph convolutional networks, each node receives both semantic features from patches and contextual information from PEs from its neighbors, which equips local patch patterns with global contextual information, further obtaining drawing-order-enhanced sketch representations. Experimental results indicate that our method significantly improves sketch healing and controllable sketch synthesis.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104385"},"PeriodicalIF":4.3,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143946687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"STELA: Spatial–temporal enhanced learning with an anatomical graph transformer for 3D human pose estimation","authors":"Jian Son, Jiho Lee, Eunwoo Kim","doi":"10.1016/j.cviu.2025.104381","DOIUrl":"10.1016/j.cviu.2025.104381","url":null,"abstract":"<div><div>Transformers have led to remarkable performance improvements in 3D human pose estimation by capturing global dependencies between joints in spatial and temporal aspects. To leverage human body topology information, attempts have been made to incorporate graph representation within a transformer architecture. However, they neglect spatial–temporal anatomical knowledge inherent in the human body, without considering the implicit relationships of non-connected joints. Furthermore, they disregard the movement patterns between joint trajectories, concentrating on the trajectories of individual joints. In this paper, we propose Spatial–Temporal Enhanced Learning with an Anatomical graph transformer (STELA) to aggregate the spatial–temporal global relationships and intricate anatomical relationships between joints. It consists of Global Self-attention (GS) and Anatomical Graph-attention (AG) branches. GS learns long-range dependencies between all joints across entire frames. AG focuses on the anatomical relationships of the human body in the spatial–temporal aspect using skeleton and motion pattern graphs. Extensive experiments demonstrate that STELA outperforms state-of-the-art approaches with an average of 41% fewer parameters, reducing MPJPE by an average of 2.7 mm on Human3.6M and 1.5 mm on MPI-INF-3DHP.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104381"},"PeriodicalIF":4.3,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143936422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Grad-CAM: The impact of large receptive fields and other caveats","authors":"Rui Santos, João Pedrosa, Ana Maria Mendonça, Aurélio Campilho","doi":"10.1016/j.cviu.2025.104383","DOIUrl":"10.1016/j.cviu.2025.104383","url":null,"abstract":"<div><div>The increase in complexity of deep learning models demands explanations that can be obtained with methods like Grad-CAM. This method computes an importance map for the last convolutional layer relative to a specific class, which is then upsampled to match the size of the input. However, this final step assumes that there is a spatial correspondence between the last feature map and the input, which may not be the case. We hypothesize that, for models with large receptive fields, the feature spatial organization is not kept during the forward pass, which may render the explanations devoid of meaning. To test this hypothesis, common architectures were applied to a medical scenario on the public VinDr-CXR dataset, to a subset of ImageNet and to datasets derived from MNIST. The results show a significant dispersion of the spatial information, which goes against the assumption of Grad-CAM, and that explainability maps are affected by this dispersion. Furthermore, we discuss several other caveats regarding Grad-CAM, such as feature map rectification, empty maps and the impact of global average pooling or flatten layers. Altogether, this work addresses some key limitations of Grad-CAM which may go unnoticed for common users, taking one step further in the pursuit for more reliable explainability methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104383"},"PeriodicalIF":4.3,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143942238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hyperspectral image classification using hybrid convolutional-based cross-patch retentive network","authors":"Rajat Kumar Arya, Rohith Peddi, Rajeev Srivastava","doi":"10.1016/j.cviu.2025.104382","DOIUrl":"10.1016/j.cviu.2025.104382","url":null,"abstract":"<div><div>Vision transformer (ViT) is a widely used method to capture long-distance dependencies and has demonstrated remarkable results in classifying hyperspectral images (HSIs). Nevertheless, the fundamental component of ViT, self-attention, has difficulty striking a balance between global modeling and high computational complexity across entire input sequences. Recently, the Retentive Network (RetNet) was developed to address this issue, claiming to be more scalable and efficient than standard transformers. However, RetNet struggles to capture local features such as traditional transformers. This paper proposes a RetNet-based novel hybrid convolutional-based cross-patch retentive network (HCCRN). The proposed HCCRN model comprises a hybrid convolutional-based feature extraction (HCFE) module, a weighted feature tokenization module, and a cross-patch retentive network (CRN) module. The HCFE architecture combines four 2D convolutional layers and residual connections with a 3D convolutional layer to extract high-level fused spatial–spectral information and capture low-level spectral features. This hybrid method solves the vanishing gradient issue and comprehensively represents intricate spatial–spectral interactions by enabling hierarchical learning of spectral context and spatial dependencies. To further maximize processing efficiency, the acquired spatial–spectral data are transformed into semantic tokens by the tokenization module, which feeds them into the CRN module. CRN enriches feature representations and increases accuracy by utilizing a multi-head cross-patch retention mechanism to capture numerous semantic relations between input tokens. Extensive experiments on three benchmark datasets have shown that the proposed HCCRN architecture significantly outperforms state-of-the-art methods. It reduces computation time and increases classification accuracy, demonstrating its generalizability and robustness in the HSIC task.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104382"},"PeriodicalIF":4.3,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143929245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Egocentric and exocentric methods: A short survey","authors":"Anirudh Thatipelli , Shao-Yuan Lo , Amit K. Roy-Chowdhury","doi":"10.1016/j.cviu.2025.104371","DOIUrl":"10.1016/j.cviu.2025.104371","url":null,"abstract":"<div><div>Egocentric vision captures the scene from the point of view of the camera wearer while exocentric vision captures the overall scene context. Jointly modeling ego and exo views is crucial to developing next-generation AI agents. The community has regained interest in the field of egocentric vision. While the third-person view and first-person have been thoroughly investigated, very few works aim to study both synchronously. Exocentric videos contain many relevant signals that are transferrable to egocentric videos. This paper provides a timely overview of works combining egocentric and exocentric visions, a very new but promising research topic. We describe in detail the datasets and present a survey of the key applications of ego-exo joint learning, where we identify the most recent advances. With the presentation of the current status of the progress, we believe this short but timely survey will be valuable to the broad video-understanding community, particularly when multi-view modeling is critical.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104371"},"PeriodicalIF":4.3,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143923305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Underwater image quality evaluation via deep meta-learning: Dataset and objective method","authors":"Tianhai Chen , Xichen Yang , Tianshu Wang , Nengxin Li , Shun Zhu , Xiaobo Shen","doi":"10.1016/j.cviu.2025.104380","DOIUrl":"10.1016/j.cviu.2025.104380","url":null,"abstract":"<div><div>The degradation of underwater image quality due to complex environments affects the effectiveness of the application, making accurate quality assessment crucial. However, existing Underwater Image Quality Assessment (UIQA) methods lack sufficient reliable data. To address this, we construct the DART2024 dataset, containing 1,000 raw images and 10,000 distorted images generated by 10 enhancement methods, covering diverse underwater scenarios. We propose a novel UIQA method that weights original images via gradient maps, highlights details, constructs a multi-scale deep neural network with perception, fusion, and prediction modules to describe quality characteristics, and designs a meta-learning framework for rapid adaptation to unknown distortions. The experimental results show that DART2024 is credible and meets the training requirements. Our method outperforms SOTA approaches in accuracy, stability, and convergence speed on DART2024 and other underwater datasets. It also shows higher applicability on natural scene datasets. The dataset and source code for the proposed method can be made available at <span><span>https://github.com/dart-into/DART2024</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104380"},"PeriodicalIF":4.3,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143923306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}