{"title":"Bidirectional temporal and frame-segment attention for sparse action segmentation of figure skating","authors":"","doi":"10.1016/j.cviu.2024.104186","DOIUrl":"10.1016/j.cviu.2024.104186","url":null,"abstract":"<div><div>Temporal action segmentation is a task for understanding human activities in long-term videos. Most prior efforts have focused on dense-frame actions, which rely on strong correlations between frames. However, in the figure skating scene, technical actions appear only sparsely in the video. This brings a new challenge: a large amount of redundant temporal information leads to weak frame correlation. To address this, we propose a Bidirectional Temporal and Frame-Segment Attention Module (FSAM). Specifically, we introduce an additional reverse-temporal input stream to enhance frame correlation, learned by fusing bidirectional temporal features. In addition, the proposed FSAM contains a Multi-stage segment-aware GCN and decoder interaction module, aiming to learn the correlation between segment features across time domains and to integrate embeddings between frame and segment representations. To evaluate our approach, we introduce the Figure Skating Sparse Action Segmentation (FSSAS) dataset: it comprises 100 samples from the Olympic figure skating final and semi-final competitions, featuring more than 50 different male and female athletes. 
Extensive experiments show that our method achieves an accuracy of 87.75 and an edit score of 90.18 on the FSSAS dataset.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"For a semiotic AI: Bridging computer vision and visual semiotics for computational observation of large scale facial image archives","authors":"","doi":"10.1016/j.cviu.2024.104187","DOIUrl":"10.1016/j.cviu.2024.104187","url":null,"abstract":"<div><div>Social networks are creating a digital world in which the cognitive, emotional, and pragmatic value of the imagery of human faces and bodies is arguably changing. However, researchers in the digital humanities are often ill-equipped to study these phenomena at scale. This work presents FRESCO (Face Representation in E-Societies through Computational Observation), a framework designed to explore the socio-cultural implications of images on social media platforms at scale. FRESCO deconstructs images into numerical and categorical variables using state-of-the-art computer vision techniques, aligning with the principles of visual semiotics. The framework analyzes images across three levels: the plastic level, encompassing fundamental visual features like lines and colors; the figurative level, representing specific entities or concepts; and the enunciation level, which focuses particularly on constructing the point of view of the spectator and observer. These levels are analyzed to discern deeper narrative layers within the imagery. Experimental validation confirms the reliability and utility of FRESCO, and we assess its consistency and precision across two public datasets. 
Subsequently, we introduce the FRESCO score, a metric derived from the framework’s output that serves as a reliable measure of similarity in image content.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"M-adapter: Multi-level image-to-video adaptation for video action recognition","authors":"","doi":"10.1016/j.cviu.2024.104150","DOIUrl":"10.1016/j.cviu.2024.104150","url":null,"abstract":"<div><div>With the growing size of visual foundation models, training video models from scratch has become costly and challenging. Recent attempts focus on transferring frozen pre-trained Image Models (PIMs) to the video domain by tuning inserted learnable parameters such as adapters and prompts. However, these methods require saving PIM activations for gradient calculations, limiting the achievable GPU memory savings. In this paper, we propose a novel parallel branch that adapts the multi-level outputs of the frozen PIM for action recognition. It avoids passing gradients through the PIM, thus naturally incurring a much lower GPU memory footprint. The proposed adaptation branch consists of hierarchically combined multi-level output adapters (M-adapters), each comprising a fusion module and a temporal module. This design bridges the discrepancies between the pre-training task and the target task at lower training cost. We show that with larger models, or in scenarios with higher demands on temporal modelling, the proposed method performs better than full-parameter tuning. 
Finally, despite tuning far fewer parameters, our method achieves performance superior or comparable to current state-of-the-art methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142358236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial attention for human-centric visual understanding: An Information Bottleneck method","authors":"","doi":"10.1016/j.cviu.2024.104180","DOIUrl":"10.1016/j.cviu.2024.104180","url":null,"abstract":"<div><div>The selective visual attention mechanism in the Human Visual System (HVS) restricts the amount of information that reaches human visual awareness, allowing the brain to perceive high-fidelity natural scenes in real-time with limited computational cost. This selectivity acts as an “Information Bottleneck (IB)” that balances information compression and predictive accuracy. However, such information constraints are rarely explored in the attention mechanism for deep neural networks (DNNs). This paper introduces an IB-inspired spatial attention module for DNNs, which generates an attention map by minimizing the mutual information (MI) between the attentive content and the input while maximizing that between the attentive content and the output. We develop this IB-inspired attention mechanism based on a novel graphical model and explore various implementations of the framework. We show that our approach can yield attention maps that neatly highlight the regions of interest while suppressing the backgrounds, and are interpretable for the decision-making of the DNNs. To validate the effectiveness of the proposed IB-inspired attention mechanism, we apply it to various computer vision tasks including image classification, fine-grained recognition, cross-domain classification, semantic segmentation, and object detection. 
Extensive experiments demonstrate that it boosts standard DNN architectures both quantitatively and qualitatively on these tasks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142327799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodality-guided Visual-Caption Semantic Enhancement","authors":"","doi":"10.1016/j.cviu.2024.104139","DOIUrl":"10.1016/j.cviu.2024.104139","url":null,"abstract":"<div><div>Video captions generated from a single modality, e.g. video clips, often suffer from insufficient event discovery and inadequate scene description. Therefore, this paper aims to improve the quality of captions by addressing these issues through the integration of multi-modal information. Specifically, we first construct a multi-modal dataset and introduce triplet annotations of video, audio and text, fostering a comprehensive exploration of the associations between different modalities. Building upon this, we propose to explore the collaborative perception of audio and visual concepts, incorporating audio-visual perception priors to mitigate inaccuracies and incompleteness in the captions of vision-based benchmarks. To achieve this, we extract effective semantic features from the visual and auditory modalities, bridge the semantic gap between audio-visual modalities and text, and form a more precise knowledge graph through a multimodal coherence checking and information pruning mechanism. Extensive experiments demonstrate that the proposed approach surpasses existing methods and generalizes well with the assistance of ChatGPT.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bridging the gap between object detection in close-up and high-resolution wide shots","authors":"","doi":"10.1016/j.cviu.2024.104181","DOIUrl":"10.1016/j.cviu.2024.104181","url":null,"abstract":"<div><div>Recent years have seen a significant rise in gigapixel-level image/video capture systems and benchmarks with high-resolution wide (HRW) shots. Different from close-up shots like those in MS COCO, the higher resolution and wider field of view raise new research and application problems, such as how to perform accurate and efficient object detection with such large inputs on low-power edge devices like UAVs. HRW shots pose several unique challenges. (1) Sparse information: the objects of interest cover less area. (2) Varying scales: object scales change by 10 to 100<span><math><mo>×</mo></math></span> within a single image. (3) Incomplete objects: the sliding-window strategy used to handle the large input leads to truncated objects at window edges. (4) Multi-scale information: it is unclear how to use multi-scale information in training and inference. Consequently, directly applying a close-up detector is both inaccurate and inefficient. In this paper, we systematically investigate this problem and bridge the gap between object detection in close-up and HRW shots by introducing a novel sparse architecture that can be integrated with common networks like ConvNets and Transformers. It leverages alternative sparse learning to complementarily fuse coarse-grained and fine-grained features to (1) adaptively extract valuable information from (2) different object scales. We also propose a novel Cross-window Non-Maximum Suppression (C-NMS) algorithm to (3) improve box merging across different windows. Furthermore, we propose a (4) simple yet effective multi-scale training and inference strategy to improve accuracy. 
Experiments on two benchmarks with HRW shots, PANDA and DOTA-v1.0, demonstrate that our methods significantly improve accuracy (by up to 5.8%) and speed (by up to 3<span><math><mo>×</mo></math></span>) over state-of-the-art methods, for both ConvNet- and Transformer-based detectors, on edge devices. Our code is open-sourced and available at <span><span>https://github.com/liwenxi/SparseFormer</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142327798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deformable surface reconstruction via Riemannian metric preservation","authors":"","doi":"10.1016/j.cviu.2024.104155","DOIUrl":"10.1016/j.cviu.2024.104155","url":null,"abstract":"<div><div>Estimating the pose of an object from a monocular image is a fundamental inverse problem in computer vision. Due to its ill-posed nature, solving this problem requires incorporating deformation priors. In practice, many materials do not perceptibly shrink or extend when manipulated, constituting a reliable and well-known prior. Mathematically, this translates to the preservation of the Riemannian metric. Neural networks offer the perfect playground to solve the surface reconstruction problem, as they can approximate surfaces with arbitrary precision and allow the computation of differential geometry quantities. This paper presents an approach for inferring continuous deformable surfaces from a sequence of images, which is benchmarked against several techniques and achieves state-of-the-art performance without the need for offline training. Because it performs per-frame optimization, our method can refine its estimates, unlike those that perform a single inference step. 
Despite enforcing differential geometry constraints at each update, our approach is the fastest of all the tested optimization-based methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224002364/pdfft?md5=e37118b164489f2910fb59a519a86d29&pid=1-s2.0-S1077314224002364-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142312278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LCMA-Net: A light cross-modal attention network for streamer re-identification in live video","authors":"","doi":"10.1016/j.cviu.2024.104183","DOIUrl":"10.1016/j.cviu.2024.104183","url":null,"abstract":"<div><div>With the rapid expansion of the we-media industry, streamers have increasingly incorporated inappropriate content into live videos to attract traffic and pursue profit. Blacklisted streamers often forge their identities or switch platforms to continue streaming, causing significant harm to the online environment. Consequently, streamer re-identification (re-ID) has become of paramount importance. Streamer biometrics in live videos exhibit multimodal characteristics, including voiceprints, faces, and spatiotemporal information, which complement each other. Therefore, we propose a light cross-modal attention network (LCMA-Net) for streamer re-ID in live videos. First, the voiceprint, face, and spatiotemporal features of the streamer are extracted by RawNet-SA, <span><math><mi>Π</mi></math></span>-Net, and STDA-ResNeXt3D, respectively. We then design a light cross-modal pooling attention (LCMPA) module, which, combined with a multilayer perceptron (MLP), aligns and concatenates different modality features into multimodal features within the LCMA-Net. Finally, the streamer is re-identified by measuring the similarity between these multimodal features. Five experiments were conducted on the StreamerReID dataset, and the results demonstrated that the proposed method achieved competitive performance. 
The dataset and code are available at <span><span>https://github.com/BJUT-AIVBD/LCMA-Net</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142358309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Specular highlight removal using Quaternion transformer","authors":"","doi":"10.1016/j.cviu.2024.104179","DOIUrl":"10.1016/j.cviu.2024.104179","url":null,"abstract":"<div><div>Specular highlight removal is an important problem, because specular reflections under changing illumination can severely degrade various computer vision and image processing tasks. Many state-of-the-art networks for specular removal use convolutional neural networks (CNNs), which cannot learn global context effectively: they capture spatial information while overlooking the 3D structural correlation of an RGB image. To address this problem, we introduce a specular highlight removal network based on a Quaternion transformer (QformerSHR), which employs a transformer architecture built on the Quaternion representation. In particular, a depth-wise separable Quaternion convolutional layer (DSQConv) is proposed to enhance the computational performance of QformerSHR while efficiently preserving the structural correlation of an RGB image through the Quaternion representation. In addition, a Quaternion transformer block (QTB) based on DSQConv learns global context. As a result, QformerSHR, consisting of DSQConv and QTB, performs specular removal effectively on both natural and text image datasets. 
Experimental results demonstrate that it is significantly more effective than state-of-the-art networks for specular removal, in terms of both quantitative performance and subjective quality.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating optical flow: A comprehensive review of the state of the art","authors":"","doi":"10.1016/j.cviu.2024.104160","DOIUrl":"10.1016/j.cviu.2024.104160","url":null,"abstract":"<div><div>Optical flow estimation is a crucial task in computer vision that provides low-level motion information. Despite recent advances, real-world applications still present significant challenges. This survey provides an overview of optical flow techniques and their application. For a comprehensive review, this survey covers both classical frameworks and the latest AI-based techniques. In doing so, we highlight the limitations of current benchmarks and metrics, underscoring the need for more representative datasets and comprehensive evaluation methods. The survey also highlights the importance of integrating industry knowledge and adopting training practices optimized for deep learning-based models. By addressing these issues, future research can aid the development of robust and efficient optical flow methods that can effectively address real-world scenarios.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224002418/pdfft?md5=0e040acf6e4116194d80885aeb4b2b49&pid=1-s2.0-S1077314224002418-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142312277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}