Latest Articles — Computer Vision and Image Understanding

Bidirectional temporal and frame-segment attention for sparse action segmentation of figure skating
IF 4.3 · CAS Tier 3 · Computer Science
Computer Vision and Image Understanding · Pub Date: 2024-09-26 · DOI: 10.1016/j.cviu.2024.104186
Abstract: Temporal action segmentation is a task for understanding human activities in long-term videos. Most efforts have focused on densely occurring actions, which rely on strong correlations between frames. In the figure skating scene, however, technical actions appear only sparsely in the video. This brings a new challenge: a large amount of redundant temporal information leads to weak frame correlation. To this end, we propose a Bidirectional Temporal and Frame-Segment Attention Module (FSAM). Specifically, we propose an additional reverse-temporal input stream to enhance frame correlation, learned by fusing bidirectional temporal features. In addition, the proposed FSAM contains a multi-stage segment-aware GCN and a decoder interaction module, aiming to learn the correlation between segment features across time domains and to integrate embeddings between frame and segment representations. To evaluate our approach, we propose the Figure Skating Sparse Action Segmentation (FSSAS) dataset, comprising 100 samples from the Olympic figure skating final and semi-final competitions and covering more than 50 different men and women athletes. Extensive experiments show that our method achieves an accuracy of 87.75 and an edit score of 90.18 on the FSSAS dataset.
Cited: 0
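The core of the bidirectional idea described above is running a shared temporal encoder over the frame sequence in both directions and fusing the two streams. Below is a minimal, hypothetical sketch of that fusion step; the layer choices and names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BidirectionalTemporalFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # a shared 1-D temporal convolution stands in for the temporal encoder
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, frames):                    # frames: (batch, time, dim)
        x = frames.transpose(1, 2)                # (batch, dim, time) for Conv1d
        fwd = self.temporal(x)                    # forward-time stream
        bwd = self.temporal(x.flip(-1)).flip(-1)  # reverse-time stream, re-aligned
        fused = torch.cat([fwd, bwd], dim=1).transpose(1, 2)  # (batch, time, 2*dim)
        return self.fuse(fused)                   # (batch, time, dim)

feats = torch.randn(2, 100, 256)                  # 100 frames of 256-D features
out = BidirectionalTemporalFusion()(feats)        # (2, 100, 256)
```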
For a semiotic AI: Bridging computer vision and visual semiotics for computational observation of large scale facial image archives
IF 4.3 · CAS Tier 3 · Computer Science
Computer Vision and Image Understanding · Pub Date: 2024-09-26 · DOI: 10.1016/j.cviu.2024.104187
Abstract: Social networks are creating a digital world in which the cognitive, emotional, and pragmatic value of the imagery of human faces and bodies is arguably changing. However, researchers in the digital humanities are often ill-equipped to study these phenomena at scale. This work presents FRESCO (Face Representation in E-Societies through Computational Observation), a framework designed to explore the socio-cultural implications of images on social media platforms at scale. FRESCO deconstructs images into numerical and categorical variables using state-of-the-art computer vision techniques, in line with the principles of visual semiotics. The framework analyzes images at three levels: the plastic level, encompassing fundamental visual features such as lines and colors; the figurative level, representing specific entities or concepts; and the enunciation level, which focuses on how the point of view of the spectator and observer is constructed. These levels are analyzed to discern deeper narrative layers within the imagery. Experimental validation confirms the reliability and utility of FRESCO, and we assess its consistency and precision on two public datasets. We then introduce the FRESCO score, a metric derived from the framework's output that serves as a reliable measure of similarity in image content.
Cited: 0
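The FRESCO score is described as a similarity measure computed from the framework's numerical and categorical outputs. As a purely illustrative sketch, such a score could blend a cosine term over numerical variables with a set-overlap term over categorical labels; the weighting and feature choices below are assumptions, not the published definition.

```python
import numpy as np

def fresco_like_score(num_a, num_b, cats_a, cats_b, w=0.5):
    """Blend cosine similarity of numerical features with Jaccard overlap of
    categorical labels (e.g. detected entities), both mapped to [0, 1]."""
    a, b = np.asarray(num_a, float), np.asarray(num_b, float)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    cos = (cos + 1) / 2                       # map [-1, 1] -> [0, 1]
    ja = len(set(cats_a) & set(cats_b)) / max(len(set(cats_a) | set(cats_b)), 1)
    return w * cos + (1 - w) * ja

print(fresco_like_score([0.2, 0.7], [0.3, 0.6], {"face", "close-up"}, {"face"}))
```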
M-adapter: Multi-level image-to-video adaptation for video action recognition
IF 4.3 · CAS Tier 3 · Computer Science
Computer Vision and Image Understanding · Pub Date: 2024-09-25 · DOI: 10.1016/j.cviu.2024.104150
Abstract: With the growing size of visual foundation models, training video models from scratch has become costly and challenging. Recent attempts transfer frozen pre-trained Image Models (PIMs) to the video domain by tuning inserted learnable parameters such as adapters and prompts. However, these methods still require saving PIM activations for gradient calculation, leading to limited savings of GPU memory. In this paper, we propose a novel parallel branch that adapts the multi-level outputs of the frozen PIM for action recognition. It avoids passing gradients through the PIM and thus naturally has a much lower GPU memory footprint. The proposed adaptation branch consists of hierarchically combined multi-level output adapters (M-adapters), each comprising a fusion module and a temporal module. This design absorbs the discrepancies between the pre-training task and the target task at a lower training cost. We show that with larger models, or in scenarios with higher demands on temporal modelling, the proposed method outperforms full-parameter tuning. Finally, despite tuning far fewer parameters, our method achieves superior or comparable performance against current state-of-the-art methods.
Cited: 0
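Because the pre-trained image model is frozen and only a parallel branch over its multi-level outputs is trained, the backbone can be run without keeping gradients or activations. A minimal, hypothetical sketch of this arrangement follows; the adapter, temporal module, and dimensions are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class MultiLevelAdapterBranch(nn.Module):
    def __init__(self, dims=(192, 384, 768), out_dim=768, num_classes=400):
        super().__init__()
        # one small adapter per PIM stage, then a fusion + temporal head
        self.adapters = nn.ModuleList(nn.Linear(d, out_dim) for d in dims)
        self.temporal = nn.GRU(out_dim, out_dim, batch_first=True)
        self.head = nn.Linear(out_dim, num_classes)

    def forward(self, multi_level_feats):  # list of (batch, time, dim_i) tensors
        fused = sum(a(f) for a, f in zip(self.adapters, multi_level_feats))
        h, _ = self.temporal(fused)        # simple temporal module over frames
        return self.head(h.mean(dim=1))    # clip-level logits

# The PIM forward pass would be wrapped in no_grad, so no activations are kept:
# with torch.no_grad():
#     feats = [stage(video_frames) for stage in pim_stages]   # hypothetical
branch = MultiLevelAdapterBranch()
feats = [torch.randn(2, 8, d) for d in (192, 384, 768)]       # stand-in PIM outputs
logits = branch(feats)                                        # (2, 400)
```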
Spatial attention for human-centric visual understanding: An Information Bottleneck method
IF 4.3 · CAS Tier 3 · Computer Science
Computer Vision and Image Understanding · Pub Date: 2024-09-24 · DOI: 10.1016/j.cviu.2024.104180
Abstract: The selective visual attention mechanism in the Human Visual System (HVS) restricts the amount of information that reaches human visual awareness, allowing the brain to perceive high-fidelity natural scenes in real time at limited computational cost. This selectivity acts as an "Information Bottleneck (IB)" that balances information compression and predictive accuracy. However, such information constraints are rarely explored in the attention mechanisms of deep neural networks (DNNs). This paper introduces an IB-inspired spatial attention module for DNNs, which generates an attention map by minimizing the mutual information (MI) between the attentive content and the input while maximizing that between the attentive content and the output. We develop this IB-inspired attention mechanism based on a novel graphical model and explore various implementations of the framework. We show that our approach can yield attention maps that neatly highlight the regions of interest while suppressing the background, and that are interpretable for the decision-making of the DNNs. To validate the effectiveness of the proposed IB-inspired attention mechanism, we apply it to various computer vision tasks including image classification, fine-grained recognition, cross-domain classification, semantic segmentation, and object detection. Extensive experiments demonstrate that it bootstraps standard DNN structures quantitatively and qualitatively for these tasks.
Cited: 0
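The trade-off described in the abstract is the classical Information Bottleneck objective. In generic notation (Z the attentive content, X the input, Y the task output, and β the compression–prediction trade-off weight; the symbols are ours, not necessarily the paper's):

```latex
\[
  \min_{p(z \mid x)} \; I(Z; X) \;-\; \beta \, I(Z; Y)
\]
```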
Multimodality-guided Visual-Caption Semantic Enhancement
IF 4.3 · CAS Tier 3 · Computer Science
Computer Vision and Image Understanding · Pub Date: 2024-09-23 · DOI: 10.1016/j.cviu.2024.104139
Abstract: Video captions generated from a single modality, e.g. video clips alone, often suffer from insufficient event discovery and inadequate scene description. This paper therefore aims to improve caption quality by integrating multi-modal information. Specifically, we first construct a multi-modal dataset with triplet annotations of video, audio, and text, fostering a comprehensive exploration of the associations between different modalities. Building upon this, we propose to exploit the collaborative perception of audio and visual concepts to mitigate inaccuracies and incompleteness in the captions of vision-based benchmarks by incorporating audio-visual perception priors. To achieve this, we extract effective semantic features from the visual and auditory modalities, bridge the semantic gap between the audio-visual modalities and text, and form a more precise knowledge-graph-based multimodal coherence checking and information pruning mechanism. Exhaustive experiments demonstrate that the proposed approach surpasses existing methods and generalizes well with the assistance of ChatGPT.
Cited: 0
Bridging the gap between object detection in close-up and high-resolution wide shots
IF 4.3 · CAS Tier 3 · Computer Science
Computer Vision and Image Understanding · Pub Date: 2024-09-23 · DOI: 10.1016/j.cviu.2024.104181
Abstract: Recent years have seen a significant rise in gigapixel-level image/video capture systems and benchmarks with high-resolution wide (HRW) shots. Compared with close-up shots such as MS COCO, the higher resolution and wider field of view raise new research and application problems, such as how to perform accurate and efficient object detection with such large inputs on low-power edge devices like UAVs. HRW shots pose several unique challenges. (1) Sparse information: the objects of interest cover less area. (2) Varied scale: object scale changes by 10 to 100× within a single image. (3) Incomplete objects: the sliding-window strategy used to handle the large input leads to truncated objects at window edges. (4) Multi-scale information: it is unclear how to use multi-scale information in training and inference. Consequently, directly applying a close-up detector is inaccurate and inefficient. In this paper, we systematically investigate this problem and bridge the gap between object detection in close-up and HRW shots by introducing a novel sparse architecture that can be integrated with common networks such as ConvNets and Transformers. It leverages alternating sparse learning to complementarily fuse coarse-grained and fine-grained features, (1) adaptively extracting valuable information from (2) different object scales. We also propose a novel Cross-window Non-Maximum Suppression (C-NMS) algorithm to (3) improve box merging across windows. Furthermore, we propose a (4) simple yet effective multi-scale training and inference strategy to improve accuracy. Experiments on two benchmarks with HRW shots, PANDA and DOTA-v1.0, demonstrate that our methods significantly improve accuracy (by up to 5.8%) and speed (by up to 3×) over state-of-the-art methods, for both ConvNet- and Transformer-based detectors, on edge devices. Our code is open-sourced and available at https://github.com/liwenxi/SparseFormer.
Cited: 0
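Handling an HRW shot with sliding windows means detections from different windows must be shifted into global coordinates and merged; the paper's C-NMS refines exactly this step for boxes truncated at window borders. Below is a plain baseline version of that cross-window merge, for illustration only; the function names are assumptions and the snippet does not reproduce the C-NMS logic itself.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS. boxes: (N, 4) as [x1, y1, x2, y2] in global coordinates."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-8)
        order = order[1:][iou <= iou_thr]
    return keep

def merge_windows(window_dets):
    """window_dets: list of (offset_xy, boxes, scores) from each sliding window."""
    all_boxes, all_scores = [], []
    for (ox, oy), boxes, scores in window_dets:
        all_boxes.append(boxes + np.array([ox, oy, ox, oy]))  # window -> global coords
        all_scores.append(scores)
    boxes = np.concatenate(all_boxes)
    scores = np.concatenate(all_scores)
    keep = nms(boxes, scores)
    return boxes[keep], scores[keep]
```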
Deformable surface reconstruction via Riemannian metric preservation
IF 4.3 · CAS Tier 3 · Computer Science
Computer Vision and Image Understanding · Pub Date: 2024-09-19 · DOI: 10.1016/j.cviu.2024.104155
Abstract: Estimating the pose of an object from a monocular image is a fundamental inverse problem in computer vision. Due to its ill-posed nature, solving this problem requires incorporating deformation priors. In practice, many materials do not perceptibly shrink or stretch when manipulated, which constitutes a reliable and well-known prior. Mathematically, this translates to the preservation of the Riemannian metric. Neural networks offer the perfect playground for the surface reconstruction problem, as they can approximate surfaces with arbitrary precision and allow the computation of differential-geometry quantities. This paper presents an approach for inferring continuous deformable surfaces from a sequence of images, which is benchmarked against several techniques and achieves state-of-the-art performance without the need for offline training. As a method that performs per-frame optimization, it can refine its estimates, unlike methods based on a single inference step. Despite enforcing differential-geometry constraints at each update, our approach is the fastest of all the tested optimization-based methods.
Cited: 0
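Preserving the Riemannian metric means the first fundamental form of the map from the 2-D template to the deformed 3-D surface should not change. A minimal sketch of that constraint as a training loss is shown below, assuming an isometric (identity-metric) reference parameterization; this is an illustration of the idea, not the paper's implementation.

```python
import torch

def metric_preservation_loss(f, uv, ref_metric=None):
    """f: callable mapping (N, 2) template coords to (N, 3) surface points."""
    uv = uv.clone().requires_grad_(True)
    xyz = f(uv)                                            # (N, 3)
    # Jacobian of the embedding: d(xyz_k)/d(uv), assembled per output dimension
    jac = torch.stack([
        torch.autograd.grad(xyz[:, k].sum(), uv, create_graph=True)[0]
        for k in range(3)
    ], dim=1)                                              # (N, 3, 2)
    metric = jac.transpose(1, 2) @ jac                     # (N, 2, 2) = J^T J
    if ref_metric is None:
        ref_metric = torch.eye(2).expand_as(metric)        # assume isometric template
    return ((metric - ref_metric) ** 2).mean()

# Usage with a toy surface model standing in for the reconstruction network:
surf = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 3))
uv = torch.rand(128, 2)
loss = metric_preservation_loss(surf, uv)
loss.backward()
```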
LCMA-Net: A light cross-modal attention network for streamer re-identification in live video
IF 4.3 · CAS Tier 3 · Computer Science
Computer Vision and Image Understanding · Pub Date: 2024-09-19 · DOI: 10.1016/j.cviu.2024.104183
Abstract: With the rapid expansion of the we-media industry, streamers have increasingly incorporated inappropriate content into live videos to attract traffic and pursue profit. Blacklisted streamers often forge their identities or switch platforms to continue streaming, causing significant harm to the online environment. Consequently, streamer re-identification (re-ID) has become of paramount importance. Streamer biometrics in live videos exhibit multimodal characteristics, including voiceprints, faces, and spatiotemporal information, which complement each other. We therefore propose a light cross-modal attention network (LCMA-Net) for streamer re-ID in live videos. First, the voiceprint, face, and spatiotemporal features of the streamer are extracted by RawNet-SA, Π-Net, and STDA-ResNeXt3D, respectively. We then design a light cross-modal pooling attention (LCMPA) module which, combined with a multilayer perceptron (MLP), aligns and concatenates the different modality features into multimodal features within LCMA-Net. Finally, the streamer is re-identified by measuring the similarity between these multimodal features. Five experiments were conducted on the StreamerReID dataset, and the results demonstrate that the proposed method achieves competitive performance. The dataset and code are available at https://github.com/BJUT-AIVBD/LCMA-Net.
Cited: 0
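At its core, the fusion step projects the voiceprint, face, and spatiotemporal embeddings into a common space, weights them, and concatenates them into one multimodal descriptor that is compared by similarity. The following is a hypothetical sketch of such a fusion; the projection sizes and gating scheme are assumptions, not the published LCMPA module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    def __init__(self, dims=(512, 512, 1024), d=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(dim, d) for dim in dims)   # align modalities
        self.gate = nn.Sequential(nn.Linear(3 * d, 3), nn.Softmax(dim=-1))
        self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, voice, face, spatiotemporal):
        z = [p(x) for p, x in zip(self.proj, (voice, face, spatiotemporal))]
        w = self.gate(torch.cat(z, dim=-1))            # per-modality attention weights
        z = [w[:, i:i + 1] * zi for i, zi in enumerate(z)]
        return F.normalize(self.mlp(torch.cat(z, dim=-1)), dim=-1)

fusion = CrossModalFusion()
a = fusion(torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 1024))
b = fusion(torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 1024))
similarity = (a * b).sum(dim=-1)                       # cosine similarity for re-ID
```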
Specular highlight removal using Quaternion transformer
IF 4.3 · CAS Tier 3 · Computer Science
Computer Vision and Image Understanding · Pub Date: 2024-09-19 · DOI: 10.1016/j.cviu.2024.104179
Abstract: Specular highlight removal is an important problem, because specular reflections in images with illumination changes can severely degrade various computer vision and image processing tasks. Numerous state-of-the-art networks for specular removal use convolutional neural networks (CNNs), which cannot learn global context effectively: they capture spatial information while overlooking the 3D structural correlation information of an RGB image. To address this problem, we introduce a specular highlight removal network based on a Quaternion transformer (QformerSHR), which employs a transformer architecture built on the quaternion representation. In particular, a depth-wise separable quaternion convolutional layer (DSQConv) is proposed to improve the computational performance of QformerSHR while efficiently preserving the structural correlation of an RGB image through the quaternion representation. In addition, a quaternion transformer block (QTB) based on DSQConv learns global context. As a result, QformerSHR, consisting of DSQConv and QTB, removes specular highlights from natural and text image datasets effectively. Experimental results demonstrate that it is significantly more effective than state-of-the-art specular removal networks in terms of both quantitative performance and subjective quality.
Cited: 0
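The quaternion representation treats the RGB channels (plus a real part) as one hypercomplex number per pixel, so a quaternion convolution mixes channels through the Hamilton product rather than through independent real filters. Below is a sketch of that basic building block; the depthwise-separable and transformer parts of QformerSHR are omitted, and the layer layout is an assumption for illustration, not the paper's exact code.

```python
import torch
import torch.nn as nn

class QuaternionConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        assert in_channels % 4 == 0 and out_channels % 4 == 0
        ic, oc = in_channels // 4, out_channels // 4
        # one real convolution per quaternion weight component
        self.wr = nn.Conv2d(ic, oc, kernel_size, padding=padding, bias=False)
        self.wi = nn.Conv2d(ic, oc, kernel_size, padding=padding, bias=False)
        self.wj = nn.Conv2d(ic, oc, kernel_size, padding=padding, bias=False)
        self.wk = nn.Conv2d(ic, oc, kernel_size, padding=padding, bias=False)

    def forward(self, x):
        r, i, j, k = torch.chunk(x, 4, dim=1)   # quaternion components
        # Hamilton product W (x) x, expanded into four real convolutions
        out_r = self.wr(r) - self.wi(i) - self.wj(j) - self.wk(k)
        out_i = self.wr(i) + self.wi(r) + self.wj(k) - self.wk(j)
        out_j = self.wr(j) - self.wi(k) + self.wj(r) + self.wk(i)
        out_k = self.wr(k) + self.wi(j) - self.wj(i) + self.wk(r)
        return torch.cat([out_r, out_i, out_j, out_k], dim=1)

# e.g. an RGB image lifted to a quaternion input (zero real part + R, G, B)
x = torch.cat([torch.zeros(1, 1, 64, 64), torch.rand(1, 3, 64, 64)], dim=1)
y = QuaternionConv2d(4, 16)(x)                  # (1, 16, 64, 64)
```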
Estimating optical flow: A comprehensive review of the state of the art
IF 4.3 · CAS Tier 3 · Computer Science
Computer Vision and Image Understanding · Pub Date: 2024-09-16 · DOI: 10.1016/j.cviu.2024.104160
Abstract: Optical flow estimation is a crucial task in computer vision that provides low-level motion information. Despite recent advances, real-world applications still present significant challenges. This survey provides a comprehensive overview of optical flow techniques and their applications, covering both classical frameworks and the latest AI-based methods. In doing so, we highlight the limitations of current benchmarks and metrics, underscoring the need for more representative datasets and comprehensive evaluation methods. The survey also highlights the importance of integrating industry knowledge and adopting training practices optimized for deep-learning-based models. By addressing these issues, future research can aid the development of robust and efficient optical flow methods that effectively address real-world scenarios.
Cited: 0