{"title":"MoMa: Skinned motion retargeting using masked pose modeling","authors":"Giulia Martinelli, Nicola Garau, Niccoló Bisagno, Nicola Conci","doi":"10.1016/j.cviu.2024.104141","DOIUrl":"10.1016/j.cviu.2024.104141","url":null,"abstract":"<div><div>Motion retargeting requires to carefully analyze the differences in both skeletal structure and body shape between source and target characters. Existing skeleton-aware and shape-aware approaches can deal with such differences, but they struggle when the source and target characters exhibit significant dissimilarities in both skeleton (like joint count and bone length) and shape (like geometry and mesh properties). In this work we introduce MoMa, a novel approach for skinned motion retargeting which is both skeleton and shape-aware. Our skeleton-aware module learns to retarget animations by recovering the differences between source and target using a custom transformer-based auto-encoder coupled with a spatio-temporal masking strategy. The auto-encoder can transfer the motion between input and target skeletons by reconstructing the masked skeletal differences using shared joints as a reference point. Surpassing the limitations of previous approaches, we can also perform retargeting between skeletons with a varying number of leaf joints. Our shape-aware module incorporates a novel face-based optimizer that adapts skeleton positions to limit collisions between body parts. In contrast to conventional vertex-based methods, our face-based optimizer excels in resolving surface collisions within a body shape, resulting in more accurate retargeted motions. The proposed architecture outperforms the state-of-the-art results on the Mixamo dataset, both quantitatively and qualitatively. Our code is available at: [Github link upon acceptance, see supplementary materials].</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104141"},"PeriodicalIF":4.3,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142327796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transformer fusion for indoor RGB-D semantic segmentation","authors":"Zongwei Wu , Zhuyun Zhou , Guillaume Allibert , Christophe Stolz , Cédric Demonceaux , Chao Ma","doi":"10.1016/j.cviu.2024.104174","DOIUrl":"10.1016/j.cviu.2024.104174","url":null,"abstract":"<div><div>Fusing geometric cues with visual appearance is an imperative theme for RGB-D indoor semantic segmentation. Existing methods commonly adopt convolutional modules to aggregate multi-modal features, paying little attention to explicitly leveraging the long-range dependencies in feature fusion. Therefore, it is challenging for existing methods to accurately segment objects with large-scale variations. In this paper, we propose a novel transformer-based fusion scheme, named TransD-Fusion, to better model contextualized awareness. Specifically, TransD-Fusion consists of a self-refinement module, a calibration scheme with cross-interaction, and a depth-guided fusion. The objective is to first improve modality-specific features with self- and cross-attention, and then explore the geometric cues to better segment objects sharing a similar visual appearance. Additionally, our transformer fusion benefits from a semantic-aware position encoding which spatially constrains the attention to neighboring pixels. Extensive experiments on RGB-D benchmarks demonstrate that the proposed method performs well over the state-of-the-art methods by large margins.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104174"},"PeriodicalIF":4.3,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142327793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable video transformer for full-frame video prediction","authors":"Zhan Li, Feng Liu","doi":"10.1016/j.cviu.2024.104166","DOIUrl":"10.1016/j.cviu.2024.104166","url":null,"abstract":"<div><div>Vision Transformers (ViTs) have shown success in many low-level computer vision tasks. However, existing ViT models are limited by their high computation and memory cost when generating high-resolution videos for tasks like video prediction. This paper presents a scalable video transformer for full-frame video predication. Specifically, we design a backbone transformer block for our video transformer. This transformer block decouples the temporal and channel features to reduce the computation cost when processing large-scale spatial–temporal video features. We use transposed attention to focus on the channel dimension instead of the spatial window to further reduce the computation cost. We also design a Global Shifted Multi-Dconv Head Transposed Attention module (GSMDTA) for our transformer block. This module is built upon two key ideas. First, we design a depth shift module to better incorporate the cross-channel or temporal information from video features. Second, we introduce a global query mechanism to capture global information to handle large motion for video prediction. This new transformer block enables our video transformer to predict a full frame from multiple past frames at the resolution of 1024 × 512 with 12 GB VRAM. Experiments on various video prediction benchmarks demonstrate that our method with only RGB input outperforms state-of-the-art methods that require additional data, like segmentation maps and optical flows. Our method exceeds the state-of-the-art RGB-only methods by a large margin (1.2 dB) in PSNR. Our method is also faster than state-of-the-art video prediction transformers.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104166"},"PeriodicalIF":4.3,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142358233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A lightweight convolutional neural network-based feature extractor for visible images","authors":"Xujie He, Jing Jin, Yu Jiang, Dandan Li","doi":"10.1016/j.cviu.2024.104157","DOIUrl":"10.1016/j.cviu.2024.104157","url":null,"abstract":"<div><p>Feature extraction networks (FENs), as the first stage in many computer vision tasks, play critical roles. Previous studies regarding FENs employed deeper and wider networks to attain higher accuracy, but their approaches were memory-inefficient and computationally intensive. Here, we present an accurate and lightweight feature extractor (RoShuNet) for visible images based on ShuffleNetV2. The provided improvements are threefold. To make ShuffleNetV2 compact without degrading its feature extraction ability, we propose an aggregated dual group convolutional module; to better aid the channel interflow process, we propose a <span><math><mi>γ</mi></math></span>-weighted shuffling module; to further reduce the complexity and size of the model, we introduce slimming strategies. Classification experiments demonstrate the state-of-the-art (SOTA) performance of RoShuNet, which yields an increase in accuracy and reduces the complexity and size of the model compared to those of ShuffleNetV2. Generalization experiments verify that the proposed method is also applicable to feature extraction tasks in semantic segmentation and multiple-object tracking scenarios, achieving comparable accuracy to that of other approaches with more memory and greater computational efficiency. Our method provides a novel perspective for designing lightweight models.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104157"},"PeriodicalIF":4.3,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142240201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LightSOD: Towards lightweight and efficient network for salient object detection","authors":"Ngo-Thien Thu , Hoang Ngoc Tran , Md. Delowar Hossain , Eui-Nam Huh","doi":"10.1016/j.cviu.2024.104148","DOIUrl":"10.1016/j.cviu.2024.104148","url":null,"abstract":"<div><p>The recent emphasis has been on achieving rapid and precise detection of salient objects, which presents a challenge for resource-constrained edge devices because the current models are too computationally demanding for deployment. Some recent research has prioritized inference speed over accuracy to address this issue. In response to the inherent trade-off between accuracy and efficiency, we introduce an innovative framework called LightSOD, with the primary objective of achieving a balance between precision and computational efficiency. LightSOD comprises several vital components, including the spatial-frequency boundary refinement module (SFBR), which utilizes wavelet transform to restore spatial loss information and capture edge features from the spatial-frequency domain. Additionally, we introduce a cross-pyramid enhancement module (CPE), which utilizes adaptive kernels to capture multi-scale group-wise features in deep layers. Besides, we introduce a group-wise semantic enhancement module (GSRM) to boost global semantic features in the topmost layer. Finally, we introduce a cross-aggregation module (CAM) to incorporate channel-wise features across layers, followed by a triple features fusion (TFF) that aggregates features from coarse to fine levels. By conducting experiments on five datasets and utilizing various backbones, we have demonstrated that LSOD achieves competitive performance compared with heavyweight cutting-edge models while significantly reducing computational complexity.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104148"},"PeriodicalIF":4.3,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224002297/pdfft?md5=b9d62426fc2e76aa1cbe833773c6cfaa&pid=1-s2.0-S1077314224002297-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142271135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Action-conditioned contrastive learning for 3D human pose and shape estimation in videos","authors":"Inpyo Song , Moonwook Ryu , Jangwon Lee","doi":"10.1016/j.cviu.2024.104149","DOIUrl":"10.1016/j.cviu.2024.104149","url":null,"abstract":"<div><div>The aim of this research is to estimate 3D human pose and shape in videos, which is a challenging task due to the complex nature of the human body and the wide range of possible pose and shape variations. This problem also poses difficulty in finding a satisfactory solution due to the trade-off between the accuracy and temporal consistency of the estimated 3D pose and shape. Thus previous researches have prioritized one objective over the other. In contrast, we propose a novel approach called the action-conditioned mesh recovery (ACMR) model, which improves accuracy without compromising temporal consistency by leveraging human action information. Our ACMR model outperforms existing methods that prioritize temporal consistency in terms of accuracy, while also achieving comparable temporal consistency with other state-of-the-art methods. Significantly, the action-conditioned learning process occurs only during training, requiring no additional resources at inference time, thereby enhancing performance without increasing computational demands.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104149"},"PeriodicalIF":4.3,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142358234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Triple-Stream Commonsense Circulation Transformer Network for Image Captioning","authors":"Jianchao Li, Wei Zhou, Kai Wang, Haifeng Hu","doi":"10.1016/j.cviu.2024.104165","DOIUrl":"10.1016/j.cviu.2024.104165","url":null,"abstract":"<div><p>Traditional image captioning methods only have a local perspective at the dataset level, allowing them to explore dispersed information within individual images. However, the lack of a global perspective prevents them from capturing common characteristics among similar images. To address the limitation, this paper introduces a novel <strong>T</strong>riple-stream <strong>C</strong>ommonsense <strong>C</strong>irculating <strong>T</strong>ransformer <strong>N</strong>etwork (TCCTN). It incorporates contextual stream into the encoder, combining enhanced channel stream and spatial stream for comprehensive feature learning. The proposed commonsense-aware contextual attention (CCA) module queries commonsense contextual features from the dataset, obtaining global contextual association information by projecting grid features into the contextual space. The pure semantic channel attention (PSCA) module leverages compressed spatial domain for channel pooling, focusing on attention weights of pure channel features to capture inherent semantic features. The region spatial attention (RSA) module enhances spatial concepts in semantic learning by incorporating region position information. Furthermore, leveraging the complementary differences among the three features, TCCTN introduces the mixture of experts strategy to enhance the unique discriminative ability of features and promote their integration in textual feature learning. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of contextual commonsense stream and the superior performance of TCCTN.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104165"},"PeriodicalIF":4.3,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142271136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Delving into CLIP latent space for Video Anomaly Recognition","authors":"Luca Zanella , Benedetta Liberatori , Willi Menapace , Fabio Poiesi , Yiming Wang , Elisa Ricci","doi":"10.1016/j.cviu.2024.104163","DOIUrl":"10.1016/j.cviu.2024.104163","url":null,"abstract":"<div><div>We tackle the complex problem of detecting and recognising anomalies in surveillance videos at the frame level, utilising only video-level supervision. We introduce the novel method <span><math><mrow><mi>A</mi><mi>n</mi><mi>o</mi><mi>m</mi><mi>a</mi><mi>l</mi><mi>y</mi><mi>C</mi><mi>L</mi><mi>I</mi><mi>P</mi></mrow></math></span>, the first to combine Vision and Language Models (VLMs), such as CLIP, with multiple instance learning for joint video anomaly detection and classification. Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace, which in turn allows us to effectively learn text-driven directions for abnormal events. When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class. We also leverage a computationally efficient Transformer architecture to model short- and long-term temporal dependencies between frames, ultimately producing the final anomaly score and class prediction probabilities. We compare <span><math><mrow><mi>A</mi><mi>n</mi><mi>o</mi><mi>m</mi><mi>a</mi><mi>l</mi><mi>y</mi><mi>C</mi><mi>L</mi><mi>I</mi><mi>P</mi></mrow></math></span> against state-of-the-art methods considering three major anomaly detection benchmarks, <em>i.e.</em> ShanghaiTech, UCF-Crime, and XD-Violence, and empirically show that it outperforms baselines in recognising video anomalies. Project website and code are available at <span><span>https://lucazanella.github.io/AnomalyCLIP/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104163"},"PeriodicalIF":4.3,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142327794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A convex Kullback–Leibler optimization for semi-supervised few-shot learning","authors":"Yukun Liu , Zhaohui Luo , Daming Shi","doi":"10.1016/j.cviu.2024.104152","DOIUrl":"10.1016/j.cviu.2024.104152","url":null,"abstract":"<div><p>Few-shot learning has achieved great success in many fields, thanks to its requirement of limited number of labeled data. However, most of the state-of-the-art techniques of few-shot learning employ transfer learning, which still requires massive labeled data to train a meta-learning system. To simulate the human learning mechanism, a deep model of few-shot learning is proposed to learn from one, or a few examples. First of all in this paper, we analyze and note that the problem with representative semi-supervised few-shot learning methods is getting stuck in local optimization and the negligence of intra-class compactness problem. To address these issue, we propose a novel semi-supervised few-shot learning method with Convex Kullback–Leibler, hereafter referred to as CKL, in which KL divergence is employed to achieve global optimum solution by optimizing a strictly convex functions to perform clustering; whereas sample selection strategy is employed to achieve intra-class compactness. In training, the CKL is optimized iteratively via deep learning and expectation–maximization algorithm. Intensive experiments have been conducted on three popular benchmark data sets, take miniImagenet data set for example, our proposed CKL achieved 76.83% and 85.78% under 5-way 1-shot and 5-way 5-shot, the experimental results show that this method significantly improves the classification ability of few-shot learning tasks and obtains the start-of-the-art performance.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104152"},"PeriodicalIF":4.3,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142271746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAFNet: Context aligned fusion for depth completion","authors":"Zhichao Fu, Anran Wu, Shuwen Yang, Tianlong Ma, Liang He","doi":"10.1016/j.cviu.2024.104158","DOIUrl":"10.1016/j.cviu.2024.104158","url":null,"abstract":"<div><p>Depth completion aims at reconstructing a dense depth from sparse depth input, frequently using color images as guidance. The sparse depth map lacks sufficient contexts for reconstructing focal contexts such as the shape of objects. The RGB images contain redundant contexts including details useless for reconstruction, which reduces the efficiency of focal context extraction. The unaligned contextual information from these two modalities poses a challenge to focal context extraction and further fusion, as well as the accuracy of depth completion. To optimize the utilization of multimodal contextual information, we explore a novel framework: Context Aligned Fusion Network (CAFNet). CAFNet comprises two stages: the context-aligned stage and the full-scale stage. In the context-aligned stage, CAFNet downsamples input RGB-D pairs to the scale, at which multimodal contextual information is adequately aligned for feature extraction in two encoders and fusion in CF modules. In the full-scale stage, feature maps with fused multimodal context from the previous stage are upsampled to the original scale and subsequentially fused with full-scale depth features by the GF module utilizing a dynamic masked fusion strategy. Ultimately, accurate dense depth maps are reconstructed, leveraging the GF module’s resultant features. Experiments conducted on indoor and outdoor benchmark datasets show that the CAFNet produces results comparable to state-of-the-art methods while effectively reducing computational costs.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104158"},"PeriodicalIF":4.3,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142240202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}