{"title":"Continuous conditional video synthesis by neural processes","authors":"Xi Ye, Guillaume-Alexandre Bilodeau","doi":"10.1016/j.cviu.2025.104387","DOIUrl":"10.1016/j.cviu.2025.104387","url":null,"abstract":"<div><div>Different conditional video synthesis tasks, such as frame interpolation and future frame prediction, are typically addressed individually by task-specific models, despite their shared underlying characteristics. Additionally, most conditional video synthesis models are limited to discrete frame generation at specific integer time steps. This paper presents a unified model that tackles both challenges simultaneously. We demonstrate that conditional video synthesis can be formulated as a neural process, where input spatio-temporal coordinates are mapped to target pixel values by conditioning on context spatio-temporal coordinates and pixel values. Our approach leverages a Transformer-based non-autoregressive conditional video synthesis model that takes the implicit neural representation of coordinates and context pixel features as input. Our task-specific models outperform previous methods for future frame prediction and frame interpolation across multiple datasets. Importantly, our model enables temporal continuous video synthesis at arbitrary high frame rates, outperforming the previous state-of-the-art. The source code and video demos for our model are available at <span><span>https://npvp.github.io</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104387"},"PeriodicalIF":4.3,"publicationDate":"2025-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient human–object-interaction (EHOI) detection via interaction label coding and Conditional Decision","authors":"Tsung-Shan Yang , Yun-Cheng Wang , Chengwei Wei , Suya You , C.-C. Jay Kuo","doi":"10.1016/j.cviu.2025.104390","DOIUrl":"10.1016/j.cviu.2025.104390","url":null,"abstract":"<div><div>Human–Object Interaction (HOI) detection is a fundamental task in image understanding. While deep-learning-based HOI methods provide high performance in terms of mean Average Precision (mAP), they are computationally expensive and opaque in training and inference processes. An Efficient HOI (EHOI) detector is proposed in this work to strike a good balance between detection performance, inference complexity, and mathematical transparency. EHOI is a two-stage method. In the first stage, it leverages a frozen object detector to localize the objects and extract various features as intermediate outputs. In the second stage, the first-stage outputs predict the interaction type using the XGBoost classifier. Our contributions include the application of error correction codes (ECCs) to encode rare interaction cases, which reduces the model size and the complexity of the XGBoost classifier in the second stage. Additionally, we provide a mathematical formulation of the relabeling and decision-making process. Apart from the architecture, we present qualitative results to explain the functionalities of the feedforward modules. Experimental results demonstrate the advantages of ECC-coded interaction labels and the excellent balance of detection performance and complexity of the proposed EHOI method. The codes are available: <span><span>https://github.com/keevin60907/EHOI---Efficient-Human-Object-Interaction-Detector</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104390"},"PeriodicalIF":4.3,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144115541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HVQ-VAE: Variational auto-encoder with hyperbolic vector quantization","authors":"Shangyu Chen , Pengfei Fang , Mehrtash Harandi , Trung Le , Jianfei Cai , Dinh Phung","doi":"10.1016/j.cviu.2025.104392","DOIUrl":"10.1016/j.cviu.2025.104392","url":null,"abstract":"<div><div>Vector quantized-variational autoencoder (VQ-VAE) and its variants have made significant progress in creating discrete latent space via learning a codebook. Previous works on VQ-VAE have focused on discrete latent spaces in Euclidean or in spherical spaces. This paper studies the geometric prior of hyperbolic spaces as a way to improve the learning capacity of VQ-VAE. That being said, working with the VQ-VAE in the hyperbolic space is not without difficulties, and the benefits of using hyperbolic space as the geometric prior for the latent space have never been studied in VQ-VAE. We bridge this gap by developing the VQ-VAE with hyperbolic vector quantization. To this end, we propose the hyperbolic VQ-VAE (HVQ-VAE), which learns the latent embedding of data and the codebook in the hyperbolic space. Specifically, we endow the discrete latent space in the Poincaré ball, such that the clustering algorithm can be formulated and optimized in the Poincaré ball. Thorough experiments against various baselines are conducted to evaluate the superiority of the proposed HVQ-VAE empirically. We show that HVQ-VAE enjoys better image reconstruction, effective codebook usage, and fast convergence than baselines. We also present evidence that HVQ-VAE outperforms VQ-VAE in low-dimensional latent space.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104392"},"PeriodicalIF":4.3,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144134661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uniss-MDF: A Multidimensional Face dataset for assessing face analysis on the move","authors":"Pietro Ruiu , Marinella Iole Cadoni , Andrea Lagorio , Seth Nixon , Filippo Casu , Massimo Farina , Mauro Fadda , Giuseppe A. Trunfio , Massimo Tistarelli , Enrico Grosso","doi":"10.1016/j.cviu.2025.104384","DOIUrl":"10.1016/j.cviu.2025.104384","url":null,"abstract":"<div><div>Multidimensional 2D–3D face analysis has demonstrated a strong potential for human identification in several application domains. The combined, synergic use of 2D and 3D data from human faces can counteract typical limitations in 2D face recognition, while improving both accuracy and robustness in identification. On the other hand, current mobile devices, often equipped with depth cameras and high performance computing resources, offer a powerful and practical tool to better investigate new models to jointly process real 2D and 3D face data. However, recent concerns related to privacy of individuals and the collection, storage and processing of personally identifiable biometric information have diminished the availability of public face recognition datasets.</div><div>Uniss-MDF (Uniss-MultiDimensional Face) represents the first collection of combined 2D–3D data of human faces captured with a mobile device. Over 76,000 depth images and videos are captured from over 100 subjects, in both controlled and uncontrolled conditions, over two sessions. The features of Uniss-MDF are extensively compared with existing 2D–3D face datasets. The reported statistics underscore the value of the dataset as a versatile resource for researchers in face recognition on the move and for a wide range of applications. Notably, it is the sole 2D–3D facial dataset using data from a mobile device that includes both 2D and 3D synchronized sequences acquired in controlled and uncontrolled conditions. The Uniss-MDF dataset and the proposed experimental protocols with baseline results provide a new platform to compare processing models for novel research avenues in advanced face analysis on the move.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104384"},"PeriodicalIF":4.3,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-scale feature fusion based SAM for high-quality few-shot medical image segmentation","authors":"Shangwang Liu, Ruonan Xu","doi":"10.1016/j.cviu.2025.104389","DOIUrl":"10.1016/j.cviu.2025.104389","url":null,"abstract":"<div><div>Applying the Segmentation Everything Model (SAM) to the field of medical image segmentation is a great challenge due to the significant differences between natural and medical images. Direct fine-tuning of SAM using medical images requires a large amount of exhaustively annotated medical image data. This paper aims to propose a new method, High-quality Few-shot Segmentation Everything Model (HF-SAM), to address these issues and achieve efficient medical image segmentation. We proposed HF-SAM, which requires only a small number of medical images for model training and does not need precise medical cues for fine-tuning SAM. HF-SAM employs Low-rank adaptive (LoRA) technology to fine-tune SAM by leveraging the lack of large local details in the image embedding of SAM’s mask decoder and the complementarity between high-level global and low-level local features. Additionally, we propose an Adaptive Weighted Feature Fusion Module (AWFFM) and a two-step skip-feature fusion decoding process. The AWFFM integrates low-level local information into high-level global features without suppressing global information, while the two-step skip-feature fusion decoding process enhances SAM’s ability to capture fine-grained information and local details. Experimental results show that HF-SAM achieves Dice scores of 79.50% on the Synapse dataset and 88.68% on the ACDC dataset. These results outperform existing traditional methods, semi-supervised methods, and other SAM variants in few-shot medical image segmentation. By combining low-rank adaptive technology and the adaptive weighted feature fusion module, HF-SAM effectively addresses the adaptability issues of SAM in medical image segmentation and demonstrates excellent segmentation performance with few samples. This method provides a new solution for the field of medical image segmentation and holds significant application value. The code of HF-SAM is available at <span><span>https://github.com/1683194873xrn/HF-SAM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104389"},"PeriodicalIF":4.3,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting few-shot point cloud segmentation with intra-class correlation and iterative prototype fusion","authors":"Xindan Zhang , Ying Li , Xinnian Zhang","doi":"10.1016/j.cviu.2025.104393","DOIUrl":"10.1016/j.cviu.2025.104393","url":null,"abstract":"<div><div>Semantic segmentation of 3D point clouds is often limited by the challenge of obtaining labeled data. Few-shot point cloud segmentation methods, which can learn previously unseen categories, help reduce reliance on labeled datasets. However, existing methods are susceptible to correlation noise and suffer from significant discrepancies between support prototypes and query features. To address these issues, we first introduce an intra-class correlation enhancement module for filtering correlation noise driven by inter-class similarity and intra-class diversity. Second, to better represent the target classes, we propose an iterative prototype fusion module that adapts the query point cloud feature space, mitigating the problem of object variations in the support set and query set. Extensive experiments on S3DIS and ScanNet benchmark datasets demonstrate that our approach achieves competitive performance with state-of-the-art methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104393"},"PeriodicalIF":4.3,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TEMSA:Text enhanced modal representation learning for multimodal sentiment analysis","authors":"Jingwen Chen , Shuxiang Song , Yumei Tan , Haiying Xia","doi":"10.1016/j.cviu.2025.104391","DOIUrl":"10.1016/j.cviu.2025.104391","url":null,"abstract":"<div><div>Multimodal sentiment analysis aims to identify human emotions by leveraging multimodal information, including language, visual, and audio data. Most existing models focus on extracting common features across modalities or simply integrating heterogeneous multimodal data. However, such approaches often overlook the unique representation advantages of individual modalities, as they treat all modalities equally and use bidirectional information transfer mechanisms. This can lead to information redundancy and feature conflicts. To address this challenge, we propose a Text-Enhanced Modal Representation Learning Model (TEMSA), which builds robust and unified multimodal representations through the design of text-guided pairwise cross-modal mapping modules. Specifically, TEMSA employs a text-guided multi-head cross-attention mechanism to embed linguistic information into the emotion-related representation learning of non-linguistic modalities, thereby enhancing the representations of visual and audio modalities. In addition to preserving consistent information through cross-modal mapping, TEMSA also incorporates text-guided reconstruction modules, which leverage text-enhanced non-linguistic modal features to decouple modality-specific representations from non-linguistic modalities. This dual representation learning framework captures inter-modal consistent information through cross-modal mapping, and extracts modal difference information through intra-modal decoupling, thus improving the understanding of cross-modal affective associations. The experimental results on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TEMSA achieves superior performance, highlighting the critical role of text-guided cross-modal and intra-modal representation learning in multimodal sentiment analysis.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104391"},"PeriodicalIF":4.3,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CSPPNet: Cascade space pyramid pooling network for object detection","authors":"Yafeng Liu, Yongsheng Dong","doi":"10.1016/j.cviu.2025.104377","DOIUrl":"10.1016/j.cviu.2025.104377","url":null,"abstract":"<div><div>Real-time object detection, as an important research direction in the field of computer vision, aims to achieve fast and accurate object detection. However, many current methods fail to achieve a balance between speed, parameters, and accuracy. To alleviate this problem, in this paper, we construct a novel cascade spatial pyramid pooling network (CSPPNet) for object detection. In particular, we first propose a cascade feature fusion (CFF) module, which combines the novel cascade cross-layer structure and GSConv convolution to lighten the existing necking structure and improve the detection accuracy of the model without adding a large number of parameters. In addition, in order to alleviate the loss of feature detail information due to max pooling, we further propose the nest space pooling (NSP) module, which combines nest feature fusion with max pooling operations to improve the fusion performance of local feature information with global feature information. Experimental results show that our CSPPNet is competitive, achieving 43.1% AP on the MS-COCO 2017 test-dev dataset.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104377"},"PeriodicalIF":4.3,"publicationDate":"2025-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EUN: Enhanced unlearnable examples generation approach for privacy protection","authors":"Xiaotian Chen , Yang Xu , Sicong Zhang , Jiale Yan , Weida Xu , Xinlong He","doi":"10.1016/j.cviu.2025.104388","DOIUrl":"10.1016/j.cviu.2025.104388","url":null,"abstract":"<div><div>In the era of artificial intelligence, the importance of protecting user privacy has become increasingly prominent. Unlearnable examples prevent deep learning models from learning semantic features in images by adding perturbations or noise that are imperceptible to the human eye. Existing perturbation generation methods are not robust to defense methods or are only robust to one defense method. To address this problem, we propose an enhanced perturbation generation method for unlearnable examples. This method generates the perturbation by performing a class-wise convolution on the image and changing a pixel in the local position of the image. This method is robust to multiple defense methods. In addition, by adjusting the order of global position convolution and local position pixel change of the image, variants of the method were generated and analyzed. We have tested our method on a variety of datasets with a variety of models, and compared with 6 perturbation generation methods. The results demonstrate that the clean test accuracy of the enhanced perturbation generation method for unlearnable examples is still less than 35% when facing defense methods such as image shortcut squeezing, adversarial training, and adversarial augmentation. It outperforms existing perturbation generation methods in many aspects, and is 20% lower than CUDA and OPS, two excellent perturbation generation methods, under several parameter settings.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104388"},"PeriodicalIF":4.3,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AODGCN: Adaptive object detection with attention-guided dynamic graph convolutional network","authors":"Meng Zhang, Yina Guo, Haidong Wang, Hong Shangguan","doi":"10.1016/j.cviu.2025.104386","DOIUrl":"10.1016/j.cviu.2025.104386","url":null,"abstract":"<div><div>Various classifiers based on convolutional neural networks have been successfully applied to image classification in object detection. However, object detection is much more sophisticated and most classifiers used in this context exhibit limitations in capturing contextual information, particularly in scenarios with complex backgrounds or occlusions. Additionally, they lack spatial awareness, resulting in the loss of spatial structure and inadequate modeling of object details and context. In this paper, we propose an adaptive object detection approach using an attention-guided dynamic graph convolutional network (AODGCN). AODGCN represents images as graphs, enabling the capture of object properties such as connectivity, proximity, and hierarchical relationships. Attention mechanisms guide the model to focus on informative regions, highlighting relevant features while suppressing background information. This attention-guided approach enhances the model’s ability to capture discriminative features. Furthermore, the dynamic graph convolutional network (D-GCN) adjusts the receptive field size and weight coefficients based on object characteristics, enabling adaptive detection of objects with varying sizes. The achieved results demonstrate the effectiveness of AODGCN on the MS-COCO 2017 dataset, with a significant improvement of 1.6% in terms of mean average precision (mAP) compared to state-of-the-art algorithms.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104386"},"PeriodicalIF":4.3,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143942239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}