Modality mixer exploiting complementary information for multi-modal action recognition
Sumin Lee, Sangmin Woo, Muhammad Adi Nugroho, Changick Kim
Computer Vision and Image Understanding, vol. 256, Article 104358 (published 2025-04-03). DOI: 10.1016/j.cviu.2025.104358
Abstract: Due to the distinctive characteristics of sensors, each modality exhibits unique physical properties. For this reason, in the context of multi-modal action recognition, it is important to consider not only the overall action content but also the complementary nature of different modalities. In this paper, we propose a novel network, named Modality Mixer (M-Mixer) network, which effectively leverages and incorporates complementary information across modalities together with the temporal context of actions. A key component of M-Mixer is the Multi-modal Contextualization Unit (MCU), a simple yet effective recurrent unit. The MCU temporally encodes a sequence of one modality (e.g., RGB) with action content features of the other modalities (e.g., depth and infrared). This process encourages the M-Mixer network to exploit global action content and to supplement complementary information from the other modalities. Furthermore, to extract appropriate complementary information with respect to the given modality setting, we introduce a new module, named Complementary Feature Extraction Module (CFEM). CFEM incorporates separate learnable query embeddings for each modality, which guide CFEM to extract complementary information and global action content from the other modalities. As a result, our proposed method outperforms state-of-the-art methods on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets. Moreover, comprehensive ablation studies further validate the effectiveness of the proposed method.
Local Consistency Guidance: Personalized Stylization Method of Face Video
Wancheng Feng, Yingchao Liu, Jiaming Pei, Guangliang Cheng, Lukun Wang
Computer Vision and Image Understanding, vol. 257, Article 104339 (published 2025-04-02). DOI: 10.1016/j.cviu.2025.104339
Abstract: Face video stylization aims to transform real face videos into specific reference styles. Although image stylization has achieved remarkable results, maintaining temporal continuity and accurately preserving the original facial expressions in video stylization remain significant challenges. This work introduces a novel approach for face video stylization that ensures consistent quality across the entire video by leveraging local consistency. Specifically, the framework builds upon existing diffusion models and employs local consistency as a guiding principle. It integrates a Local-Cross Attention (LCA) module to maintain style consistency between frames and a Local Style Transfer (LST) module to ensure seamless video continuity. Comparative experiments were conducted, with qualitative and quantitative analyses based on frame consistency, SSIM, FID, LPIPS, user studies, and flow similarity, along with an ablation study. The results demonstrate that the proposed Local Consistency Guidance (LCG) method effectively achieves continuous video stylization and reaches state-of-the-art results in the field. Further information is available on the project homepage at https://lcgfacevideostylization.github.io/github.io/.
Uncertainty estimation using boundary prediction for medical image super-resolution
Samiran Dey, Partha Basuchowdhuri, Debasis Mitra, Robin Augustine, Sanjoy Kumar Saha, Tapabrata Chakraborti
Computer Vision and Image Understanding, vol. 256, Article 104349 (published 2025-03-22). DOI: 10.1016/j.cviu.2025.104349
Abstract: Medical image super-resolution can be performed by several deep learning frameworks. However, as the safety of each patient is of primary concern, models with a high degree of population-level accuracy are not enough. Instead of a one-size-fits-all approach, the reliability and trustworthiness of such models must be measured from the point of view of personalized healthcare and precision medicine. Hence, in this paper, we propose a novel approach that uses residual image prediction to predict the range of super-resolved (SR) images that any generative super-resolution model may yield for a given low-resolution (LR) image. Providing multiple images within the suggested lower and upper bounds increases the probability of finding an exact match to the high-resolution (HR) image. To further compare models and provide reliability scores, we estimate the coverage and uncertainty of the models and check whether coverage can be improved at the cost of increased uncertainty. Experimental results on lung CT scans from the LIDC-IDRI and Radiopedia COVID-19 CT Images Segmentation datasets show that our models, BliMSR and MoMSGAN, provide the best HR and SR coverage at different levels of residual attention with comparatively lower uncertainty. We believe our model-agnostic approach to uncertainty estimation for generative medical imaging is the first of its kind and would help clinicians decide on the trustworthiness of any super-resolution model in a generalized manner, while providing alternate SR images with enhanced details for better diagnosis of each individual patient.
Adversarial Style Mixup and Improved Temporal Alignment for Cross-Domain Few-Shot Action Recognition
Kaiyan Cao, Jiawen Peng, Jiaxin Chen, Xinyuan Hou, Andy J. Ma
Computer Vision and Image Understanding, vol. 255, Article 104341 (published 2025-03-19). DOI: 10.1016/j.cviu.2025.104341
Abstract: Cross-Domain Few-Shot Action Recognition (CDFSAR) aims at transferring knowledge from base classes to novel ones with limited labeled data, under distribution shift between base (source domain) and novel (target domain) classes. This paper addresses two issues in existing methods: insufficient style coverage of the target domain and potential temporal misalignment with chronological order. To mitigate distribution shift across domains, we propose an Adversarial Style Mixup (ASM) module, which enriches the diversity of style distributions covering the target domain. ASM mixes up source- and target-domain styles through statistical means and variances, with an adversarially learned mixup ratio and style noise. In addition, we design an Improved Temporal Alignment (ITA) module to address temporal misalignment between videos. In ITA, keyframes are extracted as priors for better temporal alignment, with a temporal mixer to reduce misalignment noise. Extensive experiments on video action recognition datasets demonstrate the superiority of our method compared with the state of the art on the challenging CDFSAR problem. Ablation studies validate that both the proposed ASM and ITA modules contribute to the performance improvement, through style distribution expansion and keyframe-based temporal alignment respectively.
{"title":"Syntactically and semantically enhanced captioning network via hybrid attention and POS tagging prompt","authors":"Deepali Verma, Tanima Dutta","doi":"10.1016/j.cviu.2025.104340","DOIUrl":"10.1016/j.cviu.2025.104340","url":null,"abstract":"<div><div>Video captioning has become a thriving research area, with current methods relying on static visuals or motion information. However, videos contain a complex interplay between multiple objects with unique temporal patterns. Traditional techniques struggle to capture this intricate connection, leading to inaccurate captions due to the gap between video features and generated text. Analyzing these temporal variations and identifying relevant objects remains a challenge. This paper proposes SySCapNet, a novel deep-learning architecture for video captioning, designed to address this limitation. SySCapNet effectively captures objects involved in motions and extracts spatio-temporal action features. This information, along with visual features and motion data, guides the caption generation process. We introduce a groundbreaking hybrid attention module that leverages both visual saliency and spatio-temporal dynamics to extract highly detailed and semantically meaningful features. Furthermore, we incorporate part-of-speech tagging to guide the network in disambiguating words and understanding their grammatical roles. Extensive evaluations on benchmark datasets demonstrate that SySCapNet achieves superior performance compared to existing methods. The generated captions are not only informative but also grammatically correct and rich in context, surpassing the limitations of basic AI descriptions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"255 ","pages":"Article 104340"},"PeriodicalIF":4.3,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143643751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hexagonal mesh-based neural rendering for real-time rendering and fast reconstruction","authors":"Yisu Zhang, Jianke Zhu, Lixiang Lin","doi":"10.1016/j.cviu.2025.104335","DOIUrl":"10.1016/j.cviu.2025.104335","url":null,"abstract":"<div><div>Although recent neural rendering-based methods can achieve high-quality geometry and realistic rendering results in multi-view reconstruction, they incur a heavy computational burden on rendering and training, which limits their application scenarios. To address these challenges, we propose an effective mesh-based neural rendering approach which leverages the characteristic of meshes being able to achieve real-time rendering. Besides, a coarse-to-fine scheme is introduced to efficiently extract the initial mesh so as to significantly reduce the reconstruction time. More importantly, we suggest a hexagonal mesh model to preserve surface regularity by constraining the second-order derivatives of its vertices, where only low level of positional encoding is engaged for neural rendering. Experiments show that our approach significantly reduces the rendering time from several tens of seconds to 0.05s compared to methods based on implicit representation. And it can quickly achieve state-of-the-art results in novel view synthesis and reconstruction. Our full implementation will be made publicly available at <span><span>https://github.com/FuchengSu/FastMesh</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"255 ","pages":"Article 104335"},"PeriodicalIF":4.3,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143619361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FrTrGAN: Single image dehazing using the frequency component of transmission maps in the generative adversarial network","authors":"Pulkit Dwivedi , Soumendu Chakraborty","doi":"10.1016/j.cviu.2025.104336","DOIUrl":"10.1016/j.cviu.2025.104336","url":null,"abstract":"<div><div>Hazy images, particularly in outdoor scenes, have reduced visibility due to atmospheric particles, making image dehazing a critical task for enhancing visual clarity. The main challenges in image dehazing involve accurately detecting and removing haze while preserving fine details and maintaining overall image quality. Many existing dehazing methods struggle with varying haze conditions, often compromising the structural and perceptual integrity of the restored images. In this paper, we introduce FrTrGAN, a framework for single-image dehazing that leverages the frequency components of transmission maps. This novel framework addresses these challenges by integrating the Fourier Transform within an enhanced CycleGAN architecture. Unlike traditional spatial-domain dehazing methods, FrTrGAN operates in the frequency domain, where it isolates low-frequency haze components – responsible for blurring fine details – and removes them more precisely. The Inverse Fourier Transform is then applied to map the refined data back to the spatial domain, ensuring that the resulting images maintain clarity, sharpness, and structural integrity. We evaluate our method on multiple datasets, including HSTS, SOTS Outdoor, O-Haze, I-Haze, D-Hazy, BeDDE and Dense-Haze using PSNR and SSIM for quantitative performance assessment. Additionally, we include results based on non-referential metrics such as FADE, SSEQ, BRISQUE and NIQE to further evaluate the perceptual quality of the dehazed images. The results demonstrate that FrTrGAN significantly outperforms existing methods while effectively restoring both frequency components and perceptual image quality. This comprehensive evaluation highlights the robustness of FrTrGAN in diverse haze conditions and underscores the effectiveness of a frequency-domain approach to image dehazing, laying the groundwork for future advancements in the field.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"255 ","pages":"Article 104336"},"PeriodicalIF":4.3,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143577983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic Anchor: Density Map Guided Small Object Detector for Tiny Persons
Xingzhou Xu, Zhaoyong Mao, Xin Wang, Qinhao Tu, Junge Shen
Computer Vision and Image Understanding, vol. 255, Article 104325 (published 2025-03-05). DOI: 10.1016/j.cviu.2025.104325
Abstract: With the use of aerial and space-based equipment such as drones in search and rescue, there is an increasing demand for the detection of small and even tiny human targets. However, most existing detectors rely on generating smaller and denser anchors for small target detection, which introduces a large number of redundant negative anchor samples. To alleviate this issue, we propose a novel density-map-guided tiny person detector with dynamic anchors. Specifically, we design an Anchor Proposals Mask (APM) module that effectively eliminates negative anchor samples and adaptively adjusts the anchor distribution under the guidance of density maps produced by a Density Map Generator (DMG). To improve the quality of the density map, we develop a Multi-Scale Feature Distillation (MSFD) module and incorporate the Focal Inverse Distance Transform (FIDT) map to conduct knowledge distillation for the DMG with the assistance of a crowd counting network. Extensive experiments on the TinyPerson and VisDrone datasets demonstrate that our method significantly enhances the performance of two-stage detectors in terms of average precision (AP) and average recall (AR) while effectively reducing the impact of negative anchor boxes.
Joint image-instance spatial–temporal attention for few-shot action recognition
Zefeng Qian, Chongyang Zhang, Yifei Huang, Gang Wang, Jiangyong Ying
Computer Vision and Image Understanding, vol. 254, Article 104322 (published 2025-03-01). DOI: 10.1016/j.cviu.2025.104322
Abstract: Few-shot Action Recognition (FSAR) constitutes a crucial challenge in computer vision, entailing the recognition of actions from a limited set of examples. Recent approaches mainly focus on employing image-level features to construct temporal dependencies and generate prototypes for each action category. However, many of these methods rely on image-level features that incorporate background noise and pay insufficient attention to the real foreground (action-related instances), thereby compromising recognition capability, particularly in the few-shot scenario. To tackle this issue, we propose a novel joint Image-Instance level Spatial–temporal attention approach (I²ST) for few-shot action recognition. The core concept of I²ST is to perceive action-related instances and integrate them with image features via spatial–temporal attention. Specifically, I²ST consists of two key components: Action-related Instance Perception and Joint Image-Instance Spatial–temporal Attention. Given the basic representations from the feature extractor, Action-related Instance Perception identifies action-related instances under the guidance of a text-guided segmentation model. Subsequently, Joint Image-Instance Spatial–temporal Attention constructs the feature dependency between instances and images. To enhance the prototype representations of different categories of videos, a pair of spatial–temporal attention sub-modules combines image features and instance embeddings across both temporal and spatial dimensions, and a global fusion sub-module aggregates global contextual information, forming robust action video prototypes. Finally, based on the video prototypes, Global–Local Prototype Matching is performed for reliable few-shot video matching. In this manner, I²ST effectively exploits foreground instance-level cues and models more accurate spatial–temporal relationships for complex few-shot video recognition scenarios. Extensive experiments on standard few-shot benchmarks demonstrate that the proposed framework outperforms existing methods and achieves state-of-the-art performance under various few-shot settings.
Establishing a unified evaluation framework for human motion generation: A comparative analysis of metrics
Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, Germain Forestier
Computer Vision and Image Understanding, vol. 254, Article 104337 (published 2025-03-01). DOI: 10.1016/j.cviu.2025.104337
Abstract: The development of generative artificial intelligence for human motion generation has expanded rapidly, necessitating a unified evaluation framework. This paper presents a detailed review of eight evaluation metrics for human motion generation, highlighting their unique features and shortcomings. We propose standardized practices through a unified evaluation setup to facilitate consistent model comparisons. Additionally, we introduce a novel metric that assesses diversity in temporal distortion by analyzing warping diversity, thereby enhancing the evaluation of temporal data. We also conduct experimental analyses of three generative models using two publicly available datasets, offering insights into the interpretation of each metric in specific case scenarios. Our goal is to offer a clear, user-friendly evaluation framework for newcomers, complemented by publicly accessible code: https://github.com/MSD-IRIMAS/Evaluating-HMG.