{"title":"Local Gaussian ensemble for arbitrary-scale image super-resolution","authors":"Chuan Chen, Weiwei Wang, Xixi Jia, Xiangchu Feng, Hanjia Wei","doi":"10.1016/j.cviu.2025.104372","DOIUrl":"10.1016/j.cviu.2025.104372","url":null,"abstract":"<div><div>In arbitrary-scale image super-resolution (SR), the local coordinate information is pivotal to enhancing performance through local ensemble. The previous method local implicit image function (LIIF) reconstructs pixels by using multi-layer perceptron (MLP), then refines each pixel by a weighted summation of nearby pixels (also called local ensemble), where the weight depends on the distances between the query pixel and the nearby pixels. Since the distances are fixed, so is the weighting mechanism, limiting the effectiveness of local ensemble. Furthermore, the weighted summation involves repeated reconstructions, increasing the computational cost. Orthogonal position encoding SR (OPE-SR) reduces pixel reconstruction complexity using orthogonal position encoding. However, it still relies on LIIF’s local ensemble method. Additionally, lacking scale information, OPE-SR demonstrates unstable performance across various datasets and scale factors. In this paper, we propose to conduct local ensemble in feature domain, and we present a new ensemble method, the local Gaussian ensemble (LGE), to utilize the local coordinate information more flexibly and efficiently. Specifically, we introduce learnable anisotropic 2D Gaussians for each query coordinate in the SR image, transforming normalized coordinates of nearby features into multiple Gaussian weights to effectively ensemble local features. Then a scale-aware deep MLP is applied only once for pixel reconstruction. Extensive experiments demonstrate that our LGE significantly reduces computational costs during both training and inference while delivering performance comparable to the existing local ensemble method. Moreover, our method consistently outperforms the existing parameter-free approach in terms of efficiency and stability across various benchmark datasets and scale factors.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104372"},"PeriodicalIF":4.3,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143903853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RAFNet: Rotation-aware anchor-free framework for geospatial object detection","authors":"Liwei Deng , Yangyang Tan , Songyu Chen","doi":"10.1016/j.cviu.2025.104373","DOIUrl":"10.1016/j.cviu.2025.104373","url":null,"abstract":"<div><div>Object detection in remote sensing images plays a crucial role in applications such as disaster monitoring, and urban planning. However, detecting small and rotated objects in complex backgrounds remains a significant challenge. Traditional anchor-based methods, which rely on preset anchor boxes with fixed sizes and aspect ratios, face three core limitations: geometric mismatch (difficulty adapting to rotated objects and feature confusion caused by dense anchor boxes), missed detection of small objects (feature loss due to the decoupling between anchor boxes and feature map strides), and parameter sensitivity (requiring complex anchor box combinations for multi-scale targets).</div><div>To address these challenges, this paper proposes an anchor-free detection framework, RAFNet, integrating three key innovations: Mona Swin Transformer as the backbone to enhance feature extraction, Rotated Feature Pyramid Network (Rotated FPN) for rotation-aware feature representation, and Local Importance-based Attention (LIA) mechanism to focus on critical regions and improve object feature representation. Extensive experiments on the DOTA1.0 dataset demonstrate that RAFNet achieves a mean Average Precision (mAP) of 74.91, outperforming baseline models by 3.24%, with significant improvements in challenging categories such as helicopters (+32.5% AP) and roundabouts (+4% AP). The model achieves the mAP of 30.29% on the STAR dataset, validating its high adaptability and robustness in generalization tasks. These results highlight the effectiveness of the proposed method in detecting small, rotated objects in complex scenes. RAFNet offers a more flexible, efficient, and generalizable solution for remote sensing object detection, underscoring the great potential of anchor-free approaches in this field.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104373"},"PeriodicalIF":4.3,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143894527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Classroom teacher behavior analysis: The TBU dataset and performance evaluation","authors":"Ting Cai , Yu Xiong , Chengyang He , Chao Wu , Linqin Cai","doi":"10.1016/j.cviu.2025.104376","DOIUrl":"10.1016/j.cviu.2025.104376","url":null,"abstract":"<div><div>Classroom videos are objective records of teaching behaviors, which provide evidence for teachers’ teaching reflection and evaluation. The intelligent identification, tracking and description of teacher teaching behavior based on classroom videos have become a research hotspot in the field of intelligent education to understand the teaching process of teachers. Although the recent attempts propose several promising directions for the analysis of teaching behavior, the existing public datasets are still insufficient to meet the need for these potential solutions due to lack of varied classroom environment, fine-grained teaching scene behavior data. To address this, we analyzed the influencing factors of teacher behavior and related video datasets, and constructed a diverse, scenario-specific, and multi-task dataset named TBU for Teacher Behavior Understanding. The TBU contains 37,026 high-quality teaching behavior clips, 9422 annotated teaching behavior clips with precise time boundaries, and 6098 teacher teaching behavior description clips annotated with multi-level atomic action labels of fine-grained behavior, spatial location, and interactive objects in four education stages. We performed a comprehensive statistical analysis of TBU and summarized the behavioral characteristics of teachers at different educational stages. Additionally, we systematically investigated representative methods for three video understanding tasks on TBU: behavior recognition, behavior detection, and behavior description, providing a benchmark for the research towards a more comprehensive understanding of teaching video data. Considering the specificity of classroom scenarios and the needs of teaching behavior analysis, we put forward new requirements for the existing baseline methods. We believe that TBU can facilitate in-depth research on classroom teacher teaching video analysis. TBU is available at: <span><span>https://github.com/cai-KU/TBU</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104376"},"PeriodicalIF":4.3,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143885973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Convolutional neural network framework for deepfake detection: A diffusion-based approach","authors":"Emmanuel Pintelas , Ioannis E. Livieris","doi":"10.1016/j.cviu.2025.104375","DOIUrl":"10.1016/j.cviu.2025.104375","url":null,"abstract":"<div><div>In the rapidly advancing domain of synthetic media, DeepFakes emerged as a potent tool for misinformation and manipulation. Nevertheless, the engineering challenge lies in detecting such content to ensure information integrity. Recent artificial intelligence contributions in deepfake detection have mainly concentrated around sophisticated convolutional neural network models, which derive insights from facial biometrics, including multi-attentional and multi-view mechanisms, pairwise/siamese, distillation learning technique and facial-geometry approaches. In this work, we consider a new diffusion-based neural network approach, rather than directly analyzing deepfake images for inconsistencies. Motivated by the considerable property of diffusion procedure of unveiling anomalies, we employ diffusion of the inherent structure of deepfake images, seeking for patterns throughout this process. Specifically, the proposed diffusion network, iteratively adds noise to the input image until it almost becomes pure noise. Subsequently, a convolutional neural network extracts features from the final diffused state, as well as from all transient states of the diffusion process. The comprehensive experimental analysis demonstrates the efficacy and adaptability of the proposed model, validating its robustness against a wide range of deepfake detection models, being a promising artificial intelligence tool for DeepFake detection.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104375"},"PeriodicalIF":4.3,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143885974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Few-shot object detection via synthetic features with optimal transport","authors":"Anh-Khoa Nguyen Vu , Thanh-Toan Do , Vinh-Tiep Nguyen , Tam Le , Minh-Triet Tran , Tam V. Nguyen","doi":"10.1016/j.cviu.2025.104350","DOIUrl":"10.1016/j.cviu.2025.104350","url":null,"abstract":"<div><div>Few-shot object detection aims to simultaneously localize and classify the objects in an image with limited training samples. Most existing few-shot object detection methods focus on extracting the features of a few samples of novel classes, which can lack diversity. Consequently, they may not sufficiently capture the data distribution. To address this limitation, we propose a novel approach that trains a generator to produce synthetic data for novel classes. Still, directly training a generator on the novel class is ineffective due to the scarcity of novel data. To overcome this issue, we leverage the large-scale dataset of base classes by training a generator that captures the data variations of the dataset. Specifically, we train the generator with an optimal transport loss that minimizes the distance between the real and synthetic data distributions, which encourages the generator to capture data variations in base classes. We then transfer the captured variations to novel classes by generating synthetic data with the trained generator. Extensive experiments on benchmark datasets demonstrate that the proposed method outperforms the state of the art.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104350"},"PeriodicalIF":4.3,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143868820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A vector quantized masked autoencoder for audiovisual speech emotion recognition","authors":"Samir Sadok , Simon Leglaive, Renaud Séguier","doi":"10.1016/j.cviu.2025.104362","DOIUrl":"10.1016/j.cviu.2025.104362","url":null,"abstract":"<div><div>An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose the VQ-MAE-AV model, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders that compress raw audio and visual speech data into discrete tokens. The audiovisual speech tokens are used to train a multimodal masked autoencoder that consists of an encoder–decoder architecture with attention mechanisms. The model is designed to extract both local (i.e., at the frame level) and global (i.e., at the sequence level) representations of audiovisual speech. During self-supervised pre-training, the VQ-MAE-AV model is trained on a large-scale unlabeled dataset of audiovisual speech, for the task of reconstructing randomly masked audiovisual speech tokens and with a contrastive learning strategy. During this pre-training, the encoder learns to extract a representation of audiovisual speech that can be subsequently leveraged for emotion recognition. During the supervised fine-tuning stage, a small classification model is trained on top of the VQ-MAE-AV encoder for an emotion recognition task. The proposed approach achieves state-of-the-art emotion recognition results across several datasets in both controlled and in-the-wild conditions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104362"},"PeriodicalIF":4.3,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143852177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A deep reinforcement active learning method for multi-label image classification","authors":"Qing Cai , Ran Tao , Xiufen Fang , Xiurui Xie , Guisong Liu","doi":"10.1016/j.cviu.2025.104351","DOIUrl":"10.1016/j.cviu.2025.104351","url":null,"abstract":"<div><div>Active learning is a widely used method for addressing the high cost of sample labeling in deep learning models and has achieved significant success in recent years. However, most existing active learning methods only focus on single-label image classification and have limited application in the context of multi-label images. To address this issue, we propose a novel, multi-label active learning approach based on a reinforcement learning strategy. The proposed approach introduces a reinforcement active learning framework that accounts for the expected error reduction in multi-label images, making it adaptable to multi-label classification models. Additionally, we develop a multi-label reinforcement active learning module (MLRAL), which employs an actor-critic strategy and proximal policy optimization algorithm (PPO). Our state and reward functions consider multi-label correlations to accurately evaluate the potential impact of unlabeled samples on the current model state. We conduct experiments on various multi-label image classification tasks, including the VOC 2007, MS-COCO, NUS-WIDE and ODIR. We also compare our method with multiple classification models, and experimental results show that our method outperforms existing approaches on various tasks, demonstrating the superiority and effectiveness of the proposed method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104351"},"PeriodicalIF":4.3,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143834222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structure perception and edge refinement network for monocular depth estimation","authors":"Shuangquan Zuo , Yun Xiao , Xuanhong Wang , Hao Lv , Hongwei Chen","doi":"10.1016/j.cviu.2025.104348","DOIUrl":"10.1016/j.cviu.2025.104348","url":null,"abstract":"<div><div>Monocular depth estimation is fundamental for scene understanding and visual downstream tasks. In recent years, with the development of deep learning, increasing complex networks and powerful mechanisms have significantly improved the performance of monocular depth estimation. Nevertheless, predicting dense pixel depths from a single RGB image remains challenging due to the ill-posed issues and inherent ambiguity. Two unresolved issues persist: (1) Depth features are limited in perceiving the scene structure accurately, leading to inaccurate region estimation. (2) Low-level features, which are rich in details, are not fully utilized, causing the missing of details and ambiguous edges. The crux to accurate dense depth restoration is to efficiently handle global scene structure as well as local details. To solve these two issues, we propose the Scene perception and Edge refinement network for Monocular Depth Estimation (SE-MDE). Specifically, we carefully design a depth-enhanced encoder (DEE) to effectively perceive the overall structure of the scene while refining the feature responses of different regions. Meanwhile, we introduce a dense edge-guided network (DENet) that maximizes the utilization of low-level features to enhance the depth of details and edges. Extensive experiments validate the effectiveness of our method, with several experimental results on the NYU v2 indoor dataset and KITTI outdoor dataset demonstrate the state-of-the-art performance of the proposed method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"256 ","pages":"Article 104348"},"PeriodicalIF":4.3,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143815364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning temporal-aware representation for controllable interventional radiology imaging","authors":"Wei Si , Zhaolin Zheng , Zhewei Huang , Xi-Ming Xu , Ruijue Wang , Ji-Gang Bao , Qiang Xiong , Xiantong Zhen , Jun Xu","doi":"10.1016/j.cviu.2025.104360","DOIUrl":"10.1016/j.cviu.2025.104360","url":null,"abstract":"<div><div>Interventional Radiology Imaging (IRI) is essential for evaluating cerebral vascular anatomy by providing sequential images of both arterial and venous blood flow. In IRI, the low frame rate (4 fps) during acquisition can lead to discontinuities and flickering, whereas higher frame rates are associated with increased radiation exposure. Nevertheless, under complex blood flow conditions, it becomes necessary to increase the frame rate to 15 fps for the second sampling. Previous methods relied solely on fixed frame interpolation to mitigate discontinuities and flicker. However, owing to frame rate constraints, they were ineffective in addressing the high radiation issues arising from complex blood flow conditions. In this study, we introduce a novel approach called Temporally Controllable Network (TCNet), which innovatively applies controllable frame interpolation techniques to IRI for the first time. Our method effectively tackles the issues of discontinuity and flickering arising from low frame rates and mitigates the radiation concerns linked to higher frame rates during second sampling. Our method emphasizes synthesizing intermediate frame features via a Temporal-Aware Representation Learning (TARL) module and optimizes this process through bilateral optical flow supervision for accurate optical flow estimation. Additionally, to enhance the depiction of blood vessel motion and breathing nuances, we introduce an implicit function module for refining motion cues in videos. Our experiments reveal that TCNet successfully generate videos at clinically appropriate frame rates, significantly improving the reconstruction of blood flow and respiratory patterns. We will publicly release our code and datasets.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104360"},"PeriodicalIF":4.3,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143834164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extensions in channel and class dimensions for attention-based knowledge distillation","authors":"Liangtai Zhou, Weiwei Zhang, Banghui Zhang, Yufeng Guo, Junhuang Wang, Xiaobin Li, Jianqing Zhu","doi":"10.1016/j.cviu.2025.104359","DOIUrl":"10.1016/j.cviu.2025.104359","url":null,"abstract":"<div><div>As knowledge distillation technology evolves, it has bifurcated into three distinct methodologies: logic-based, feature-based, and attention-based knowledge distillation. Although the principle of attention-based knowledge distillation is more intuitive, its performance lags behind the other two methods. To address this, we systematically analyze the advantages and limitations of traditional attention-based methods. In order to optimize these limitations and explore more effective attention information, we expand attention-based knowledge distillation in the channel and class dimensions, proposing Spatial Attention-based Knowledge Distillation with Channel Attention (SAKD-Channel) and Spatial Attention-based Knowledge Distillation with Class Attention (SAKD-Class). On CIFAR-100, with ResNet8<span><math><mo>×</mo></math></span>4 as the student model, SAKD-Channel improves Top-1 validation accuracy by 1.98%, and SAKD-Class improves it by 3.35% compared to traditional distillation methods. On ImageNet, using ResNet18, these two methods improve Top-1 validation accuracy by 0.55% and 0.17%, respectively, over traditional methods. We also conduct extensive experiments to investigate the working mechanisms and application conditions of channel and class dimensions knowledge distillation, providing new theoretical insights for attention-based knowledge transfer.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104359"},"PeriodicalIF":4.3,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143838732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}