Title: Classroom teacher behavior analysis: The TBU dataset and performance evaluation
Authors: Ting Cai, Yu Xiong, Chengyang He, Chao Wu, Linqin Cai
Journal: Computer Vision and Image Understanding, Vol. 257, Article 104376 (published 2025-04-28)
DOI: 10.1016/j.cviu.2025.104376
Abstract: Classroom videos are objective records of teaching behavior and provide evidence for teachers' reflection and evaluation. The intelligent identification, tracking, and description of teacher behavior from classroom videos have become a research hotspot in intelligent education for understanding the teaching process. Although recent work has proposed several promising directions for teaching behavior analysis, existing public datasets remain insufficient for these potential solutions because they lack varied classroom environments and fine-grained behavior data for specific teaching scenes. To address this, we analyzed the factors that influence teacher behavior and the related video datasets, and constructed a diverse, scenario-specific, multi-task dataset named TBU for Teacher Behavior Understanding. TBU contains 37,026 high-quality teaching behavior clips, 9,422 behavior clips annotated with precise temporal boundaries, and 6,098 behavior description clips annotated with multi-level atomic action labels covering fine-grained behavior, spatial location, and interactive objects across four education stages. We performed a comprehensive statistical analysis of TBU and summarized the behavioral characteristics of teachers at different educational stages. Additionally, we systematically investigated representative methods for three video understanding tasks on TBU: behavior recognition, behavior detection, and behavior description, providing a benchmark for research toward a more comprehensive understanding of teaching video data. Considering the specificity of classroom scenarios and the needs of teaching behavior analysis, we also put forward new requirements for the existing baseline methods. We believe that TBU can facilitate in-depth research on classroom teaching video analysis. TBU is available at: https://github.com/cai-KU/TBU.
{"title":"Convolutional neural network framework for deepfake detection: A diffusion-based approach","authors":"Emmanuel Pintelas , Ioannis E. Livieris","doi":"10.1016/j.cviu.2025.104375","DOIUrl":"10.1016/j.cviu.2025.104375","url":null,"abstract":"<div><div>In the rapidly advancing domain of synthetic media, DeepFakes emerged as a potent tool for misinformation and manipulation. Nevertheless, the engineering challenge lies in detecting such content to ensure information integrity. Recent artificial intelligence contributions in deepfake detection have mainly concentrated around sophisticated convolutional neural network models, which derive insights from facial biometrics, including multi-attentional and multi-view mechanisms, pairwise/siamese, distillation learning technique and facial-geometry approaches. In this work, we consider a new diffusion-based neural network approach, rather than directly analyzing deepfake images for inconsistencies. Motivated by the considerable property of diffusion procedure of unveiling anomalies, we employ diffusion of the inherent structure of deepfake images, seeking for patterns throughout this process. Specifically, the proposed diffusion network, iteratively adds noise to the input image until it almost becomes pure noise. Subsequently, a convolutional neural network extracts features from the final diffused state, as well as from all transient states of the diffusion process. The comprehensive experimental analysis demonstrates the efficacy and adaptability of the proposed model, validating its robustness against a wide range of deepfake detection models, being a promising artificial intelligence tool for DeepFake detection.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104375"},"PeriodicalIF":4.3,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143885974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Few-shot object detection via synthetic features with optimal transport
Authors: Anh-Khoa Nguyen Vu, Thanh-Toan Do, Vinh-Tiep Nguyen, Tam Le, Minh-Triet Tran, Tam V. Nguyen
Journal: Computer Vision and Image Understanding, Vol. 257, Article 104350 (published 2025-04-23)
DOI: 10.1016/j.cviu.2025.104350
Abstract: Few-shot object detection aims to simultaneously localize and classify the objects in an image with limited training samples. Most existing few-shot object detection methods focus on extracting the features of a few samples of novel classes, which can lack diversity. Consequently, they may not sufficiently capture the data distribution. To address this limitation, we propose a novel approach that trains a generator to produce synthetic data for novel classes. Still, directly training a generator on the novel classes is ineffective due to the scarcity of novel data. To overcome this issue, we leverage the large-scale dataset of base classes by training a generator that captures the data variations of the dataset. Specifically, we train the generator with an optimal transport loss that minimizes the distance between the real and synthetic data distributions, which encourages the generator to capture data variations in base classes. We then transfer the captured variations to novel classes by generating synthetic data with the trained generator. Extensive experiments on benchmark datasets demonstrate that the proposed method outperforms the state of the art.
{"title":"A vector quantized masked autoencoder for audiovisual speech emotion recognition","authors":"Samir Sadok , Simon Leglaive, Renaud Séguier","doi":"10.1016/j.cviu.2025.104362","DOIUrl":"10.1016/j.cviu.2025.104362","url":null,"abstract":"<div><div>An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose the VQ-MAE-AV model, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders that compress raw audio and visual speech data into discrete tokens. The audiovisual speech tokens are used to train a multimodal masked autoencoder that consists of an encoder–decoder architecture with attention mechanisms. The model is designed to extract both local (i.e., at the frame level) and global (i.e., at the sequence level) representations of audiovisual speech. During self-supervised pre-training, the VQ-MAE-AV model is trained on a large-scale unlabeled dataset of audiovisual speech, for the task of reconstructing randomly masked audiovisual speech tokens and with a contrastive learning strategy. During this pre-training, the encoder learns to extract a representation of audiovisual speech that can be subsequently leveraged for emotion recognition. During the supervised fine-tuning stage, a small classification model is trained on top of the VQ-MAE-AV encoder for an emotion recognition task. The proposed approach achieves state-of-the-art emotion recognition results across several datasets in both controlled and in-the-wild conditions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104362"},"PeriodicalIF":4.3,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143852177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: A deep reinforcement active learning method for multi-label image classification
Authors: Qing Cai, Ran Tao, Xiufen Fang, Xiurui Xie, Guisong Liu
Journal: Computer Vision and Image Understanding, Vol. 257, Article 104351 (published 2025-04-11)
DOI: 10.1016/j.cviu.2025.104351
Abstract: Active learning is a widely used approach for reducing the high cost of sample labeling in deep learning and has achieved significant success in recent years. However, most existing active learning methods focus only on single-label image classification and have limited applicability to multi-label images. To address this issue, we propose a novel multi-label active learning approach based on a reinforcement learning strategy. The proposed approach introduces a reinforcement active learning framework that accounts for the expected error reduction in multi-label images, making it adaptable to multi-label classification models. Additionally, we develop a multi-label reinforcement active learning module (MLRAL), which employs an actor-critic strategy and the proximal policy optimization (PPO) algorithm. Our state and reward functions consider multi-label correlations to accurately evaluate the potential impact of unlabeled samples on the current model state. We conduct experiments on various multi-label image classification tasks, including VOC 2007, MS-COCO, NUS-WIDE, and ODIR, and compare our method with multiple classification models. Experimental results show that our method outperforms existing approaches across these tasks, demonstrating its superiority and effectiveness.
{"title":"Structure perception and edge refinement network for monocular depth estimation","authors":"Shuangquan Zuo , Yun Xiao , Xuanhong Wang , Hao Lv , Hongwei Chen","doi":"10.1016/j.cviu.2025.104348","DOIUrl":"10.1016/j.cviu.2025.104348","url":null,"abstract":"<div><div>Monocular depth estimation is fundamental for scene understanding and visual downstream tasks. In recent years, with the development of deep learning, increasing complex networks and powerful mechanisms have significantly improved the performance of monocular depth estimation. Nevertheless, predicting dense pixel depths from a single RGB image remains challenging due to the ill-posed issues and inherent ambiguity. Two unresolved issues persist: (1) Depth features are limited in perceiving the scene structure accurately, leading to inaccurate region estimation. (2) Low-level features, which are rich in details, are not fully utilized, causing the missing of details and ambiguous edges. The crux to accurate dense depth restoration is to efficiently handle global scene structure as well as local details. To solve these two issues, we propose the Scene perception and Edge refinement network for Monocular Depth Estimation (SE-MDE). Specifically, we carefully design a depth-enhanced encoder (DEE) to effectively perceive the overall structure of the scene while refining the feature responses of different regions. Meanwhile, we introduce a dense edge-guided network (DENet) that maximizes the utilization of low-level features to enhance the depth of details and edges. Extensive experiments validate the effectiveness of our method, with several experimental results on the NYU v2 indoor dataset and KITTI outdoor dataset demonstrate the state-of-the-art performance of the proposed method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"256 ","pages":"Article 104348"},"PeriodicalIF":4.3,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143815364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Learning temporal-aware representation for controllable interventional radiology imaging
Authors: Wei Si, Zhaolin Zheng, Zhewei Huang, Xi-Ming Xu, Ruijue Wang, Ji-Gang Bao, Qiang Xiong, Xiantong Zhen, Jun Xu
Journal: Computer Vision and Image Understanding, Vol. 257, Article 104360 (published 2025-04-09)
DOI: 10.1016/j.cviu.2025.104360
Abstract: Interventional Radiology Imaging (IRI) is essential for evaluating cerebral vascular anatomy by providing sequential images of both arterial and venous blood flow. In IRI, the low frame rate (4 fps) during acquisition can lead to discontinuities and flickering, whereas higher frame rates are associated with increased radiation exposure. Nevertheless, under complex blood flow conditions, it becomes necessary to increase the frame rate to 15 fps for the second sampling. Previous methods relied solely on fixed frame interpolation to mitigate discontinuities and flicker. However, owing to frame rate constraints, they were ineffective in addressing the high radiation issues arising from complex blood flow conditions. In this study, we introduce a novel approach called the Temporally Controllable Network (TCNet), which applies controllable frame interpolation techniques to IRI for the first time. Our method effectively tackles the discontinuity and flickering caused by low frame rates and mitigates the radiation concerns linked to higher frame rates during the second sampling. It synthesizes intermediate frame features via a Temporal-Aware Representation Learning (TARL) module and optimizes this process through bilateral optical flow supervision for accurate optical flow estimation. Additionally, to better depict blood vessel motion and breathing nuances, we introduce an implicit function module for refining motion cues in videos. Our experiments show that TCNet successfully generates videos at clinically appropriate frame rates, significantly improving the reconstruction of blood flow and respiratory patterns. We will publicly release our code and datasets.
{"title":"Extensions in channel and class dimensions for attention-based knowledge distillation","authors":"Liangtai Zhou, Weiwei Zhang, Banghui Zhang, Yufeng Guo, Junhuang Wang, Xiaobin Li, Jianqing Zhu","doi":"10.1016/j.cviu.2025.104359","DOIUrl":"10.1016/j.cviu.2025.104359","url":null,"abstract":"<div><div>As knowledge distillation technology evolves, it has bifurcated into three distinct methodologies: logic-based, feature-based, and attention-based knowledge distillation. Although the principle of attention-based knowledge distillation is more intuitive, its performance lags behind the other two methods. To address this, we systematically analyze the advantages and limitations of traditional attention-based methods. In order to optimize these limitations and explore more effective attention information, we expand attention-based knowledge distillation in the channel and class dimensions, proposing Spatial Attention-based Knowledge Distillation with Channel Attention (SAKD-Channel) and Spatial Attention-based Knowledge Distillation with Class Attention (SAKD-Class). On CIFAR-100, with ResNet8<span><math><mo>×</mo></math></span>4 as the student model, SAKD-Channel improves Top-1 validation accuracy by 1.98%, and SAKD-Class improves it by 3.35% compared to traditional distillation methods. On ImageNet, using ResNet18, these two methods improve Top-1 validation accuracy by 0.55% and 0.17%, respectively, over traditional methods. We also conduct extensive experiments to investigate the working mechanisms and application conditions of channel and class dimensions knowledge distillation, providing new theoretical insights for attention-based knowledge transfer.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104359"},"PeriodicalIF":4.3,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143838732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: MF-LPR²: Multi-frame license plate image restoration and recognition using optical flow
Authors: Kihyun Na, Junseok Oh, Youngkwan Cho, Bumjin Kim, Sungmin Cho, Jinyoung Choi, Injung Kim
Journal: Computer Vision and Image Understanding, Vol. 256, Article 104361 (published 2025-04-03)
DOI: 10.1016/j.cviu.2025.104361
Abstract: License plate recognition (LPR) is important for traffic law enforcement, crime investigation, and surveillance. However, license plate areas in dash cam images often suffer from low resolution, motion blur, and glare, which make accurate recognition challenging. Existing generative models that rely on pretrained priors cannot reliably restore such poor-quality images and frequently introduce severe artifacts and distortions. To address this issue, we propose a novel multi-frame license plate restoration and recognition framework, MF-LPR², which resolves ambiguities in poor-quality images by aligning and aggregating neighboring frames instead of relying on pretrained knowledge. To achieve accurate frame alignment, we employ a state-of-the-art optical flow estimator in conjunction with carefully designed algorithms that detect and correct erroneous optical flow estimates by leveraging the spatio-temporal consistency inherent in license plate image sequences. Our approach enhances both image quality and recognition accuracy while preserving the evidential content of the input images. In addition, we constructed a novel Realistic LPR (RLPR) dataset to evaluate MF-LPR². The RLPR dataset contains 200 pairs of low-quality license plate image sequences and high-quality pseudo ground-truth images, reflecting the complexities of real-world scenarios. In experiments, MF-LPR² outperformed eight recent restoration models in terms of PSNR, SSIM, and LPIPS by significant margins. In recognition, MF-LPR² achieved an accuracy of 86.44%, outperforming both the best single-frame LPR (16.18%) and the best multi-frame LPR (82.55%) among the eleven baseline models. Ablation studies confirm that our filtering and refinement algorithms significantly contribute to these improvements.
Title: Modality mixer exploiting complementary information for multi-modal action recognition
Authors: Sumin Lee, Sangmin Woo, Muhammad Adi Nugroho, Changick Kim
Journal: Computer Vision and Image Understanding, Vol. 256, Article 104358 (published 2025-04-03)
DOI: 10.1016/j.cviu.2025.104358
Abstract: Due to the distinctive characteristics of sensors, each modality exhibits unique physical properties. For this reason, in the context of multi-modal action recognition, it is important to consider not only the overall action content but also the complementary nature of different modalities. In this paper, we propose a novel network, named the Modality Mixer (M-Mixer) network, which effectively leverages and incorporates complementary information across modalities together with the temporal context of actions for action recognition. A key component of our proposed M-Mixer is the Multi-modal Contextualization Unit (MCU), a simple yet effective recurrent unit. The MCU temporally encodes a sequence of one modality (e.g., RGB) with action content features of the other modalities (e.g., depth and infrared). This process encourages the M-Mixer network to exploit global action content and to supplement it with complementary information from the other modalities. Furthermore, to extract appropriate complementary information for a given modality setting, we introduce a new module named the Complementary Feature Extraction Module (CFEM). CFEM incorporates separate learnable query embeddings for each modality, which guide CFEM to extract complementary information and global action content from the other modalities. As a result, our proposed method outperforms state-of-the-art methods on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets. Moreover, comprehensive ablation studies further validate the effectiveness of the proposed method.