{"title":"Recursive Confidence Training for Pseudo-Labeling Calibration in Semi-Supervised Few-Shot Learning","authors":"Kunlei Jing;Hebo Ma;Chen Zhang;Lei Wen;Zhaorui Zhang","doi":"10.1109/TIP.2025.3569196","DOIUrl":"10.1109/TIP.2025.3569196","url":null,"abstract":"Semi-Supervised Few-Shot Learning (SSFSL) aims to address the data scarcity in few-shot learning by leveraging both a few labeled support data and abundant unlabeled data. In SSFSL, a classifier trained on scarce support data is often biased and thus assigns inaccurate pseudo-labels to the unlabeled data, which will mislead downstream learning tasks. To combat this issue, we introduce a novel method called Certainty-Aware Recursive Confidence Training (CARCT). CARCT hinges on the insight that selecting pseudo-labeled data based on confidence levels can yield more informative support data, which is crucial for retraining an unbiased classifier to achieve accurate pseudo-labeling—a process we term pseudo-labeling calibration. We observe that accurate pseudo-labels typically exhibit smaller certainty entropy, indicating high-confidence pseudo-labeling compared to those of inaccurate pseudo-labels. Accordingly, CARCT constructs a joint double-Gaussian model to fit the certainty entropies collected across numerous SSFSL tasks. Thereby, A semi-supervised Prior Confidence Distribution (ssPCD) is learned to aid in distinguishing between high-confidence and low-confidence pseudo-labels. During an SSFSL task, ssPCD guides the selection of both high-confidence and low-confidence pseudo-labeled data to retrain the classifier that then assigns more accurate pseudo-labels to the low-confidence pseudo-labeled data. Such recursive confidence training continues until the low-confidence ones are exhausted, terminating the pseudo-labeling calibration. The unlabeled data all receive accurate pseudo-labels to expand the few support data to generalize the downstream learning task, which in return meta-refines the classifier, named self-training, to boost the pseudo-labeling in subsequent tasks. Extensive experiments on basic and extended SSFSL setups showcase the superiority of CARCT versus state-of-the-art methods, and comprehensive ablation studies and visualizations justify our insight. The source code is available at <uri>https://github.com/Klein-JING/CARCT</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3194-3208"},"PeriodicalIF":0.0,"publicationDate":"2025-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144067128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Single-Source Frequency Transform for Cross-Scene Classification of Hyperspectral Image","authors":"Xizeng Huang;Yanni Dong;Yuxiang Zhang;Bo Du","doi":"10.1109/TIP.2025.3568749","DOIUrl":"10.1109/TIP.2025.3568749","url":null,"abstract":"Currently, the research on cross-scene classification of hyperspectral image (HSI) based on domain generalization (DG) has received wider attention. The majority of the existing methods achieve cross-scene classification of HSI via data manipulation that generates more feature-rich samples. The insufficient mining of complex features of HSIs in these methods leads to limiting the effectiveness of the newly generated HSI samples. Therefore, in this paper, we propose a novel single-source frequency transform (SFT), which realizes domain generalization by transforming the frequency features of samples, mainly including frequency transform (FT) and balanced attentional consistency (BAC). Firstly, FT is designed to learn dynamic attention maps in the frequency space of samples filtering frequency components to improve the diversity of features in new samples. Moreover, BAC is designed based on the class activation map to improve the reliability of newly generated samples. Comprehensive experiments on three public HSI datasets demonstrate that the proposed method outperforms the state-of-the-art method, with accuracy at most 5.14% higher than the second place.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3000-3012"},"PeriodicalIF":0.0,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144065843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CmdVIT: A Voluntary Facial Expression Recognition Model for Complex Mental Disorders","authors":"Jiayu Ye;Yanhong Yu;Qingxiang Wang;Guolong Liu;Wentao Li;An Zeng;Yiqun Zhang;Yang Liu;Yunshao Zheng","doi":"10.1109/TIP.2025.3567825","DOIUrl":"10.1109/TIP.2025.3567825","url":null,"abstract":"Facial Expression Recognition (FER) is a critical method for evaluating the emotional states of patients with mental disorders, playing a significant role in treatment monitoring. However, due to privacy constraints, facial expression data from patients with mental disorders is severely limited. Additionally, the more complex inter-class and intra-class similarities compared to healthy individuals make accurate recognition of facial expressions challenging. Therefore, we propose a Voluntary Facial Expression Mimicry (VFEM) experiment, which collected facial expression data from schizophrenia, depression, and anxiety. This experiment establishes the first dataset designed for facial expression recognition tasks exclusively composed of patients with mental disorders. Simultaneously, based on VFEM, we propose a Vision Transformer FER model tailored for Complex mental disorder patients (CmdVIT). CmdVIT integrates crucial facial expression features through both explicit and implicit mechanisms, including explicit visual center positional encoding and implicit sparse attention center loss function. These two key components enhance positional information and minimize the facial feature space distance between conventional attention and critical attention, effectively suppressing inter-class and intra-class similarities. In various FER tasks for different mental disorders in VFEM, CmdVIT achieves more competitive performance compared to contemporary benchmark models. Our works are available at <uri>https://github.com/yjy-97/CmdVIT</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3013-3024"},"PeriodicalIF":0.0,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143979611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task-to-Instance Prompt Learning for Vision-Language Models at Test Time","authors":"Zhihe Lu;Jiawang Bai;Xin Li;Zeyu Xiao;Xinchao Wang","doi":"10.1109/TIP.2025.3546840","DOIUrl":"10.1109/TIP.2025.3546840","url":null,"abstract":"Prompt learning has been recently introduced into the adaption of pre-trained vision-language models (VLMs) by tuning a set of trainable tokens to replace hand-crafted text templates. Despite the encouraging results achieved, existing methods largely rely on extra annotated data for training. In this paper, we investigate a more realistic scenario, where only the unlabeled test data is available. Existing test-time prompt learning methods often separately learn a prompt for each test sample. However, relying solely on a single sample heavily limits the performance of the learned prompts, as it neglects the task-level knowledge that can be gained from multiple samples. To that end, we propose a novel test-time prompt learning method of VLMs, called Task-to-Instance PromPt LEarning (TIPPLE), which adopts a two-stage training strategy to leverage both task- and instance-level knowledge. Specifically, we reformulate the effective online pseudo-labeling paradigm along with two tailored components: an auxiliary text classification task and a diversity regularization term, to serve the task-oriented prompt learning. After that, the learned task-level prompt is further combined with a tunable residual for each test sample to integrate with instance-level knowledge. We demonstrate the superior performance of TIPPLE on 15 downstream datasets, e.g., the average improvement of 1.87% over the state-of-the-art method, using ViT-B/16 visual backbone. Our code is open-sourced at <uri>https://github.com/zhiheLu/TIPPLE</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1908-1920"},"PeriodicalIF":0.0,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143631135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Stage Statistical Texture-Guided GAN for Tilted Face Frontalization","authors":"Kangli Zeng;Zhongyuan Wang;Tao Lu;Jianyu Chen;Chao Liang;Zhen Han","doi":"10.1109/TIP.2025.3548896","DOIUrl":"10.1109/TIP.2025.3548896","url":null,"abstract":"Existing pose-invariant face recognition mainly focuses on frontal or profile, whereas high-pitch angle face recognition, prevalent under surveillance videos, has yet to be investigated. More importantly, tilted faces significantly differ from frontal or profile faces in the potential feature space due to self-occlusion, thus seriously affecting key feature extraction for face recognition. In this paper, we asymptotically reshape challenging high-pitch angle faces into a series of small-angle approximate frontal faces and exploit a statistical approach to learn texture features to ensure accurate facial component generation. In particular, we design a statistical texture-guided GAN for tilted face frontalization (STG-GAN) consisting of three main components. First, the face encoder extracts shallow features, followed by the face statistical texture modeling module that learns multi-scale face texture features based on the statistical distributions of the shallow features. Then, the face decoder performs feature deformation guided by the face statistical texture features while highlighting the pose-invariant face discriminative information. With the addition of multi-scale content loss, identity loss and adversarial loss, we further develop a pose contrastive loss of potential spatial features to constrain pose consistency and make its face frontalization process more reliable. On this basis, we propose a divide-and-conquer strategy, using STG-GAN to progressively synthesize faces with small pitch angles in multiple stages to achieve frontalization gradually. A unified end-to-end training across multiple stages facilitates the generation of numerous intermediate results to achieve a reasonable approximation of the ground truth. Extensive qualitative and quantitative experiments on multiple-face datasets demonstrate the superiority of our approach.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1726-1736"},"PeriodicalIF":0.0,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143618164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximately Invertible Neural Network for Learned Image Compression","authors":"Yanbo Gao;Shuai Li;Meng Fu;Chong Lv;Zhiyuan Yang;Xun Cai;Hui Yuan;Mao Ye","doi":"10.1109/TIP.2025.3567830","DOIUrl":"10.1109/TIP.2025.3567830","url":null,"abstract":"Learned image compression has attracted considerable interests in recent years. An analysis transform and a synthesis transform, which can be regarded as coupled transforms, are used to encode an image to latent feature and decode the feature after quantization to reconstruct the image. Inspired by the success of invertible neural networks in generative modeling, invertible modules can be used to construct the coupled analysis and synthesis transforms. Considering the noise introduced in the feature quantization invalidates the invertible process, this paper proposes an Approximately Invertible Neural Network (A-INN) framework for learned image compression. It formulates the rate-distortion optimization in lossy image compression when using INN with quantization, which differentiates from using INN for generative modelling. Generally speaking, A-INN can be used as the theoretical foundation for any INN based lossy compression method. Based on this formulation, A-INN with a progressive denoising module (PDM) is developed to effectively reduce the quantization noise in the decoding. Moreover, a Cascaded Feature Recovery Module (CFRM) is designed to learn high-dimensional feature recovery from low-dimensional ones to further reduce the noise in feature channel compression. In addition, a Frequency-enhanced Decomposition and Synthesis Module (FDSM) is developed by explicitly enhancing the high-frequency components in an image to address the loss of high-frequency information inherent in neural network based image compression, thereby enhancing the reconstructed image quality. Extensive experiments demonstrate that the proposed A-INN framework achieves better or comparable compression efficiency than the conventional image compression approach and state-of-the-art learned image compression methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3041-3055"},"PeriodicalIF":0.0,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143946351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments","authors":"Chubin Zhang;Juncheng Yan;Yi Wei;Jiaxin Li;Li Liu;Yansong Tang;Yueqi Duan;Jiwen Lu","doi":"10.1109/TIP.2025.3567828","DOIUrl":"10.1109/TIP.2025.3567828","url":null,"abstract":"Occupancy prediction reconstructs 3D structures of surrounding environments. It provides detailed information for autonomous driving planning and navigation. However, most existing methods heavily rely on the LiDAR point clouds to generate occupancy ground truth, which is not available in the vision-based system. In this paper, we propose an OccNeRF method for training occupancy networks without 3D ground truth. Different from previous works which consider a bounded scene, we parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras’ infinite perceptive range. The neural rendering is adopted to convert occupancy fields to multi-camera depth maps, supervised by multi-frame photometric consistency. Moreover, for semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model. Extensive experiments for both self-supervised depth estimation and 3D occupancy prediction tasks on nuScenes and SemanticKITTI datasets demonstrate the effectiveness of our method. The code is available at <uri>https://github.com/LinShan-Bin/OccNeRF</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"3096-3107"},"PeriodicalIF":0.0,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143946352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Level Multi-Instance Distillation for Self-Supervised Fine-Grained Visual Categorization","authors":"Qi Bi;Wei Ji;Jingjun Yi;Haolan Zhan;Gui-Song Xia","doi":"10.1109/TIP.2025.3567834","DOIUrl":"10.1109/TIP.2025.3567834","url":null,"abstract":"High-quality annotation of fine-grained visual categories demands great expert knowledge, which is taxing and time consuming. Alternatively, learning fine-grained visual representation from enormous unlabeled images (e.g., species, brands) by self-supervised learning becomes a feasible solution. However, recent investigations find that existing self-supervised learning methods are less qualified to represent fine-grained categories. The bottleneck lies in that the pre-trained class-agnostic representation is built from every patch-wise embedding, while fine-grained categories are only determined by several key patches of an image. In this paper, we propose a Cross-level Multi-instance Distillation (CMD) framework to tackle this challenge. Our key idea is to consider the importance of each image patch in determining the fine-grained representation by multiple instance learning. To comprehensively learn the relation between informative patches and fine-grained semantics, the multi-instance knowledge distillation is implemented on both the region/image crop pairs from the teacher and student net, and the region-image crops inside the teacher / student net, which we term as intra-level multi-instance distillation and inter-level multi-instance distillation. Extensive experiments on several commonly used datasets, including CUB-200-2011, Stanford Cars and FGVC Aircraft, demonstrate that the proposed method outperforms the contemporary methods by up to 10.14% and existing state-of-the-art self-supervised learning approaches by up to 19.78% on both top-1 accuracy and Rank-1 retrieval metric. Source code is available at <uri>https://github.com/BiQiWHU/CMD</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2954-2969"},"PeriodicalIF":0.0,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143946353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPAC: Sampling-Based Progressive Attribute Compression for Dense Point Clouds","authors":"Xiaolong Mao;Hui Yuan;Tian Guo;Shiqi Jiang;Raouf Hamzaoui;Sam Kwong","doi":"10.1109/TIP.2025.3565214","DOIUrl":"10.1109/TIP.2025.3565214","url":null,"abstract":"We propose an end-to-end attribute compression method for dense point clouds. The proposed method combines a frequency sampling module, an adaptive scale feature extraction module with geometry assistance, and a global hyperprior entropy model. The frequency sampling module uses a Hamming window and the Fast Fourier Transform to extract high-frequency components of the point cloud. The difference between the original point cloud and the sampled point cloud is divided into multiple sub-point clouds. These sub-point clouds are then partitioned using an octree, providing a structured input for feature extraction. The feature extraction module integrates adaptive convolutional layers and uses offset-attention to capture both local and global features. Then, a geometry-assisted attribute feature refinement module is used to refine the extracted attribute features. Finally, a global hyperprior model is introduced for entropy encoding. This model propagates hyperprior parameters from the deepest (base) layer to the other layers, further enhancing the encoding efficiency. At the decoder, a mirrored network is used to progressively restore features and reconstruct the color attribute through transposed convolutional layers. The proposed method encodes base layer information at a low bitrate and progressively adds enhancement layer information to improve reconstruction accuracy. Compared to the best anchor of the latest geometry-based point cloud compression (G-PCC) standard that was proposed by the Moving Picture Experts Group (MPEG), the proposed method can achieve an average Bjøntegaard delta bitrate of -24.58% for the Y component (resp. -21.23% for YUV components) on the MPEG Category Solid dataset and -22.48% for the Y component (resp. -17.19% for YUV components) on the MPEG Category Dense dataset. This is the first instance that a learning-based attribute codec outperforms the G-PCC standard on these datasets by following the common test conditions specified by MPEG. Our source code will be made publicly available on <uri>https://github.com/sduxlmao/SPAC</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2939-2953"},"PeriodicalIF":0.0,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143939701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interactive Face Video Coding: A Generative Compression Framework","authors":"Bolin Chen;Zhao Wang;Binzhe Li;Shurun Wang;Shiqi Wang;Yan Ye","doi":"10.1109/TIP.2025.3563762","DOIUrl":"10.1109/TIP.2025.3563762","url":null,"abstract":"In this paper, we propose a novel framework for Interactive Face Video Coding (IFVC), which allows humans to interact with the intrinsic visual representations instead of the signals. The proposed solution enjoys several distinct advantages, including ultra-compact representation, low delay interaction, and vivid expression/headpose animation. In particular, we propose the Internal Dimension Increase (IDI) based representation, greatly enhancing the fidelity and flexibility in rendering the appearance while maintaining reasonable representation cost. By leveraging strong statistical regularities, the visual signals can be effectively projected into controllable semantics in the three dimensional space (e.g., mouth motion, eye blinking, head rotation, head translation and head location), which are compressed and transmitted. The editable bitstream, which naturally supports the interactivity at the semantic level, can synthesize the face frames via the strong inference ability of the deep generative model. Experimental results have demonstrated the performance superiority and application prospects of our proposed IFVC scheme. In particular, the proposed scheme not only outperforms the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes in terms of rate-distortion performance for face videos, but also enables the interactive coding without introducing additional manipulation processes. Furthermore, the proposed framework is expected to shed lights on the future design of the digital human communication in the metaverse. The project page can be found at <uri>https://github.com/Berlin0610/Interactive_Face_Video_Coding</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2910-2925"},"PeriodicalIF":0.0,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143939757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}