{"title":"Reliable Few-Shot Learning Under Dual Noises","authors":"Ji Zhang;Jingkuan Song;Lianli Gao;Nicu Sebe;Heng Tao Shen","doi":"10.1109/TPAMI.2025.3584051","DOIUrl":"10.1109/TPAMI.2025.3584051","url":null,"abstract":"Recent advances in model pre-training give rise to task adaptation-based few-shot learning (FSL), where the goal is to adapt a pre-trained task-agnostic model for capturing task-specific knowledge with a few-labeled support samples of the target task. Nevertheless, existing approaches may still fail in the open world due to the inevitable <italic>in-distribution (ID)</i> and <italic>out-of-distribution (OOD)</i> noise from both support and query samples of the target task. With limited support samples available, <italic>i</i>) the adverse effect of the dual noises can be severely amplified during task adaptation, and <italic>ii</i>) the adapted model can produce unreliable predictions on query samples in the presence of the dual noises. In this work, we propose <bold>DE</b>noised <bold>T</b>ask <bold>A</b>daptation (<bold>DETA</b>++) for reliable FSL. DETA++ uses a Contrastive Relevance Aggregation (CoRA) module to calculate image and region weights for support samples, based on which a <italic>clean prototype</i> loss and a <italic>noise entropy maximization</i> loss are proposed to achieve noise-robust task adaptation. Additionally, DETA++ employs a memory bank to store and refine clean regions for each inner-task class, based on which a Local Nearest Centroid Classifier (LocalNCC) is devised to yield noise-robust predictions on query samples. Moreover, DETA++ utilizes an Intra-class Region Swapping (IntraSwap) strategy to rectify ID class prototypes during task adaptation, enhancing the model’s robustness to the dual noises. Extensive experiments demonstrate the effectiveness and flexibility of DETA++.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"9005-9022"},"PeriodicalIF":18.6,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144520674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Concrete to Abstract: Multi-View Clustering on Relational Knowledge","authors":"Ke Liang;Lingyuan Meng;Hao Li;Jun Wang;Long Lan;Miaomiao Li;Xinwang Liu;Huaimin Wang","doi":"10.1109/TPAMI.2025.3582689","DOIUrl":"10.1109/TPAMI.2025.3582689","url":null,"abstract":"Multi-view clustering (MVC) is a fast-growing research direction. However, most existing MVC works focus on concrete objects (e.g., cats, desks) but ignore abstract objects (e.g., knowledge, thoughts), which are also important parts of our daily lives and more correlated to cognition. Relational knowledge, as a typical abstract concept, describes the relationship between entities. For example, “<italic>Cats like eating fishes</i>,” as relational knowledge, reveals the relationship “<italic>eating</i>” between “<italic>cats</i>” and “<italic>fishes</i>.” To fill this gap, we first point out that MVC on relational knowledge is considered an important scenario. Then, we construct <bold>8</b> new datasets to lay research grounds for them. Moreover, a simple yet effective relational knowledge MVC paradigm (RK-MVC) is proposed by compensating the omitted sample-global correlations from the structural knowledge information. Concretely, the basic consensus features are first learned via adopted MVC backbones, and sample-global correlations are generated in both coarse-grained and fine-grained manners. In particular, the sample-global correlation learning module can be easily extended to various MVC backbones. Finally, both basic consensus features and sample-global correlation features are weighted fused as the target consensus feature. We adopt <bold>9</b> typical MVC backbones in this paper for comparison from <bold>7</b> aspects, demonstrating the promising capacity of our RK-MVC.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"9043-9060"},"PeriodicalIF":18.6,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144503376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NICE: Improving Panoptic Narrative Detection and Segmentation With Cascading Collaborative Learning","authors":"Haowei Wang;Jiayi Ji;Tianyu Guo;Yilong Yang;Xiaoshuai Sun;Rongrong Ji","doi":"10.1109/TPAMI.2025.3583795","DOIUrl":"10.1109/TPAMI.2025.3583795","url":null,"abstract":"Panoptic Narrative Detection (PND) and Segmentation (PNS) are two challenging tasks that involve identifying and locating multiple targets in an image according to a long narrative description. In this paper, we propose a unified and effective framework called NICE that can jointly learn these two panoptic narrative recognition tasks. Existing visual grounding tasks use a two-branch paradigm, but applying this directly to PND and PNS can result in prediction conflict due to their intrinsic many-to-many alignment property. To address this, we introduce two cascading modules based on the barycenter of the mask, which are Coordinate Guided Aggregation (CGA) and Barycenter Driven Localization (BDL), responsible for segmentation and detection, respectively. By linking PNS and PND in series with the barycenter of segmentation as the anchor, our approach naturally aligns the two tasks and allows them to complement each other for improved performance. Specifically, CGA provides the barycenter as a reference for detection, reducing BDL’s reliance on a large number of candidate boxes. BDL leverages its excellent properties to distinguish different instances, which improves the performance of CGA for segmentation. Extensive experiments demonstrate that NICE surpasses all existing methods by a large margin, achieving 4.1% for PND and 2.9% for PNS over the state-of-the-art. These results validate the effectiveness of our proposed collaborative learning strategy.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8990-9004"},"PeriodicalIF":18.6,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144503610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Systematic Investigation of Sparse Perturbed Sharpness-Aware Minimization Optimizer","authors":"Peng Mi;Li Shen;Tianhe Ren;Yiyi Zhou;Tianshuo Xu;Xiaoshuai Sun;Tongliang Liu;Rongrong Ji;Dacheng Tao","doi":"10.1109/TPAMI.2025.3581310","DOIUrl":"10.1109/TPAMI.2025.3581310","url":null,"abstract":"Deep neural networks often suffer from poor generalization due to complex and non-convex loss landscapes. Sharpness-Aware Minimization (SAM) is a popular solution that smooths the loss landscape by minimizing the maximized change of training loss when adding a perturbation to the weight. However, indiscriminate perturbation of SAM on all parameters is suboptimal and results in excessive computation, double the overhead of common optimizers like Stochastic Gradient Descent (SGD). In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves sparse perturbation by a binary mask. To obtain the sparse mask, we provide two solutions based on Fisher information and dynamic sparse training, respectively. We investigate the impact of different masks, including unstructured, structured, and <inline-formula><tex-math>$N$</tex-math></inline-formula>:<inline-formula><tex-math>$M$</tex-math></inline-formula> structured patterns, as well as explicit and implicit forms of implementing sparse perturbation. We theoretically prove that SSAM can converge at the same rate as SAM, i.e., <inline-formula><tex-math>$O(log T/sqrt{T})$</tex-math></inline-formula> . Sparse SAM has the potential to accelerate training and smooth the loss landscape effectively. Extensive experimental results on CIFAR and ImageNet-1K confirm that our method is superior to SAM in terms of efficiency, and the performance is preserved or even improved with a perturbation of merely 50% sparsity.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8538-8549"},"PeriodicalIF":18.6,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144503377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting Essential and Nonessential Settings of Evidential Deep Learning","authors":"Mengyuan Chen;Junyu Gao;Changsheng Xu","doi":"10.1109/TPAMI.2025.3583410","DOIUrl":"10.1109/TPAMI.2025.3583410","url":null,"abstract":"Evidential Deep Learning (EDL) is an emerging method for uncertainty estimation that provides reliable predictive uncertainty in a single forward pass, attracting significant attention. Grounded in subjective logic, EDL derives Dirichlet concentration parameters from neural networks to construct a Dirichlet probability density function (PDF), modeling the distribution of class probabilities. Despite its success, EDL incorporates several nonessential settings: In model construction, (1) a commonly ignored prior weight parameter is fixed to the number of classes, while its value actually impacts the balance between the proportion of evidence and its magnitude in deriving predictive scores. In model optimization, (2) the empirical risk features a variance-minimizing optimization term that biases the PDF towards a Dirac delta function, potentially exacerbating overconfidence. (3) Additionally, the structural risk typically includes a KL-divergence-minimizing regularization, whose optimization direction extends beyond the intended purpose and contradicts common sense, diminishing the information carried by the evidence magnitude. Therefore, we propose Re-EDL, a simplified yet more effective variant of EDL, by relaxing the nonessential settings and retaining the essential one, namely, the adoption of projected probability from subjective logic. Specifically, Re-EDL treats the prior weight as an adjustable hyperparameter rather than a fixed scalar, and directly optimizes the expectation of the Dirichlet PDF provided by deprecating both the variance-minimizing optimization term and the divergence regularization term. Extensive experiments and state-of-the-art performance validate the effectiveness of our method.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8658-8673"},"PeriodicalIF":18.6,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144500692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAT+: Investigating and Enhancing Audio-Visual Understanding in Large Language Models","authors":"Qilang Ye;Zitong Yu;Rui Shao;Yawen Cui;Xiangui Kang;Xin Liu;Philip Torr;Xiaochun Cao","doi":"10.1109/TPAMI.2025.3582389","DOIUrl":"10.1109/TPAMI.2025.3582389","url":null,"abstract":"Multimodal Large Language Models (MLLMs) have gained significant attention due to their rich internal implicit knowledge for cross-modal learning. Although advances in bringing audio-visuals into LLMs have resulted in boosts for a variety of Audio-Visual Question Answering (AVQA) tasks, they still face two crucial challenges: 1) audio-visual <bold>ambiguity</b>, and 2) audio-visual <bold>hallucination</b>. Existing MLLMs can respond to audio-visual content, yet sometimes fail to describe specific objects due to the ambiguity or hallucination of responses. To overcome the two aforementioned issues, we introduce the <bold>CAT+</b>, which enhances MLLM to ensure more robust multimodal understanding. We first propose the Sequential Question-guided Module (SQM), which combines tiny transformer layers and cascades Q-Formers to realize a solid audio-visual grounding. After feature alignment and high-quality instruction tuning, we introduce Ambiguity Scoring Direct Preference Optimization (AS-DPO) to correct the problem of CAT+ bias toward ambiguous descriptions. To explore the hallucinatory deficits of MLLMs in dynamic audio-visual scenes, we build a new Audio-visual Hallucination Benchmark, named <italic>AVHbench</i>. This benchmark detects the extent of MLLM’s hallucinations across three different protocols in the perceptual object, counting, and holistic description tasks. Extensive experiments across video-based understanding, open-ended, and close-ended AVQA demonstrate the superior performance of our method. The AVHbench is released at <uri>https://github.com/rikeilong/Bay-CAT</uri>.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8674-8690"},"PeriodicalIF":18.6,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144488043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerated Self-Supervised Multi-Illumination Color Constancy With Hybrid Knowledge Distillation","authors":"Ziyu Feng;Bing Li;Congyan Lang;Zheming Xu;Haina Qin;Juan Wang;Weihua Xiong","doi":"10.1109/TPAMI.2025.3583090","DOIUrl":"10.1109/TPAMI.2025.3583090","url":null,"abstract":"Color constancy, the human visual system’s ability to perceive consistent colors under varying illumination conditions, is crucial for accurate color perception. Recently, deep learning algorithms have been introduced into this task and have achieved remarkable achievements. However, existing methods are limited by the scale of current multi-illumination datasets and model size, hindering their ability to learn discriminative features effectively and their practical value for deployment in cameras. To overcome these limitations, this paper proposes a multi-illumination color constancy approach based on self-supervised learning and knowledge distillation. This approach includes three phases: self-supervised pre-training, supervised fine-tuning, and knowledge distillation. During the pre-training phase, we train Transformer-based and U-Net based encoders by two pretext tasks: light normalization task to learn lighting color contextual representation and grayscale colorization task to acquire objects’ inherent color information. For the downstream color constancy task, we fine-tune the encoders and design a lightweight decoder to obtain better illumination distributions with fewer parameters. During the knowledge distillation phase, we introduce a hybrid knowledge distillation technique to align CNN features with those of Transformer and U-Net respectively. Our proposed method outperforms state-of-the-art techniques on multi-illumination and single-illumination benchmarks. Extensive ablation studies and visualizations confirm the effectiveness of our model.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8955-8972"},"PeriodicalIF":18.6,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144488045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Trade-Off Between Flatness and Optimization in Distributed Learning","authors":"Ying Cao;Zhaoxian Wu;Kun Yuan;Ali H. Sayed","doi":"10.1109/TPAMI.2025.3583104","DOIUrl":"10.1109/TPAMI.2025.3583104","url":null,"abstract":"This paper proposes a theoretical framework to evaluate and compare the performance of stochastic gradient algorithms for distributed learning in relation to their behavior around local minima in nonconvex environments. Previous works have noticed that convergence toward flat local minima tend to enhance the generalization ability of learning algorithms. This work discovers three interesting results. First, it shows that decentralized learning strategies are able to escape faster away from local minima and favor convergence toward flatter minima relative to the centralized solution. Second, in decentralized methods, the consensus strategy has a worse excess-risk performance than diffusion, giving it a better chance of escaping from local minima and favoring flatter minima. Third, and importantly, the ultimate classification accuracy is not solely dependent on the flatness of the local minimum but also on how well a learning algorithm can approach that minimum. In other words, the classification accuracy is a function of both flatness and optimization performance. In this regard, since diffusion has a lower excess-risk than consensus, when both algorithms are trained starting from random initial points, diffusion enhances the classification accuracy. The paper examines the interplay between the two measures of flatness and optimization error closely. One important conclusion is that decentralized strategies deliver in general enhanced classification accuracy because they strike a more favorable balance between flatness and optimization performance compared to the centralized solution.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8873-8888"},"PeriodicalIF":18.6,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144488554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Quality Pseudo-Labeling for Point Cloud Segmentation With Scene-Level Annotation","authors":"Lunhao Duan;Shanshan Zhao;Xingxing Weng;Jing Zhang;Gui-Song Xia","doi":"10.1109/TPAMI.2025.3583071","DOIUrl":"10.1109/TPAMI.2025.3583071","url":null,"abstract":"This paper investigates indoor point cloud semantic segmentation under scene-level annotation, which is less explored compared to methods relying on sparse point-level labels. In the absence of precise point-level labels, current methods first generate point-level pseudo-labels, which are then used to train segmentation models. However, generating accurate pseudo-labels for each point solely based on scene-level annotations poses a considerable challenge, substantially affecting segmentation performance. Consequently, to enhance accuracy, this paper proposes a high-quality pseudo-label generation framework by exploring contemporary multi-modal information and region-point semantic consistency. Specifically, with a cross-modal feature guidance module, our method utilizes 2D-3D correspondences to align point cloud features with corresponding 2D image pixels, thereby assisting point cloud feature learning. To further alleviate the challenge presented by the scene-level annotation, we introduce a region-point semantic consistency module. It produces regional semantics through a region-voting strategy derived from point-level semantics, which are subsequently employed to guide the point-level semantic predictions. Leveraging the aforementioned modules, our method can rectify inaccurate point-level semantic predictions during training and obtain high-quality pseudo-labels. Significant improvements over previous works on ScanNet v2 and S3DIS datasets under scene-level annotation can demonstrate the effectiveness. Additionally, comprehensive ablation studies validate the contributions of our approach’s individual components.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"9360-9366"},"PeriodicalIF":18.6,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144488047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"H-Calibration: Rethinking Classifier Recalibration With Probabilistic Error-Bounded Objective","authors":"Wenjian Huang;Guiping Cao;Jiahao Xia;Jingkun Chen;Hao Wang;Jianguo Zhang","doi":"10.1109/TPAMI.2025.3582796","DOIUrl":"10.1109/TPAMI.2025.3582796","url":null,"abstract":"Deep neural networks have demonstrated remarkable performance across numerous learning tasks but often suffer from miscalibration, resulting in unreliable probability outputs. This has inspired many recent works on mitigating miscalibration, particularly through post-hoc recalibration methods that aim to obtain calibrated probabilities without sacrificing the classification performance of pre-trained models. In this study, we summarize and categorize previous works into three general strategies: intuitively designed methods, binning-based methods, and methods based on formulations of ideal calibration. Through theoretical and practical analysis, we highlight ten common limitations in previous approaches. To address these limitations, we propose a probabilistic learning framework for calibration called <inline-formula><tex-math>$h$</tex-math></inline-formula>-calibration, which theoretically constructs an equivalent learning formulation for canonical calibration with boundedness. On this basis, we design a simple yet effective post-hoc calibration algorithm. Our method not only overcomes the ten identified limitations but also achieves markedly better performance than traditional methods, as validated by extensive experiments. We further analyze, both theoretically and experimentally, the relationship and advantages of our learning objective compared to traditional proper scoring rule. In summary, our probabilistic framework derives an approximately equivalent differentiable objective for learning error-bounded calibrated probabilities, elucidating the correspondence and convergence properties of computational statistics with respect to theoretical bounds in canonical calibration. The theoretical effectiveness is verified on standard post-hoc calibration benchmarks by achieving state-of-the-art performance. This research offers valuable reference for learning reliable likelihood in related fields.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"9023-9042"},"PeriodicalIF":18.6,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144479303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}