{"title":"CCDPlus: Towards Accurate Character to Character Distillation for Text Recognition","authors":"Tongkun Guan;Wei Shen;Xiaokang Yang","doi":"10.1109/TPAMI.2025.3533737","DOIUrl":"10.1109/TPAMI.2025.3533737","url":null,"abstract":"Existing scene text recognition methods leverage large-scale labeled synthetic data (LSD) to reduce reliance on labor-intensive annotation tasks and improve recognition capability in real-world scenarios. However, the emergence of a synth-to-real domain gap still limits their efficiency and robustness. Consequently, harvesting the meaningful intrinsic qualities of unlabeled real data (URD) is of great importance, given the prevalence of text-laden images. Toward this goal, recent efforts have focused on pre-training on URD through sequence-to-sequence self-supervised learning, followed by fine-tuning on LSD via supervised learning. Nevertheless, they encounter three important issues: coarse representation learning units, inflexible data augmentation, and an emerging real-to-synth domain drift. To overcome these challenges, we propose CCDPlus, an accurate character-to-character distillation method for scene text recognition with a joint supervised and self-supervised learning framework. Specifically, tailored for text images, CCDPlus delineates the fine-grained character structures on URD as representation units by transferring knowledge learned from LSD online. Without requiring extra bounding box or pixel-level annotations, this process allows CCDPlus to enable character-to-character distillation flexibly with versatile data augmentation, which effectively extracts general real-world character-level feature representations. Meanwhile, the unified framework combines self-supervised learning on URD with supervised learning on LSD, effectively solving the domain inconsistency and enhancing the recognition performance. 
Extensive experiments demonstrate that CCDPlus outperforms previous state-of-the-art (SOTA) supervised, semi-supervised, and self-supervised methods by an average of 1.8%, 0.6%, and 1.1% on standard datasets, respectively. Additionally, it achieves a 6.1% improvement on the more challenging Union14M-L dataset.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3546-3562"},"PeriodicalIF":0.0,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143417822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uni-MoE: Scaling Unified Multimodal LLMs With Mixture of Experts","authors":"Yunxin Li;Shenyuan Jiang;Baotian Hu;Longyue Wang;Wanqi Zhong;Wenhan Luo;Lin Ma;Min Zhang","doi":"10.1109/TPAMI.2025.3532688","DOIUrl":"10.1109/TPAMI.2025.3532688","url":null,"abstract":"Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data to boost performance, yet this often incurs substantial computational costs. Although the Mixture of Experts (MoE) architecture has been employed to scale large language or visual-language models efficiently, these efforts typically involve fewer experts and limited modalities. To address this, our work presents the pioneering attempt to develop a unified MLLM with the MoE architecture, named <bold>Uni-MoE</b>, which can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We also implement a sparse MoE architecture within the LLMs to enable efficient training and inference through modality-level data parallelism and expert-level model parallelism. To enhance multi-expert collaboration and generalization, we present a progressive training strategy: 1) Cross-modality alignment using various connectors with different cross-modality data, 2) Training modality-specific experts with cross-modality instruction data to activate experts’ preferences, and 3) Tuning the whole Uni-MoE framework utilizing Low-Rank Adaptation (LoRA) on mixed multimodal instruction data. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets. 
The extensive experimental results demonstrate Uni-MoE's principal advantage of significantly reducing performance bias in handling mixed multimodal datasets, alongside improved multi-expert collaboration and generalization.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3424-3439"},"PeriodicalIF":0.0,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143417824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning","authors":"Yuting He;Boyu Wang;Rongjun Ge;Yang Chen;Guanyu Yang;Shuo Li","doi":"10.1109/TPAMI.2025.3540644","DOIUrl":"10.1109/TPAMI.2025.3540644","url":null,"abstract":"Dense contrastive representation learning (DCRL) has greatly improved the learning efficiency for image dense prediction tasks, showing its great potential to reduce the large costs of medical image collection and dense annotation. However, the properties of medical images make correspondence discovery unreliable, giving rise to an open problem of <italic>large-scale false positive and negative</i> (FP&N) <italic>pairs</i> in DCRL. In this paper, we propose <bold>GE</b>o<bold>M</b>etric v<bold>I</b>sual de<bold>N</b>se s<bold>I</b>milarity (<bold>GEMINI</b>) learning, which embeds the homeomorphism prior into DCRL and enables reliable correspondence discovery for effective dense contrast. We propose deformable homeomorphism learning (DHL), which models the homeomorphism of medical images and learns to estimate a deformable mapping to predict the pixels’ correspondence under the condition of topological preservation. It effectively reduces the searching space of pairing and drives an implicit and soft learning of negative pairs via gradient. We also propose a geometric semantic similarity (GSS), which extracts semantic information in features to measure the alignment degree for the correspondence learning. It promotes the learning efficiency and performance of deformation, constructing positive pairs reliably. We implement two practical variants on two typical representation learning tasks in our experiments. Our promising results on seven datasets, where we outperform existing methods, demonstrate the superiority of our approach. 
We will release our code at a companion <italic>website</i>.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"4122-4139"},"PeriodicalIF":0.0,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143393229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Metric Based on Association Rules to Assess Feature-Attribution Explainability Techniques for Time Series Forecasting","authors":"Ángela R. Troncoso-García;María Martínez-Ballesteros;Francisco Martínez-Álvarez;Alicia Troncoso","doi":"10.1109/TPAMI.2025.3540513","DOIUrl":"10.1109/TPAMI.2025.3540513","url":null,"abstract":"This paper introduces a new model-independent metric, called RExQUAL, for quantifying the quality of explanations provided by attribution-based explainable artificial intelligence techniques and comparing them. The underlying idea is based on feature attribution, using a subset of the ranking of the attributes highlighted by a model-agnostic explainable method in a forecasting task. Then, association rules are generated using these key attributes as input data. Novel metrics, including global support and confidence, are proposed to assess the joint quality of generated rules. Finally, the quality of the explanations is calculated based on a comprehensive combination of the association rules' global metrics. The proposed method integrates local explanations, through attribution-based approaches for evaluation and feature selection, with global explanations for the entire dataset. This paper rigorously evaluates the new metric by comparing three explainability techniques: the widely used SHAP and LIME, and the novel methodology RULEx. The experimental design includes predicting time series of different natures, both univariate and multivariate, with deep learning models. 
The results underscore the efficacy and versatility of the proposed methodology as a quantitative framework for evaluating and comparing explainable techniques.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"4140-4155"},"PeriodicalIF":0.0,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10879535","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143393228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual-Level Matching With Outlier Filtering for Unsupervised Visible-Infrared Person Re-Identification","authors":"Mang Ye;Zesen Wu;Bo Du","doi":"10.1109/TPAMI.2025.3541053","DOIUrl":"10.1109/TPAMI.2025.3541053","url":null,"abstract":"Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality retrieval task due to the large modality gap. While numerous efforts have been devoted to the supervised setting with a large amount of labeled cross-modality correspondences, few studies have tried to mitigate the modality gap by mining cross-modality correspondences in an unsupervised manner. However, existing works fail to capture the intrinsic relations among samples across the two modalities, resulting in limited performance. In this paper, we propose a novel Progressive Graph Matching (PGM) approach to globally model the cross-modality relationships and instance-level affinities. PGM formulates cross-modality correspondence mining as a graph matching procedure, aiming to integrate global information by minimizing global matching costs. Considering that samples in wrong clusters cannot find reliable cross-modality correspondences by PGM, we further introduce a robust Dual-Level Matching (DLM) mechanism, combining the cluster-level PGM and Nearest Instance-Cluster Searching (NICS) with instance-level affinity optimization. Additionally, we design an Outlier Filter Strategy (OFS) to filter out unreliable cross-modality correspondences based on the dual-level relation constraints. To mitigate the accumulation of false matches in cross-modality correspondence learning, an Alternate Cross Contrastive Learning (ACCL) module is proposed to alternately adjust the dominated matching, i.e., visible-to-infrared or infrared-to-visible matching. 
Empirical results demonstrate the superiority of our unsupervised solution, achieving comparable performance with supervised counterparts.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3815-3829"},"PeriodicalIF":0.0,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143393227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MIXRTs: Toward Interpretable Multi-Agent Reinforcement Learning via Mixing Recurrent Soft Decision Trees","authors":"Zichuan Liu;Yuanyang Zhu;Zhi Wang;Yang Gao;Chunlin Chen","doi":"10.1109/TPAMI.2025.3540467","DOIUrl":"10.1109/TPAMI.2025.3540467","url":null,"abstract":"While achieving tremendous success in various fields, existing multi-agent reinforcement learning (MARL) with a black-box neural network makes decisions in an opaque manner that hinders humans from understanding the learned knowledge and how input observations influence decisions. In contrast, existing interpretable approaches usually suffer from weak expressivity and low performance. To bridge this gap, we propose MIXing Recurrent soft decision Trees (MIXRTs), a novel interpretable architecture that can represent explicit decision processes via the root-to-leaf path and reflect each agent’s contribution to the team. Specifically, we construct a novel soft decision tree using a recurrent structure and demonstrate which features influence the decision-making process. Then, based on the value decomposition framework, we linearly assign credit to each agent by explicitly mixing individual action values to estimate the joint action value using only local observations, providing new insights into interpreting the cooperation mechanism. Theoretical analysis confirms that MIXRTs guarantee additivity and monotonicity in the factorization of joint action values. 
Evaluations on complex tasks like Spread and StarCraft II demonstrate that MIXRTs compete with existing methods while providing clear explanations, paving the way for interpretable and high-performing MARL systems.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"4090-4107"},"PeriodicalIF":0.0,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143393230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Convection-Diffusion Equation: A Theoretically Certified Framework for Neural Networks","authors":"Tangjun Wang;Chenglong Bao;Zuoqiang Shi","doi":"10.1109/TPAMI.2025.3540310","DOIUrl":"10.1109/TPAMI.2025.3540310","url":null,"abstract":"Differential equations have demonstrated intrinsic connections to network structures, linking discrete network layers through continuous equations. Most existing approaches focus on the interaction between ordinary differential equations (ODEs) and feature transformations, primarily working on input signals. In this paper, we study the partial differential equation (PDE) model of neural networks, viewing the neural network as a functional operating on a base model provided by the last layer of the classifier. Inspired by scale-space theory, we theoretically prove that this mapping can be formulated by a convection-diffusion equation, under interpretable and intuitive assumptions from both neural network and PDE perspectives. This theoretically certified framework covers various existing network structures and training techniques, offering a mathematical foundation and new insights into neural networks. Moreover, based on the convection-diffusion equation model, we design a new network structure that incorporates a diffusion mechanism into the network architecture from a PDE perspective. 
Extensive experiments on benchmark datasets and real-world applications confirm the effectiveness of the proposed model.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"4170-4182"},"PeriodicalIF":0.0,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143385484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implicit Shape and Appearance Priors for Few-Shot Full Head Reconstruction","authors":"Pol Caselles;Eduard Ramon;Jaime García;Gil Triginer;Francesc Moreno-Noguer","doi":"10.1109/TPAMI.2025.3540542","DOIUrl":"10.1109/TPAMI.2025.3540542","url":null,"abstract":"Recent advancements in learning techniques that employ coordinate-based neural representations have yielded remarkable results in multi-view 3D reconstruction tasks. However, these approaches often require a substantial number of input views (typically several tens) and computationally intensive optimization procedures to achieve their effectiveness. In this paper, we address these limitations specifically for the problem of few-shot full 3D head reconstruction. We accomplish this by incorporating a probabilistic shape and appearance prior into coordinate-based representations, enabling faster convergence and improved generalization when working with only a few input images (even as few as a single image). During testing, we leverage this prior to guide the fitting process of a signed distance function using a differentiable renderer. By incorporating the statistical prior alongside parallelizable ray tracing and dynamic caching strategies, we achieve an efficient and accurate approach to few-shot full 3D head reconstruction. Moreover, we extend the H3DS dataset, which now comprises 60 high-resolution 3D full-head scans and their corresponding posed images and masks, which we use for evaluation purposes. 
By leveraging this dataset, we demonstrate the remarkable capabilities of our approach in achieving state-of-the-art results in geometry reconstruction while being an order of magnitude faster than previous approaches.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3691-3705"},"PeriodicalIF":0.0,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143385483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Digging Deeper in Gradient for Unrolling-Based Accelerated MRI Reconstruction","authors":"Faming Fang;Tingting Wang;Guixu Zhang;Fang Li","doi":"10.1109/TPAMI.2025.3540218","DOIUrl":"10.1109/TPAMI.2025.3540218","url":null,"abstract":"There are two main methods that can be used to accelerate MRI reconstruction: parallel imaging and compressed sensing. To further accelerate the sampling process, the combination of these two methods has been extensively studied in recent years. However, existing MRI reconstruction methods often overlook the exploration of high-frequency information of images, leading to sub-optimal recovery of fine details in the reconstructed results. To address this issue, we conduct an in-depth analysis of image gradients and propose a novel MRI reconstruction model based on Maximum a Posteriori (MAP) estimation. We first establish the Cumulative Deviation from Maximum Gradient magnitude (CDMG) prior for fully sampled MR images through theoretical analysis, then incorporate this explicit CDMG prior along with an implicit deep prior to form the prior probability term. This combination of priors strikes a balance between physically informed constraints and data-driven adaptability, aiding in the recovery of meaningful high-frequency information. Additionally, we introduce a multi-order gradient operator to enhance the observation model, thereby improving the accuracy of the likelihood term. Through MAP estimation, we develop a novel accelerated MRI reconstruction model, the optimization of which is achieved by unrolling it into a convolutional neural network structure, referred to as DDGU-Net. 
Extensive experimental results demonstrate the effectiveness of our approach in reconstructing high-quality MR images and achieving state-of-the-art (SOTA) results, particularly at higher acceleration factors.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"4156-4169"},"PeriodicalIF":0.0,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143367300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MS-NeRF: Multi-Space Neural Radiance Fields","authors":"Ze-Xin Yin;Peng-Yi Jiao;Jiaxiong Qiu;Ming-Ming Cheng;Bo Ren","doi":"10.1109/TPAMI.2025.3540074","DOIUrl":"10.1109/TPAMI.2025.3540074","url":null,"abstract":"Existing Neural Radiance Fields (NeRF) methods suffer from the existence of reflective objects, often resulting in blurry or distorted rendering. Instead of calculating a single radiance field, we propose a multi-space neural radiance field (MS-NeRF) that represents the scene using a group of feature fields in parallel sub-spaces, which gives the neural network a better understanding of reflective and refractive objects. Our multi-space scheme works as an enhancement to existing NeRF methods, with only small computational overheads needed for training and inferring the extra-space outputs. We design different multi-space modules for representative MLP-based and grid-based NeRF methods, which improve Mip-NeRF 360 by 4.15 dB in PSNR with 0.5% extra parameters and further improve TensoRF by 2.71 dB with 0.046% extra parameters on reflective regions without degrading the rendering quality on other regions. We further construct a novel dataset consisting of 33 synthetic scenes and 7 real captured scenes with complex reflection and refraction, where we design complex camera paths to fully benchmark the robustness of NeRF-based methods. 
Extensive experiments show that our approach significantly outperforms the existing single-space NeRF methods for rendering high-quality scenes concerned with complex light paths through mirror-like objects.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3766-3783"},"PeriodicalIF":0.0,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143367301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}