{"title":"WAGE: Weight-Sharing Attribute-Missing Graph Autoencoder","authors":"Wenxuan Tu;Sihang Zhou;Xinwang Liu;Zhiping Cai;Yawei Zhao;Yue Liu;Kunlun He","doi":"10.1109/TPAMI.2025.3554053","DOIUrl":"10.1109/TPAMI.2025.3554053","url":null,"abstract":"Attribute-missing graph learning, a common yet challenging problem, has recently attracted considerable attention. Existing efforts have at least one of the following limitations: 1) lack a noise filtering and information enhancing scheme, resulting in less comprehensive data completion; 2) isolate the node attribute and graph structure encoding processes, introducing more parameters and failing to take full advantage of the two types of information; and 3) impose overly strict distribution assumptions on the latent variables, leading to biased or less discriminative node representations. To tackle these issues, based on the idea of introducing intimate information interaction between the two information sources, we propose <u><b>W</b></u>eight-sharing <u><b>A</b></u>ttribute-missing <u><b>G</b></u>raph auto<u><b>E</b></u>ncoder (WAGE) to boost the expressive capacity of node representations for high-quality missing attribute reconstruction. Specifically, three strategies are adopted. Firstly, we entangle the attribute embedding and structure embedding by introducing a weight-sharing architecture to share the parameters learned by both processes, which allows the network training to benefit from more abundant and diverse information. Secondly, we introduce a <inline-formula><tex-math>$K$</tex-math></inline-formula>-nearest neighbor-based dual non-local learning mechanism to improve the quality of data imputation by revealing unobserved high-confidence connections while filtering unreliable ones. 
Thirdly, we manually mask the connections on multiple adjacency matrices and force the structure-oriented embedding sub-network to recover the actual adjacency matrix, thus enabling the resulting network to selectively exploit more high-order discriminative features for data completion. Extensive experiments on six benchmark datasets demonstrate the effectiveness and superiority of WAGE against state-of-the-art competitors.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 7","pages":"5760-5777"},"PeriodicalIF":0.0,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143702815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Addressing Information Asymmetry: Deep Temporal Causality Discovery for Mixed Time Series","authors":"Jiawei Chen;Chunhui Zhao","doi":"10.1109/TPAMI.2025.3553957","DOIUrl":"10.1109/TPAMI.2025.3553957","url":null,"abstract":"While existing causal discovery methods mostly focus on continuous time series, causal discovery for mixed time series encompassing both continuous variables (CVs) and discrete variables (DVs) is a fundamental yet underexplored problem. Together with nonlinearity and high dimensionality, mixed time series pose significant challenges for causal discovery. This study addresses the aforementioned challenges based on the following recognitions: 1) DVs may originate from latent continuous variables (LCVs) and undergo discretization processes due to measurement limitations, storage requirements, and other reasons. 2) LCVs contain fine-grained information and interact with CVs. By leveraging these interactions, the intrinsic continuity of DVs can be recovered. Thereupon, we propose a generic deep mixed time series temporal causal discovery framework. Our key idea is to adaptively recover LCVs from DVs with the guidance of CVs and perform causal discovery in a unified continuous-valued space. Technically, a new contextual adaptive Gaussian kernel embedding technique is developed for latent continuity recovery by adaptively aggregating temporal contextual information of DVs. Accordingly, two interdependent model training stages are devised for learning the latent continuity recovery with self-supervision and causal structure learning with sparsity-induced optimization. 
Experimentally, extensive empirical evaluations and in-depth investigations validate the superior performance of our framework.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 7","pages":"5723-5741"},"PeriodicalIF":0.0,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143702469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning-Aided Neighborhood Search for Vehicle Routing Problems","authors":"Tong Guo;Yi Mei;Mengjie Zhang;Haoran Zhao;Kaiquan Cai;Wenbo Du","doi":"10.1109/TPAMI.2025.3554669","DOIUrl":"10.1109/TPAMI.2025.3554669","url":null,"abstract":"The Vehicle Routing Problem (VRP) is a classic optimization problem with diverse real-world applications. The neighborhood search has emerged as an effective approach, yielding high-quality solutions across different VRPs. However, most existing studies exhaustively explore all considered neighborhoods with a pre-fixed order, leading to an inefficient search process. To address this issue, this paper proposes a Learning-aided Neighborhood Search algorithm (LaNS) that employs a cutting-edge multi-agent reinforcement learning-driven adaptive operator/neighborhood selection mechanism to achieve efficient routing for VRP. Within this framework, two agents serve as high-level instructors, collaboratively guiding the search direction by selecting perturbation/improvement operators from a pool of low-level heuristics. Furthermore, to equip the agents with comprehensive information for learning guidance knowledge, we have developed a new informative state representation. This representation transforms the spatial route structures into an image-like tensor, allowing us to extract spatial features using a convolutional neural network. 
Comprehensive evaluations on diverse VRP benchmarks, including the capacitated VRP (CVRP), multi-depot VRP (MDVRP) and cumulative multi-depot VRP with energy constraints, demonstrate LaNS's superiority over the state-of-the-art neighborhood search methods as well as the existing learning-guided neighborhood search algorithms.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 7","pages":"5930-5944"},"PeriodicalIF":0.0,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143702483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GDRNPP: A Geometry-Guided and Fully Learning-Based Object Pose Estimator","authors":"Xingyu Liu;Ruida Zhang;Chenyangguang Zhang;Gu Wang;Jiwen Tang;Zhigang Li;Xiangyang Ji","doi":"10.1109/TPAMI.2025.3553485","DOIUrl":"10.1109/TPAMI.2025.3553485","url":null,"abstract":"6D pose estimation of rigid objects is a long-standing and challenging task in computer vision. Recently, the emergence of deep learning reveals the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses. Given that direct pose regression networks currently exhibit suboptimal performance, most methods still resort to traditional techniques to varying degrees. For example, top-performing methods often adopt an indirect strategy by first establishing 2D-3D or 3D-3D correspondences followed by applying the RANSAC-based P<inline-formula><tex-math>$n$</tex-math></inline-formula>P or Kabsch algorithms, and further employing ICP for refinement. Despite the performance enhancement, the integration of traditional techniques makes the networks time-consuming and not end-to-end trainable. Orthogonal to them, this paper introduces a fully learning-based object pose estimator. In this work, we first perform an in-depth investigation of both direct and indirect methods and propose a simple yet effective Geometry-guided Direct Regression Network (GDRN) to learn the 6D pose from monocular images in an end-to-end manner. Afterwards, we introduce a geometry-guided pose refinement module, enhancing pose accuracy when extra depth data is available. Guided by the predicted coordinate map, we build an end-to-end differentiable architecture that establishes robust and accurate 3D-3D correspondences between the observed and rendered RGB-D images to refine the pose. 
Our enhanced pose estimation pipeline GDRNPP (GDRN Plus Plus) conquered the leaderboard of the BOP Challenge for two consecutive years, becoming the first method to surpass all prior approaches that relied on traditional techniques in both accuracy and speed.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 7","pages":"5742-5759"},"PeriodicalIF":0.0,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143672344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of Noisy Supervision in Foundation Model Learning","authors":"Hao Chen;Zihan Wang;Ran Tao;Hongxin Wei;Xing Xie;Masashi Sugiyama;Bhiksha Raj;Jindong Wang","doi":"10.1109/TPAMI.2025.3552309","DOIUrl":"10.1109/TPAMI.2025.3552309","url":null,"abstract":"Foundation models are usually pre-trained on large-scale datasets and then adapted to different downstream tasks through tuning. This pre-training and then fine-tuning paradigm has become a standard practice in deep learning. However, the large-scale pre-training datasets, often inaccessible or too expensive to handle, can contain label noise that may adversely affect the generalization of the model and pose unexpected risks. This paper stands out as the first work to comprehensively understand and analyze the nature of noise in pre-training datasets and then effectively mitigate its impacts on downstream tasks. Specifically, through extensive experiments of fully-supervised and image-text contrastive pre-training on synthetic noisy ImageNet-1K, YFCC15M, and CC12M datasets, we demonstrate that, while slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share a similar distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing distributions are significantly different. These observations are agnostic to scales of pre-training datasets, pre-training noise types, model architectures, pre-training objectives, downstream tuning methods, and downstream applications. We empirically ascertain that the reason behind this is that the pre-training noise shapes the feature space differently. We then propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization; it is applicable in both parameter-efficient and black-box tuning manners, considering that one may not be able to access or fully fine-tune the pre-trained models. 
We additionally conduct extensive experiments on popular vision and language models, including APIs, which are supervised and self-supervised pre-trained on realistic noisy data for evaluation. Our analysis and results demonstrate the importance of this novel and fundamental research direction, which we term <i>Noisy Model Transfer Learning</i>.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 7","pages":"5690-5707"},"PeriodicalIF":0.0,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143672347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph Prompt Clustering","authors":"Man-Sheng Chen;Pei-Yuan Lai;De-Zhang Liao;Chang-Dong Wang;Jian-Huang Lai","doi":"10.1109/TPAMI.2025.3553129","DOIUrl":"10.1109/TPAMI.2025.3553129","url":null,"abstract":"Due to the wide existence of unlabeled graph-structured data (e.g., molecular structures), graph-level clustering has recently attracted increasing attention, whose goal is to divide the input graphs into several disjoint groups. However, the existing methods habitually focus on learning graph embeddings with different graph regularizations, and seldom account for the obvious differences in data distributions of distinct graph-level datasets. How to characteristically consider multiple graph-level datasets in a general well-designed model without prior knowledge is still challenging. In view of this, we propose a novel Graph Prompt Clustering (GPC) method. Within this model, there are two main modules, i.e., graph model pretraining as well as prompt and finetuning. In the graph model pretraining module, the graph model is pretrained on a selected source graph-level dataset with mutual information maximization and self-supervised clustering regularization. In the prompt and finetuning module, the network parameters of the pretrained graph model are frozen, and a group of learnable prompt vectors assigned to each graph-level representation is trained to adapt to different target graph-level datasets with various data distributions. 
Experimental results across six benchmark datasets demonstrate the impressive generalization capability and effectiveness of GPC compared with the state-of-the-art methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 7","pages":"5794-5805"},"PeriodicalIF":0.0,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143672019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reason and Discovery: A New Paradigm for Open Set Recognition","authors":"Yimin Fu;Zhunga Liu;Jialin Lyu","doi":"10.1109/TPAMI.2025.3552760","DOIUrl":"10.1109/TPAMI.2025.3552760","url":null,"abstract":"Open set recognition (OSR) effectively enhances the reliability of pattern recognition systems by accurately identifying samples of unknown classes. However, the decision-making process in most existing OSR methods adheres to an ill-considered pipeline, where classification probabilities are inferred directly from overall feature representations, neglecting the reasoning about inherent relations. Besides, the handling of identified unknown samples is typically restricted to the assignment of a generic “unknown” class label but fails to explore underlying category information. To tackle the above challenges, we propose a new paradigm for OSR, entitled Reason and Discovery (RAD), which comprises two main modules: the Reason Module and the Discovery Module. Specifically, in the Reason Module, the distinction between known and unknown is performed from the perspective of reasoning the matching relations between topological information and appearance characteristics of discriminative regions. Then, the mixture and recombination of relation representations across classes are employed to provide diverse estimations of unknown distribution, thereby recalibrating OSR decision boundaries. Moreover, in the Discovery Module, the identified unknown samples are semantically grouped through a biased deep clustering process for discovering novel category information. 
Experimental results on various datasets indicate that the proposed method can achieve outstanding OSR performance and good novel category discovery efficacy.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 7","pages":"5586-5599"},"PeriodicalIF":0.0,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143661530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DVIS++: Improved Decoupled Framework for Universal Video Segmentation","authors":"Tao Zhang;Xingye Tian;Yikang Zhou;Shunping Ji;Xuebo Wang;Xin Tao;Yuan Zhang;Pengfei Wan;Zhongyuan Wang;Yu Wu","doi":"10.1109/TPAMI.2025.3552694","DOIUrl":"10.1109/TPAMI.2025.3552694","url":null,"abstract":"We present the <b>D</b>ecoupled <b>VI</b>deo <b>S</b>egmentation (DVIS) framework, a novel approach for the challenging task of universal video segmentation, including video instance segmentation (VIS), video semantic segmentation (VSS), and video panoptic segmentation (VPS). Unlike previous methods that model video segmentation in an end-to-end manner, our approach decouples video segmentation into three cascaded sub-tasks: segmentation, tracking, and refinement. This decoupling design allows for simpler and more effective modeling of the spatio-temporal representations of objects, especially in complex scenes and long videos. Accordingly, we introduce two novel components: the referring tracker and the temporal refiner. These components track objects frame by frame and model spatio-temporal representations based on pre-aligned features. To improve the tracking capability of DVIS, we propose a denoising training strategy and introduce contrastive learning, resulting in a more robust framework named DVIS++. The proposed decoupled framework efficiently handles universal and open-vocabulary object representations, allowing DVIS++ to conduct universal and open-vocabulary video segmentation. We conduct extensive experiments on six mainstream benchmarks, including the VIS, VSS, and VPS datasets. 
Using a unified architecture, DVIS++ significantly outperforms state-of-the-art specialized methods on these benchmarks in closed- and open-vocabulary settings.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 7","pages":"5918-5929"},"PeriodicalIF":0.0,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143661216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Shot Video Object Segmentation","authors":"Kun Yan;Fangyun Wei;Shuyu Dai;Minghui Wu;Ping Wang;Chang Xu","doi":"10.1109/TPAMI.2025.3552779","DOIUrl":"10.1109/TPAMI.2025.3552779","url":null,"abstract":"Prior research in video object segmentation (VOS) predominantly relies on videos with dense annotations. However, obtaining pixel-level annotations is both costly and time-intensive. In this work, we highlight the potential of effectively training a VOS model using remarkably sparse video annotations—specifically, as few as one or two labeled frames per training video, yet maintaining near equivalent performance levels. We introduce this innovative training methodology as low-shot video object segmentation, abbreviated as low-shot VOS. Central to this method is the generation of reliable pseudo labels for unlabeled frames during the training phase, which are then used in tandem with labeled frames to optimize the model. Notably, our strategy is extremely simple and can be incorporated into the vast majority of current VOS models. For the first time, we propose a universal method for training VOS models on one-shot and two-shot VOS datasets. In the two-shot configuration, utilizing just 7.3% and 2.9% of labeled data from the YouTube-VOS and DAVIS benchmarks respectively, our model delivers results on par with those trained on completely labeled datasets. 
It is also worth noting that in the one-shot setting, a minor performance decrement is observed in comparison to models trained on fully annotated datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 7","pages":"5538-5555"},"PeriodicalIF":0.0,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143661531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating Inverse Feature Space for Class Imbalance in Point Cloud Semantic Segmentation","authors":"Jiawei Han;Kaiqi Liu;Wei Li;Feng Zhang;Xiang-Gen Xia","doi":"10.1109/TPAMI.2025.3553051","DOIUrl":"10.1109/TPAMI.2025.3553051","url":null,"abstract":"Point cloud semantic segmentation can enhance the understanding of the production environment and is a crucial component of vision tasks. The efficacy and generalization prowess of deep learning-based segmentation models are inherently contingent upon the quality and nature of the data employed in their training. However, it is often challenging to obtain data with inter-class balance, and training an intelligent segmentation network with imbalanced data may cause cognitive bias. In this paper, a network framework, InvSpaceNet, is proposed, which generates an inverse feature space to alleviate the cognitive bias caused by imbalanced data. Specifically, we design a dual-branch training architecture that combines the superior feature representations derived from instance-balanced sampling data with the cognitive corrections introduced by the proposed inverse sampling data. In the inverse feature space of the point cloud generated by the auxiliary branch, the central points aggregated by class are constrained by the contrastive loss. To refine the class cognition in the inverse feature space, features are used to generate point cloud class prototypes through momentum updates. These class prototypes from the inverse space are utilized to generate feature maps and structure maps that are aligned with the positive feature space of the main branch segmentation network. The training of the main branch is dynamically guided through gradients back-propagated from different losses. 
Extensive experiments conducted on four large benchmarks (i.e., S3DIS, ScanNet v2, Toronto-3D, and SemanticKITTI) demonstrate that the proposed method can effectively mitigate point cloud imbalance issues and improve segmentation performance.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 7","pages":"5778-5793"},"PeriodicalIF":0.0,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143661453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}