{"title":"On Testing and Learning Quantum Junta Channels","authors":"Zongbo Bao;Penghui Yao","doi":"10.1109/TPAMI.2025.3528648","DOIUrl":"10.1109/TPAMI.2025.3528648","url":null,"abstract":"We consider the problems of testing and learning quantum <inline-formula><tex-math>$k$</tex-math></inline-formula>-junta channels, which are <inline-formula><tex-math>$n$</tex-math></inline-formula>-qubit to <inline-formula><tex-math>$n$</tex-math></inline-formula>-qubit quantum channels acting non-trivially on at most <inline-formula><tex-math>$k$</tex-math></inline-formula> out of <inline-formula><tex-math>$n$</tex-math></inline-formula> qubits and leaving the rest of the qubits unchanged. We show the following. 1) An <inline-formula><tex-math>$O(k)$</tex-math></inline-formula>-query algorithm to distinguish whether the given channel is a <inline-formula><tex-math>$k$</tex-math></inline-formula>-junta channel or is <i>far</i> from any <inline-formula><tex-math>$k$</tex-math></inline-formula>-junta channel, and a lower bound of <inline-formula><tex-math>$\\Omega(\\sqrt{k})$</tex-math></inline-formula> on the number of queries; and 2) an <inline-formula><tex-math>$\\widetilde{O}(4^{k})$</tex-math></inline-formula>-query algorithm to learn a <inline-formula><tex-math>$k$</tex-math></inline-formula>-junta channel, and a lower bound of <inline-formula><tex-math>$\\Omega(4^{k}/k)$</tex-math></inline-formula> on the number of queries. This partially answers an open problem raised by (Chen et al. 2023). In order to settle these problems, we develop a Fourier analysis framework over the space of superoperators and prove several fundamental properties, which extends the Fourier analysis over the space of operators introduced in (Montanaro and Osborne, 2010). The distance metric we consider in this paper is obtained by Fourier analysis, which is essentially the L2-distance between Choi representations. 
In addition, we introduce <small>Influence-Sample</small> to replace <small>Fourier-Sample</small> proposed in (Atici and Servedio, 2007). Our <small>Influence-Sample</small> includes only single-qubit operations and results in only a constant-factor decrease in efficiency.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2991-3002"},"PeriodicalIF":0.0,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142974703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WAKE: Towards Robust and Physically Feasible Trajectory Prediction for Autonomous Vehicles With WAvelet and KinEmatics Synergy","authors":"Chengyue Wang;Haicheng Liao;Zhenning Li;Chengzhong Xu","doi":"10.1109/TPAMI.2025.3529259","DOIUrl":"10.1109/TPAMI.2025.3529259","url":null,"abstract":"Addressing the pervasive challenge of imperfect data in autonomous vehicle (AV) systems, this study pioneers an integrated trajectory prediction model, WAKE, that fuses physics-informed methodologies with sophisticated machine learning techniques. Our model operates in two principal stages: the initial stage utilizes a Wavelet Reconstruction Network to accurately reconstruct missing observations, thereby preparing a robust dataset for further processing. This is followed by the Kinematic Bicycle Model which ensures that reconstructed trajectory predictions adhere strictly to physical laws governing vehicular motion. The integration of these physics-based insights with a subsequent machine learning stage, featuring a Quantum Mechanics-Inspired Interaction-aware Module, allows for sophisticated modeling of complex vehicle interactions. This fusion approach not only enhances the prediction accuracy but also enriches the model's ability to handle real-world variability and unpredictability. Extensive tests using specific versions of MoCAD, NGSIM, HighD, INTERACTION, and nuScenes datasets featuring missing observational data, have demonstrated the superior performance of our model in terms of both accuracy and physical feasibility, particularly in scenarios with significant data loss—up to 75% missing observations. 
Our findings underscore the potency of combining physics-informed models with advanced machine learning frameworks to advance autonomous driving technologies, aligning with the interdisciplinary nature of information fusion.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"3126-3140"},"PeriodicalIF":0.0,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142974700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Long-Term Feature Extraction via Frequency Prediction for Efficient Reinforcement Learning","authors":"Jie Wang;Mingxuan Ye;Yufei Kuang;Rui Yang;Wengang Zhou;Houqiang Li;Feng Wu","doi":"10.1109/TPAMI.2025.3529264","DOIUrl":"10.1109/TPAMI.2025.3529264","url":null,"abstract":"Sample efficiency remains a key challenge for the deployment of deep reinforcement learning (RL) in real-world scenarios. A common approach is to learn efficient representations through future prediction tasks, helping the agent make farsighted decisions that benefit its long-term performance. Existing methods extract predictive features by predicting multi-step future state signals. However, they do not fully exploit the structural information inherent in sequential state signals, which can potentially improve the quality of long-term decision-making but is difficult to discern in the time domain. To tackle this problem, we introduce a new perspective that leverages the frequency domain of state sequences to extract the underlying patterns in time series data. We theoretically show that state sequences contain structural information closely tied to policy performance and signal regularity and analyze the fitness of the frequency domain for extracting these two types of structural information. Inspired by this, we propose a novel representation learning method, <b>S</b>tate Sequences <b>P</b>rediction via <b>F</b>ourier Transform (SPF), which extracts long-term features by predicting the Fourier transform of infinite-step future state sequences. The appealing features of our frequency prediction objective include: 1) simple to implement due to a recursive relationship; 2) providing an upper bound on the performance difference between the optimal policy and the latent policy in the representation space. 
Experiments on standard and goal-conditioned RL tasks demonstrate that the proposed method outperforms several state-of-the-art algorithms in terms of both sample efficiency and performance.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"3094-3110"},"PeriodicalIF":0.0,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142974697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Robust Point Cloud Recognition With Sample-Adaptive Auto-Augmentation","authors":"Jianan Li;Jie Wang;Junjie Chen;Tingfa Xu","doi":"10.1109/TPAMI.2025.3528392","DOIUrl":"10.1109/TPAMI.2025.3528392","url":null,"abstract":"Robust 3D perception amidst corruption is a crucial task in the realm of 3D vision. Conventional data augmentation methods aimed at enhancing corruption robustness typically apply random transformations to all point cloud samples offline, neglecting sample structure, which often leads to over- or under-enhancement. In this study, we propose an alternative approach to address this issue by employing sample-adaptive transformations based on sample structure, through an auto-augmentation framework named AdaptPoint++. Central to this framework is an imitator, which initiates with Position-aware Feature Extraction to derive intrinsic structural information from the input sample. Subsequently, a Deformation Controller and a Mask Controller predict per-anchor deformation and per-point masking parameters, respectively, facilitating corruption simulations. In conjunction with the imitator, a discriminator is employed to curb the generation of excessive corruption that deviates from the original data distribution. Moreover, we integrate a perception-guidance feedback mechanism to steer the generation of samples towards an appropriate difficulty level. To effectively train the classifier using the generated augmented samples, we introduce a Structure Reconstruction-assisted learning mechanism, bolstering the classifier's robustness by prioritizing intrinsic structural characteristics over superficial discrepancies induced by corruption. Additionally, to alleviate the scarcity of real-world corrupted point cloud data, we introduce two novel datasets: ScanObjectNN-C and MVPNET-C, closely resembling actual data in real-world scenarios. 
Experimental results demonstrate that our method attains state-of-the-art performance on multiple corruption benchmarks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"3003-3017"},"PeriodicalIF":0.0,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142974701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UniMatch V2: Pushing the Limit of Semi-Supervised Semantic Segmentation","authors":"Lihe Yang;Zhen Zhao;Hengshuang Zhao","doi":"10.1109/TPAMI.2025.3528453","DOIUrl":"10.1109/TPAMI.2025.3528453","url":null,"abstract":"Semi-supervised semantic segmentation (SSS) aims at learning rich visual knowledge from cheap unlabeled images to enhance semantic segmentation capability. Among recent works, UniMatch (Yang et al. 2023) improves on its predecessors tremendously by amplifying the practice of weak-to-strong consistency regularization. Subsequent works typically follow similar pipelines and propose various delicate designs. Despite the achieved progress, strangely, even in this flourishing era of numerous powerful vision models, almost all SSS works are still sticking to 1) using outdated ResNet encoders with small-scale ImageNet-1K pre-training, and 2) evaluation on simple Pascal and Cityscapes datasets. In this work, we argue that it is necessary to switch the baseline of SSS from ResNet-based encoders to more capable ViT-based encoders (e.g., DINOv2) that are pre-trained on massive data. A simple update on the encoder (even using 2× fewer parameters) can bring more significant improvement than careful method designs. Built on this competitive baseline, we present our upgraded and simplified UniMatch V2, inheriting the core spirit of weak-to-strong consistency from V1, but requiring less training cost and providing consistently better results. 
Additionally, given the gradually saturating performance on Pascal and Cityscapes, we argue that the community should focus on more challenging benchmarks with complex taxonomy, such as the ADE20K and COCO datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"3031-3048"},"PeriodicalIF":0.0,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142974699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Biased Stochastic Optimization","authors":"Zhuang Yang","doi":"10.1109/TPAMI.2025.3528193","DOIUrl":"10.1109/TPAMI.2025.3528193","url":null,"abstract":"This work develops and analyzes a class of adaptive biased stochastic optimization (ABSO) algorithms from the perspective of the GEneralized Adaptive gRadient (GEAR) method that contains Adam, AdaGrad, RMSProp, etc. Particularly, two preferred biased stochastic optimization (BSO) algorithms, the biased stochastic variance reduction gradient (BSVRG) algorithm and the stochastic recursive gradient algorithm (SARAH), equipped with GEAR, are first considered in this work, leading to two ABSO algorithms: BSVRG-GEAR and SARAH-GEAR. We present a uniform analysis of ABSO algorithms for minimizing strongly convex (SC) and Polyak-Łojasiewicz (PŁ) composite objective functions. Second, we also use our framework to develop another novel BSO algorithm, adaptive biased stochastic conjugate gradient (coined BSCG-GEAR), which achieves the well-known oracle complexity. Specifically, under mild conditions, we prove that the resulting ABSO algorithms attain a linear convergence rate on both PŁ and SC cases. Moreover, we show that the complexity of the resulting ABSO algorithms is comparable to that of advanced stochastic gradient-based algorithms. 
Finally, we demonstrate the empirical superiority and the numerical stability of the resulting ABSO algorithms by conducting numerical experiments on different applications of machine learning.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"3067-3078"},"PeriodicalIF":0.0,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142961308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DHVT: Dynamic Hybrid Vision Transformer for Small Dataset Recognition","authors":"Zhiying Lu;Chuanbin Liu;Xiaojun Chang;Yongdong Zhang;Hongtao Xie","doi":"10.1109/TPAMI.2025.3528228","DOIUrl":"10.1109/TPAMI.2025.3528228","url":null,"abstract":"The performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) persists due to the lack of inductive bias, notably when training from scratch with limited datasets. This paper identifies two crucial shortcomings in ViTs: <i>spatial relevance</i> and <i>diverse channel representation</i>. Thus, ViTs struggle to grasp fine-grained spatial features and robust channel representation due to insufficient data. We propose the Dynamic Hybrid Vision Transformer (DHVT) to address these challenges. Regarding the spatial aspect, DHVT introduces convolution in the feature embedding phase and feature projection modules to enhance spatial relevance. Regarding the channel aspect, the dynamic aggregation mechanism and a groundbreaking design “head token” facilitate the recalibration and harmonization of disparate channel representations. Moreover, we investigate the choices of the network meta-structure and adopt the optimal multi-stage hybrid structure without the conventional class token. The methods are then modified with a novel dimensional variable residual connection mechanism to leverage the potential of the structure sufficiently. This updated variant, called DHVT2, offers a more computationally efficient solution for vision-related tasks. DHVT and DHVT2 achieve state-of-the-art image recognition results, effectively bridging the performance gap between CNNs and ViTs. 
The downstream experiments further demonstrate their strong generalization capacities.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2615-2631"},"PeriodicalIF":0.0,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142961590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DiffTF++: 3D-Aware Diffusion Transformer for Large-Vocabulary 3D Generation","authors":"Ziang Cao;Fangzhou Hong;Tong Wu;Liang Pan;Ziwei Liu","doi":"10.1109/TPAMI.2025.3528247","DOIUrl":"10.1109/TPAMI.2025.3528247","url":null,"abstract":"Generating diverse and high-quality 3D assets automatically poses a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework to address these challenges <i>with a single model</i>. To handle the large diversity and complexity in geometry and texture across categories efficiently, we <b>1</b>) adopt improved triplane to guarantee efficiency; <b>2</b>) introduce the 3D-aware transformer to aggregate the generalized 3D knowledge with specialized 3D features; and <b>3</b>) devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware <b>Diff</b>usion model with <b>T</b>rans<b>F</b>ormer, <b>DiffTF</b>, we propose a stronger version for 3D generation, i.e., <b>DiffTF++</b>. It boils down to two parts: multi-view reconstruction loss and triplane refinement. Specifically, we utilize multi-view reconstruction loss to fine-tune the diffusion model and triplane decoder, thereby avoiding the negative influence caused by reconstruction errors and improving texture synthesis. By eliminating the mismatch between the two stages, the generative performance is enhanced, especially in texture. Additionally, a 3D-aware refinement process is introduced to filter out artifacts and refine triplanes, resulting in the generation of more intricate and reasonable details. 
Extensive experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules and the state-of-the-art 3D object generation performance with large diversity, rich semantics, and high quality.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"3018-3030"},"PeriodicalIF":0.0,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142961273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TDGI: Translation-Guided Double-Graph Inference for Document-Level Relation Extraction","authors":"Lingling Zhang;Yujie Zhong;Qinghua Zheng;Jun Liu;Qianying Wang;Jiaxin Wang;Xiaojun Chang","doi":"10.1109/TPAMI.2025.3528246","DOIUrl":"10.1109/TPAMI.2025.3528246","url":null,"abstract":"Document-level relation extraction (DocRE) aims at predicting relations of all entity pairs in one document, which plays an important role in information extraction. DocRE is more challenging than previous sentence-level relation extraction, as it often requires coreference and logical reasoning across multiple sentences. Graph-based methods are the mainstream solution to this complex reasoning in DocRE. They generally construct heterogeneous graphs with entities, mentions, and sentences as nodes, and co-occurrence and co-reference relations as edges. Their performance is difficult to improve further because the semantics and direction of the relation are not jointly considered in the graph inference process. To this end, we propose a novel translation-guided double-graph inference network named TDGI for DocRE. On one hand, TDGI includes two relation semantics-aware and direction-aware reasoning graphs, i.e., mention graph and entity graph, to mine relations among long-distance entities more explicitly. Each graph consists of three elements: vectorized nodes, edges, and direction weights. On the other hand, we devise an interesting translation-based graph updating strategy that guides the embeddings of mention/entity nodes, relation edges, and direction weights following the specific translation algebraic structure, thereby enhancing the reasoning skills of TDGI. In the training procedure of TDGI, we minimize the relation multi-classification loss and triple contrastive loss together to guarantee the model’s stability and robustness. 
Comprehensive experiments on three widely-used datasets show that TDGI achieves outstanding performance compared with state-of-the-art baselines.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2647-2659"},"PeriodicalIF":0.0,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142961306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Latent Weight Quantization for Integerized Training of Deep Neural Networks","authors":"Wen Fei;Wenrui Dai;Liang Zhang;Luoming Zhang;Chenglin Li;Junni Zou;Hongkai Xiong","doi":"10.1109/TPAMI.2025.3527498","DOIUrl":"10.1109/TPAMI.2025.3527498","url":null,"abstract":"Existing methods for integerized training speed up deep learning by using low-bitwidth integerized weights, activations, gradients, and optimizer buffers. However, they overlook the issue of full-precision latent weights, which consume excessive memory to accumulate gradient-based updates for optimizing the integerized weights. In this paper, we propose the first latent weight quantization schema for general integerized training, which minimizes quantization perturbation to the training process via residual quantization with an optimized dual quantizer. We leverage residual quantization to eliminate the correlation between the latent weight and the integerized weight, suppressing quantization noise. We further propose a dual quantizer with an optimal nonuniform codebook to avoid frozen weights and ensure a statistically unbiased training trajectory, as with full-precision latent weights. The codebook is optimized to minimize the disturbance on weight updates under importance guidance and is realized with a three-segment polyline approximation for hardware-friendly implementation. Extensive experiments show that the proposed schema allows integerized training with latent weights as low as 4 bits for various architectures including ResNets, MobileNetV2, and Transformers, and yields negligible performance loss in image classification and text generation. 
Furthermore, we successfully fine-tune Large Language Models with up to 13 billion parameters on one single GPU using the proposed schema.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2816-2832"},"PeriodicalIF":0.0,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142940446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}