{"title":"Rapid Salient Object Detection With Difference Convolutional Neural Networks","authors":"Zhuo Su;Li Liu;Matthias Müller;Jiehua Zhang;Diana Wofk;Ming-Ming Cheng;Matti Pietikäinen","doi":"10.1109/TPAMI.2025.3583968","DOIUrl":"10.1109/TPAMI.2025.3583968","url":null,"abstract":"This paper addresses the challenge of deploying salient object detection (SOD) on resource-constrained devices with real-time performance. While recent advances in deep neural networks have improved SOD, existing top-leading models are computationally expensive. We propose an efficient network design that combines traditional wisdom on SOD and the representation power of modern CNNs. Like biologically-inspired classical SOD methods relying on computing contrast cues to determine saliency of image regions, our model leverages Pixel Difference Convolutions (PDCs) to encode the feature contrasts. Differently, PDCs are incorporated in a CNN architecture so that the valuable contrast cues are extracted from rich feature maps. For efficiency, we introduce a difference convolution reparameterization (DCR) strategy that embeds PDCs into standard convolutions, eliminating computation and parameters at inference. Additionally, we introduce SpatioTemporal Difference Convolution (STDC) for video SOD, enhancing the standard 3D convolution with spatiotemporal contrast capture. Our models, SDNet for image SOD and STDNet for video SOD, achieve significant improvements in efficiency-accuracy trade-offs. On a Jetson Orin device, our models with <inline-formula><tex-math>$< $</tex-math></inline-formula> 1M parameters operate at 46 FPS and 150 FPS on streamed images and videos, surpassing the second-best lightweight models in our experiments by more than <inline-formula><tex-math>$2times$</tex-math></inline-formula> and <inline-formula><tex-math>$3times$</tex-math></inline-formula> in speed with superior accuracy.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"9061-9077"},"PeriodicalIF":18.6,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144565573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CNN2GNN: How to Bridge CNN With GNN","authors":"Ziheng Jiao;Hongyuan Zhang;Xuelong Li","doi":"10.1109/TPAMI.2025.3583357","DOIUrl":"10.1109/TPAMI.2025.3583357","url":null,"abstract":"Thanks to extracting the intra-sample representation, the convolution neural network (CNN) has achieved excellent performance in vision tasks. However, its numerous convolutional layers take a higher training expense. Recently, graph neural networks (GNN), a bilinear model, have succeeded in exploring the underlying topological relationship among the graph data with a few graph neural layers. Unfortunately, due to the lack of graph structure and high-cost inference on large-scale scenarios, it cannot be directly utilized on non-graph data. Inspired by these complementary strengths and weaknesses, <italic>we discuss a natural question, how to bridge these two heterogeneous networks?</i> In this paper, we propose a novel CNN2GNN framework to unify CNN and GNN together via distillation. First, to break the limitations of GNN, we design a differentiable sparse graph learning module as the head of the networks. It can dynamically learn the graph for inductive learning. Then, a response-based distillation is introduced to transfer the knowledge and bridge these two heterogeneous networks. Notably, due to extracting the intra-sample representation of a single instance and the topological relationship among the datasets simultaneously, the performance of the distilled “boosted” two-layer GNN on Mini-ImageNet is much higher than CNN containing dozens of layers, such as ResNet152.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"9367-9374"},"PeriodicalIF":18.6,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144562425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal Feature Matters: A Framework for Diffusion Model Quantization","authors":"Yushi Huang;Ruihao Gong;Xianglong Liu;Jing Liu;Yuhang Li;Jiwen Lu;Dacheng Tao","doi":"10.1109/TPAMI.2025.3585692","DOIUrl":"10.1109/TPAMI.2025.3585692","url":null,"abstract":"Diffusion models, widely used for image generation, face significant challenges related to their broad applicability due to prolonged inference times and high memory demands. Efficient Post-Training Quantization (PTQ) is crucial to address these issues. However, unlike traditional models, diffusion models critically rely on the time-step for the multi-round denoising. Typically, each time-step is encoded into a hypersensitive temporal feature by several modules. Despite this, existing PTQ methods do not optimize these modules individually. Instead, they employ unsuitable reconstruction objectives and complex calibration methods, leading to significant disturbances in the temporal feature and denoising trajectory, as well as reduced compression efficiency. To address these challenges, we introduce a novel quantization framework that includes three strategies: 1) <italic>TIB-based Maintenance</i>: Based on our innovative Temporal Information Block (TIB) definition, Temporal Information-aware Reconstruction (TIAR) and Finite Set Calibration (FSC) are developed to efficiently align original temporal features. 2) <italic>Cache-based Maintenance</i>: Instead of indirect and complex optimization for the related modules, pre-computing and caching quantized counterparts of temporal features are developed to minimize errors. 3) <italic>Disturbance-aware Selection</i>: Employ temporal feature errors to guide a fine-grained selection between the two maintenance strategies for further disturbance reduction. This framework preserves most of the temporal information and ensures high-quality end-to-end generation. Extensive testing on various datasets, diffusion models and hardware confirms our superior performance and acceleration.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8823-8837"},"PeriodicalIF":18.6,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144562427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BiVM: Accurate Binarized Neural Network for Efficient Video Matting","authors":"Haotong Qin;Xianglong Liu;Xudong Ma;Lei Ke;Yulun Zhang;Jie Luo;Michele Magno","doi":"10.1109/TPAMI.2025.3584928","DOIUrl":"10.1109/TPAMI.2025.3584928","url":null,"abstract":"Deep neural networks for real-time video matting suffer significant computational limitations on edge devices, hindering their adoption in widespread applications such as online conferences and short-form video production. Binarization emerges as one of the most common compression approaches with compact 1-bit parameters and efficient bitwise operations. However, accuracy and efficiency limitations exist in the binarized video matting network due to its degenerated encoder and redundant decoder. Following a theoretical analysis based on the information bottleneck principle, the limitations are mainly caused by the degradation of prediction-relevant information in the intermediate features and the redundant computation in prediction-irrelevant areas. We present BiVM, an accurate and resource-efficient Binarized neural network for Video Matting. First, we present a series of binarized computation structures with elastic shortcuts and evolvable topologies, enabling the constructed encoder backbone to extract high-quality representations from input videos for accurate prediction. Second, we sparse the intermediate feature of the binarized decoder by masking homogeneous parts, allowing the decoder to focus on representation with diverse details while alleviating the computation burden for efficient inference. Furthermore, we construct a localized binarization-aware mimicking framework with the information-guided strategy, prompting matting-related representation in fullprecision counterparts to be accurately and fully utilized. Comprehensive experiments show that the proposed BiVM surpasses alternative binarized video matting networks, including state-of-the-art (SOTA) binarization methods, by a substantial margin. Moreover, our BiVM achieves significant savings of 14.3x and 21.6x in computation and storage costs, respectively. We also evaluate BiVM on ARM CPU hardware.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"9250-9265"},"PeriodicalIF":18.6,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144546910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Kernelized Hypergraph Neural Networks","authors":"Yifan Feng;Yifan Zhang;Shihui Ying;Shaoyi Du;Yue Gao","doi":"10.1109/TPAMI.2025.3585179","DOIUrl":"10.1109/TPAMI.2025.3585179","url":null,"abstract":"Hypergraph Neural Networks (HGNNs) have attracted much attention for high-order structural data learning. Existing methods mainly focus on simple mean-based aggregation or manually combining multiple aggregations to capture multiple information on hypergraphs. However, those methods inherently lack continuous non-linear modeling ability and are sensitive to varied distributions. Although some kernel-based aggregations on GNNs and CNNs can capture non-linear patterns to some degree, those methods are restricted in the low-order correlation and may cause unstable computation in training. In this work, we introduce Kernelized Hypergraph Neural Networks (KHGNN) and its variant, Half-Kernelized Hypergraph Neural Networks (H-KHGNN), which synergize mean-based and max-based aggregation functions to enhance representation learning on hypergraphs. KHGNN’s kernelized aggregation strategy adaptively captures both semantic and structural information via learnable parameters, offering a mathematically grounded blend of kernelized aggregation approaches for comprehensive feature extraction. H-KHGNN addresses the challenge of overfitting in less intricate hypergraphs by employing non-linear aggregation selectively in the vertex-to-hyperedge message-passing process, thus reducing model complexity. Our theoretical contributions reveal a bounded gradient for kernelized aggregation, ensuring stability during training and inference. Empirical results demonstrate that KHGNN and H-KHGNN outperform state-of-the-art models across 10 graph/hypergraph datasets, with ablation studies demonstrating the effectiveness and computational stability of our method.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8938-8954"},"PeriodicalIF":18.6,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144547172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Efficient and Effective Trajectories for Differential Equation-Based Image Restoration","authors":"Zhiyu Zhu;Jinhui Hou;Hui Liu;Huanqiang Zeng;Junhui Hou","doi":"10.1109/TPAMI.2025.3584921","DOIUrl":"10.1109/TPAMI.2025.3584921","url":null,"abstract":"The differential equation-based image restoration approach aims to establish learnable trajectories connecting high-quality images to a tractable distribution, e.g., low-quality images or a Gaussian distribution. In this paper, we reformulate the trajectory optimization of this kind of method, focusing on enhancing both reconstruction quality and efficiency. Initially, we navigate effective restoration paths through a reinforcement learning process, gradually steering potential trajectories toward the most precise options. Additionally, to mitigate the considerable computational burden associated with iterative sampling, we propose cost-aware trajectory distillation to streamline complex paths into several manageable steps with adaptable sizes. Moreover, we fine-tune a foundational diffusion model (FLUX) with 12B parameters by using our algorithms, producing a unified framework for handling 7 kinds of image restoration tasks. Extensive experiments showcase the <italic>significant</i> superiority of the proposed method, achieving a maximum PSNR improvement of 2.1 dB over state-of-the-art methods, while also greatly enhancing visual perceptual quality.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"9150-9168"},"PeriodicalIF":18.6,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144533132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating Synthetic-to-Real Transfer Robustness for Stereo Matching and Optical Flow Estimation","authors":"Jiawei Zhang;Jiahe Li;Lei Huang;Haonan Luo;Xiaohan Yu;Lin Gu;Jin Zheng;Xiao Bai","doi":"10.1109/TPAMI.2025.3584847","DOIUrl":"10.1109/TPAMI.2025.3584847","url":null,"abstract":"With advancements in robust stereo matching and optical flow estimation networks, models pre-trained on synthetic data demonstrate strong robustness to unseen domains. However, their robustness can be seriously degraded when fine-tuning them in real-world scenarios. This paper investigates fine-tuning stereo matching and optical flow estimation networks without compromising their robustness to unseen domains. Specifically, we divide the pixels into consistent and inconsistent regions by comparing Ground Truth (GT) with Pseudo Label (PL) and demonstrate that the imbalance learning of consistent and inconsistent regions in GT causes robustness degradation. Based on our analysis, we propose the DKT framework, which utilizes PL to balance the learning of different regions in GT. The core idea is to utilize an exponential moving average (EMA) teacher to measure what the student network has learned and dynamically adjust the learning regions. We further propose the DKT++ framework, which improves target-domain performances and network robustness by applying slow-fast update teachers to generate more accurate PL, introducing the unlabeled data and synthetic data. We integrate our frameworks with state-of-the-art networks and evaluate their effectiveness on several real-world datasets. Extensive experiments show that our method effectively preserves the robustness of stereo matching and optical flow networks during fine-tuning.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"9113-9129"},"PeriodicalIF":18.6,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144533131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distilling the Unknown to Unveil Certainty","authors":"Zhilin Zhao;Longbing Cao;Yixuan Zhang;Kun-Yu Lin;Wei-Shi Zheng","doi":"10.1109/TPAMI.2025.3584386","DOIUrl":"10.1109/TPAMI.2025.3584386","url":null,"abstract":"Out-of-distribution (OOD) detection is critical for identifying test samples that deviate from in-distribution (ID) data, ensuring network robustness and reliability. This paper presents a flexible framework for OOD knowledge distillation that extracts OOD-sensitive information from a network to develop a binary classifier capable of distinguishing between ID and OOD samples in both scenarios, with and without access to training ID data. To accomplish this, we introduce Confidence Amendment (CA), an innovative methodology that transforms an OOD sample into an ID one while progressively amending prediction confidence derived from the network to enhance OOD sensitivity. This approach enables the simultaneous synthesis of both ID and OOD samples, each accompanied by an adjusted prediction confidence, thereby facilitating the training of a binary classifier sensitive to OOD. Theoretical analysis provides bounds on the generalization error of the binary classifier, demonstrating the pivotal role of confidence amendment in enhancing OOD sensitivity. Extensive experiments spanning various datasets and network architectures confirm the efficacy of the proposed method in detecting OOD samples.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"9232-9249"},"PeriodicalIF":18.6,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144520684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reinforcement Learning With LLMs Interaction for Distributed Diffusion Model Services","authors":"Hongyang Du;Ruichen Zhang;Dusit Niyato;Jiawen Kang;Zehui Xiong;Shuguang Cui;Xuemin Shen;Dong In Kim","doi":"10.1109/TPAMI.2025.3584698","DOIUrl":"10.1109/TPAMI.2025.3584698","url":null,"abstract":"Distributed Artificial Intelligence-Generated Content (AIGC) has attracted significant attention, but two key challenges remain: maximizing subjective Quality of Experience (QoE) and improving energy efficiency, which are particularly pronounced in widely adopted Generative Diffusion Model (GDM)-based image generation services. In this paper, we propose a novel user-centric Interactive AI (IAI) approach for service management, with a distributed GDM-based AIGC framework that emphasizes efficient and cooperative deployment. The proposed method restructures the GDM inference process by allowing users with semantically similar prompts to share parts of the denoising chain. Furthermore, to maximize the users’ subjective QoE, we propose an IAI approach, i.e., Reinforcement Learning With Large Language Models Interaction (RLLI), which utilizes Large Language Model (LLM)-empowered generative agents to replicate users interactions, providing real-time and subjective QoE feedback aligned with diverse user personalities. Lastly, we present the GDM-based Deep Deterministic Policy Gradient (G-DDPG) algorithm, adapted to the proposed RLLI framework, to allocate communication and computing resources effectively while accounting for subjective user traits and dynamic wireless conditions. Simulation results demonstrate that G-DDPG improves total QoE by 15% compared with the standard DDPG algorithm.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8838-8855"},"PeriodicalIF":18.6,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144520687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration","authors":"Yifan Zhang;Junhui Hou;Siyu Ren;Jinjian Wu;Yixuan Yuan;Guangming Shi","doi":"10.1109/TPAMI.2025.3584625","DOIUrl":"10.1109/TPAMI.2025.3584625","url":null,"abstract":"This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes. Specifically, our approach, namely NCLR, focuses on 2D-3D neural calibration, a novel pretext task that estimates the rigid pose aligning camera and LiDAR coordinate systems. First, we propose the learnable transformation alignment to bridge the domain gap between image and point cloud data, converting features into a unified representation space for effective comparison and matching. Second, we identify the overlapping area between the image and point cloud with the fused features. Third, we establish dense 2D-3D correspondences to estimate the rigid pose. The framework not only learns fine-grained matching from points to pixels but also achieves alignment of the image and point cloud at a holistic level, understanding the LiDAR-to-camera extrinsic parameters. We demonstrate the efficacy of NCLR by applying the pre-trained backbone to downstream tasks, such as LiDAR-based 3D semantic segmentation, object detection, and panoptic segmentation. Comprehensive experiments on various datasets illustrate the superiority of NCLR over existing self-supervised methods. The results confirm that joint learning from different modalities significantly enhances the network’s understanding abilities and effectiveness of learned representation.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"9201-9216"},"PeriodicalIF":18.6,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144520685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}