Yanpeng Zhao;Yiwei Hao;Siyu Gao;Yunbo Wang;Xiaokang Yang
{"title":"Dynamic Scene Understanding Through Object-Centric Voxelization and Neural Rendering","authors":"Yanpeng Zhao;Yiwei Hao;Siyu Gao;Yunbo Wang;Xiaokang Yang","doi":"10.1109/TPAMI.2025.3539866","DOIUrl":"10.1109/TPAMI.2025.3539866","url":null,"abstract":"Learning object-centric representations from unsupervised videos is challenging. Unlike most previous approaches that focus on decomposing 2D images, we present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers per-object occupancy probabilities at individual spatial locations. These voxel features evolve through a canonical-space deformation function and are optimized in an inverse rendering pipeline with a compositional NeRF. Additionally, our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids. DynaVol-S significantly outperforms existing models in both novel view synthesis and unsupervised decomposition tasks for dynamic scenes. By jointly considering geometric structures and semantic features, it effectively addresses challenging real-world scenarios involving complex object interactions. Furthermore, once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve, such as novel scene generation through editing geometric shapes or manipulating the motion trajectories of objects.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"4215-4231"},"PeriodicalIF":0.0,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10877772","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143258496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhongzhan Huang;Shanshan Zhong;Pan Zhou;Shanghua Gao;Marinka Zitnik;Liang Lin
{"title":"A Causality-Aware Paradigm for Evaluating Creativity of Multimodal Large Language Models","authors":"Zhongzhan Huang;Shanshan Zhong;Pan Zhou;Shanghua Gao;Marinka Zitnik;Liang Lin","doi":"10.1109/TPAMI.2025.3539433","DOIUrl":"10.1109/TPAMI.2025.3539433","url":null,"abstract":"Recently, numerous benchmarks have been developed to evaluate the logical reasoning abilities of large language models (LLMs). However, assessing the equally important creative capabilities of LLMs is challenging due to the subjective, diverse, and data-scarce nature of creativity, especially in multimodal scenarios. In this paper, we consider the comprehensive pipeline for evaluating the creativity of multimodal LLMs, with a focus on suitable evaluation platforms and methodologies. First, we find the Oogiri game—a creativity-driven task requiring humor, associative thinking, and the ability to produce unexpected responses to text, images, or both. This game aligns well with the input-output structure of modern multimodal LLMs and benefits from a rich repository of high-quality, human-annotated creative responses, making it an ideal platform for studying LLM creativity. Next, beyond using the Oogiri game for standard evaluations like ranking and selection, we propose LoTbench, an interactive, causality-aware evaluation framework, to further address some intrinsic risks in standard evaluations, such as information leakage and limited interpretability. The proposed LoTbench not only quantifies LLM creativity more effectively but also visualizes the underlying creative thought processes. Our results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable. Furthermore, we observe a strong correlation between results from the multimodal cognition benchmark MMMU and LoTbench, but only a weak connection with traditional creativity metrics. This suggests that LoTbench better aligns with human cognitive theories, highlighting cognition as a critical foundation in the early stages of creativity and enabling the bridging of diverse concepts. Project Page.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3830-3846"},"PeriodicalIF":0.0,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143258462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unay Dorken Gallastegi;Hoover Rueda-Chacón;Martin J. Stevens;Vivek K Goyal
{"title":"Absorption-Based, Passive Range Imaging From Hyperspectral Thermal Measurements","authors":"Unay Dorken Gallastegi;Hoover Rueda-Chacón;Martin J. Stevens;Vivek K Goyal","doi":"10.1109/TPAMI.2025.3538711","DOIUrl":"10.1109/TPAMI.2025.3538711","url":null,"abstract":"Passive hyperspectral longwave infrared measurements are remarkably informative about the surroundings. Remote object material and temperature determine the spectrum of thermal radiance, and range, air temperature, and gas concentrations determine how this spectrum is modified by propagation to the sensor. We introduce a passive range imaging method based on computationally separating these phenomena. Previous methods assume hot and highly emitting objects; ranging is more challenging when objects’ temperatures do not deviate greatly from air temperature. Our method jointly estimates range and intrinsic object properties, with explicit consideration of air emission, though reflected light is assumed negligible. Inversion being underdetermined is mitigated by using a parametric model of atmospheric absorption and regularizing for smooth emissivity estimates. To assess where our estimate is likely accurate, we introduce a technique to detect which scene pixels are significantly influenced by reflected downwelling. Monte Carlo simulations demonstrate the importance of regularization, temperature differentials, and availability of many spectral bands. We apply our method to longwave infrared (8–13 <inline-formula><tex-math>$mathrm{mu }mathrm{m}$</tex-math></inline-formula>) hyperspectral image data acquired from natural scenes with no active illumination. Range features from 15 m to 150 m are recovered, with good qualitative match to lidar data for pixels classified as having negligible reflected downwelling.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"4044-4060"},"PeriodicalIF":0.0,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143258463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bin Chen;Zhenyu Zhang;Weiqi Li;Chen Zhao;Jiwen Yu;Shijie Zhao;Jie Chen;Jian Zhang
{"title":"Invertible Diffusion Models for Compressed Sensing","authors":"Bin Chen;Zhenyu Zhang;Weiqi Li;Chen Zhao;Jiwen Yu;Shijie Zhao;Jie Chen;Jian Zhang","doi":"10.1109/TPAMI.2025.3538896","DOIUrl":"10.1109/TPAMI.2025.3538896","url":null,"abstract":"While deep neural networks (NNs) significantly advance image compressed sensing (CS) by improving reconstruction quality, the necessity of training current CS NNs from scratch constrains their effectiveness and hampers rapid deployment. Although recent methods utilize pre-trained diffusion models for image reconstruction, they struggle with slow inference and restricted adaptability to CS. To tackle these challenges, this paper proposes <bold>I</b>nvertible <bold>D</b>iffusion <bold>M</b>odels (<bold>IDM</b>), a novel efficient, end-to-end diffusion-based CS method. IDM repurposes a large-scale diffusion sampling process as a reconstruction model, and fine-tunes it end-to-end to recover original images directly from CS measurements, moving beyond the traditional paradigm of one-step noise estimation learning. To enable such memory-intensive end-to-end fine-tuning, we propose a novel two-level invertible design to transform both 1) multi-step sampling process and 2) noise estimation U-Net in each step into invertible networks. As a result, most intermediate features are cleared during training to reduce up to 93.8% GPU memory. In addition, we develop a set of lightweight modules to inject measurements into noise estimator to further facilitate reconstruction. Experiments demonstrate that IDM outperforms existing state-of-the-art CS networks by up to 2.64 dB in PSNR. Compared to the recent diffusion-based approach DDNM, our IDM achieves up to 10.09 dB PSNR gain and 14.54 times faster inference.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3992-4006"},"PeriodicalIF":0.0,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143191824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enneng Yang;Li Shen;Zhenyi Wang;Shiwei Liu;Guibing Guo;Xingwei Wang;Dacheng Tao
{"title":"Revisiting Flatness-Aware Optimization in Continual Learning With Orthogonal Gradient Projection","authors":"Enneng Yang;Li Shen;Zhenyi Wang;Shiwei Liu;Guibing Guo;Xingwei Wang;Dacheng Tao","doi":"10.1109/TPAMI.2025.3539019","DOIUrl":"10.1109/TPAMI.2025.3539019","url":null,"abstract":"The goal of continual learning (CL) is to learn from a series of continuously arriving new tasks without forgetting previously learned old tasks. To avoid catastrophic forgetting of old tasks, orthogonal gradient projection (OGP) based CL methods constrain the gradients of new tasks to be orthogonal to the space spanned by old tasks. This strict gradient constraint will limit the learning ability of new tasks, resulting in lower performance on new tasks. In this paper, we first establish a unified framework for OGP-based CL methods. We then revisit OGP-based CL methods from a new perspective on the loss landscape, where we find that when relaxing projection constraints to improve performance on new tasks, the unflatness of the loss landscape can lead to catastrophic forgetting of old tasks. Based on our findings, we propose a new Dual Flatness-aware OGD framework that optimizes the flatness of the loss landscape from both data and weight levels. Our framework consists of three modules: data and weight perturbation, flatness-aware optimization, and gradient projection. Specifically, we first perform perturbations on the task's data and current model weights to make the task's loss reach the worst-case. Next, we optimize the loss and loss landscape on the original data and the worst-case perturbed data to obtain a flatness-aware gradient. Finally, the flatness-aware gradient will update the network in directions orthogonal to the space spanned by the old tasks. Extensive experiments on four benchmark datasets show that the framework improves the flatness of the loss landscape and performance on new tasks, and achieves state-of-the-art (SOTA) performance on average accuracy across all tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3895-3907"},"PeriodicalIF":0.0,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143191826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fully-Connected Transformer for Multi-Source Image Fusion","authors":"Xiao Wu;Zi-Han Cao;Ting-Zhu Huang;Liang-Jian Deng;Jocelyn Chanussot;Gemine Vivone","doi":"10.1109/TPAMI.2024.3523364","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3523364","url":null,"abstract":"Multi-source image fusion combines the information coming from multiple images into one data, thus improving imaging quality. This topic has aroused great interest in the community. How to integrate information from different sources is still a big challenge, although the existing self-attention based transformer methods can capture spatial and channel similarities. In this paper, we first discuss the mathematical concepts behind the proposed generalized self-attention mechanism, where the existing self-attentions are considered basic forms. The proposed mechanism employs multilinear algebra to drive the development of a novel fully-connected self-attention (FCSA) method to fully exploit local and non-local domain-specific correlations among multi-source images. Moreover, we propose a multi-source image representation embedding it into the FCSA framework as a non-local prior within an optimization problem. Some different fusion problems are unfolded into the proposed fully-connected transformer fusion network (FC-Former). More specifically, the concept of generalized self-attention can promote the potential development of self-attention. Hence, the FC-Former can be viewed as a network model unifying different fusion tasks. Compared with state-of-the-art methods, the proposed FC-Former method exhibits robust and superior performance, showing its capability of faithfully preserving information.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"2071-2088"},"PeriodicalIF":0.0,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143184170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fair Representation Learning for Continuous Sensitive Attributes Using Expectation of Integral Probability Metrics","authors":"Insung Kong;Kunwoong Kim;Yongdai Kim","doi":"10.1109/TPAMI.2025.3538915","DOIUrl":"10.1109/TPAMI.2025.3538915","url":null,"abstract":"AI fairness, also known as algorithmic fairness, aims to ensure that algorithms operate without bias or discrimination towards any individual or group. Among various AI algorithms, the Fair Representation Learning (FRL) approach has gained significant interest in recent years. However, existing FRL algorithms have a limitation: they are primarily designed for categorical sensitive attributes and thus cannot be applied to continuous sensitive attributes, such as age or income. In this paper, we propose an FRL algorithm for continuous sensitive attributes. First, we introduce a measure called the Expectation of Integral Probability Metrics (EIPM) to assess the fairness level of representation space for continuous sensitive attributes. We demonstrate that if the distribution of the representation has a low EIPM value, then any prediction head constructed on the top of the representation become fair, regardless of the selection of the prediction head. Furthermore, EIPM possesses a distinguished advantage in that it can be accurately estimated using our proposed estimator with finite samples. Based on these properties, we propose a new FRL algorithm called Fair Representation using EIPM with MMD (FREM). Experimental evidences show that FREM outperforms other baseline methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3784-3795"},"PeriodicalIF":0.0,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143191825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jan Bednarik;Noam Aigerman;Vladimir G. Kim;Siddhartha Chaudhuri;Shaifali Parashar;Mathieu Salzmann;Pascal Fua
{"title":"Temporally-Consistent Surface Reconstruction Using Metrically-Consistent Atlases","authors":"Jan Bednarik;Noam Aigerman;Vladimir G. Kim;Siddhartha Chaudhuri;Shaifali Parashar;Mathieu Salzmann;Pascal Fua","doi":"10.1109/TPAMI.2025.3538776","DOIUrl":"10.1109/TPAMI.2025.3538776","url":null,"abstract":"We propose a method for unsupervised reconstruction of a temporally-consistent sequence of surfaces from a sequence of time-evolving point clouds. It yields dense and semantically meaningful correspondences between frames. We represent the reconstructed surfaces as atlases computed by a neural network, which enables us to establish correspondences between frames. The key to making these correspondences semantically meaningful is to guarantee that the metric tensors computed at corresponding points are as similar as possible. We have devised an optimization strategy that makes our method robust to noise and global motions, without <italic>a priori</i> correspondences or pre-alignment steps. As a result, our approach outperforms state-of-the-art ones on several challenging datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3718-3730"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GaussNav: Gaussian Splatting for Visual Navigation","authors":"Xiaohan Lei;Min Wang;Wengang Zhou;Houqiang Li","doi":"10.1109/TPAMI.2025.3538496","DOIUrl":"10.1109/TPAMI.2025.3538496","url":null,"abstract":"In embodied vision, Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment. The primary challenge of IIN arises from the need to recognize the target object across varying viewpoints while ignoring potential distractors. Existing map-based navigation methods typically use Bird’s Eye View (BEV) maps, which lack detailed texture representation of a scene. Consequently, while BEV maps are effective for semantic-level visual navigation, they are struggling for instance-level tasks. To this end, we propose a new framework for IIN, Gaussian Splatting for Visual Navigation (GaussNav), which constructs a novel map representation based on 3D Gaussian Splatting (3DGS). The GaussNav framework enables the agent to memorize both the geometry and semantic information of the scene, as well as retain the textural features of objects. By matching renderings of similar objects with the target, the agent can accurately identify, ground, and navigate to the specified object. Our GaussNav framework demonstrates a significant performance improvement, with Success weighted by Path Length (SPL) increasing from 0.347 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"4108-4121"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quasi-Metric Learning for Bilateral Person-Job Fit","authors":"Yingpeng Du;Hongzhi Liu;Hengshu Zhu;Yang Song;Zhi Zheng;Zhonghai Wu","doi":"10.1109/TPAMI.2025.3538774","DOIUrl":"10.1109/TPAMI.2025.3538774","url":null,"abstract":"Matching suitable jobs with qualified candidates is crucial for online recruitment. Typically, users (i.e., candidates and employers) have specific expectations in the recruitment market, making them prefer similar jobs or candidates. Metric learning technologies provide a promising way to capture the similarity propagation between candidates and jobs. However, they rely on symmetric distance measures, failing to model users' asymmetric relationships in two-way selection. Additionally, users' behaviors (e.g., candidates) are highly affected by the feedback from their counterparts (e.g., employers), which can hardly be captured by the existing person-job fit methods that primarily explore homogeneous and undirected graphs. To address these problems, we propose a quasi-metric learning framework to capture the similarity propagation between candidates and jobs while modeling their asymmetric relations for bilateral person-job fit. Specifically, we propose a quasi-metric space that not only satisfies the triangle inequality to capture the fine-grained similarity between candidates and jobs, but also incorporates a tailored asymmetric measure to model users. two-way selection process in online recruitment. More importantly, the proposed quasi-metric learning framework can theoretically model recruitment rules from <italic>similarity</i> and <italic>competitiveness</i> perspectives, making it seamlessly align with bilateral person-job fit scenarios. To explore the mutual effects of two-sided users, we first organize candidates, employers, and their different-typed interactions into a heterogeneous relation graph, and then propose a relation-aware graph convolution network to capture users. mutual effects through their bilateral behaviors. Extensive experiments on several real-world datasets demonstrate the effectiveness of the proposed methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3947-3960"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}