Unay Dorken Gallastegi;Hoover Rueda-Chacón;Martin J. Stevens;Vivek K Goyal
{"title":"Absorption-Based, Passive Range Imaging From Hyperspectral Thermal Measurements","authors":"Unay Dorken Gallastegi;Hoover Rueda-Chacón;Martin J. Stevens;Vivek K Goyal","doi":"10.1109/TPAMI.2025.3538711","DOIUrl":"10.1109/TPAMI.2025.3538711","url":null,"abstract":"Passive hyperspectral longwave infrared measurements are remarkably informative about the surroundings. Remote object material and temperature determine the spectrum of thermal radiance, and range, air temperature, and gas concentrations determine how this spectrum is modified by propagation to the sensor. We introduce a passive range imaging method based on computationally separating these phenomena. Previous methods assume hot and highly emitting objects; ranging is more challenging when objects’ temperatures do not deviate greatly from air temperature. Our method jointly estimates range and intrinsic object properties, with explicit consideration of air emission, though reflected light is assumed negligible. The underdetermined nature of the inversion is mitigated by using a parametric model of atmospheric absorption and regularizing for smooth emissivity estimates. To assess where our estimate is likely accurate, we introduce a technique to detect which scene pixels are significantly influenced by reflected downwelling. Monte Carlo simulations demonstrate the importance of regularization, temperature differentials, and availability of many spectral bands. We apply our method to longwave infrared (8–13 <inline-formula><tex-math>$\mu \mathrm{m}$</tex-math></inline-formula>) hyperspectral image data acquired from natural scenes with no active illumination. 
Range features from 15 m to 150 m are recovered, with good qualitative match to lidar data for pixels classified as having negligible reflected downwelling.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"4044-4060"},"PeriodicalIF":0.0,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143258463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
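The attenuation-plus-air-emission idea in the abstract above can be illustrated with a minimal single-band sketch. This is a toy model, not the authors' multi-band joint estimator: the absorption coefficient `kappa`, the radiance values, and the closed-form inversion are all illustrative assumptions.

```python
import math

def transmittance(r, kappa):
    """Beer-Lambert transmittance over a path of length r (m), absorption kappa (1/m)."""
    return math.exp(-kappa * r)

def measured_radiance(r, L_obj, B_air, kappa):
    """Toy single-band model: attenuated object radiance plus air (path) emission."""
    tau = transmittance(r, kappa)
    return tau * L_obj + (1.0 - tau) * B_air

def estimate_range(L_meas, L_obj, B_air, kappa):
    """Invert the toy model for range; ill-conditioned when L_obj is close to B_air."""
    tau = (L_meas - B_air) / (L_obj - B_air)
    return -math.log(tau) / kappa

# Round-trip check at r = 100 m with illustrative radiance values.
L = measured_radiance(100.0, L_obj=1.2, B_air=1.0, kappa=0.01)
r_hat = estimate_range(L, L_obj=1.2, B_air=1.0, kappa=0.01)
```

Note how the denominator `L_obj - B_air` makes the inversion degenerate as object and air temperatures converge, which is exactly why the abstract calls ranging "more challenging when objects' temperatures do not deviate greatly from air temperature."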
Bin Chen;Zhenyu Zhang;Weiqi Li;Chen Zhao;Jiwen Yu;Shijie Zhao;Jie Chen;Jian Zhang
{"title":"Invertible Diffusion Models for Compressed Sensing","authors":"Bin Chen;Zhenyu Zhang;Weiqi Li;Chen Zhao;Jiwen Yu;Shijie Zhao;Jie Chen;Jian Zhang","doi":"10.1109/TPAMI.2025.3538896","DOIUrl":"10.1109/TPAMI.2025.3538896","url":null,"abstract":"While deep neural networks (NNs) significantly advance image compressed sensing (CS) by improving reconstruction quality, the necessity of training current CS NNs from scratch constrains their effectiveness and hampers rapid deployment. Although recent methods utilize pre-trained diffusion models for image reconstruction, they struggle with slow inference and restricted adaptability to CS. To tackle these challenges, this paper proposes <b>I</b>nvertible <b>D</b>iffusion <b>M</b>odels (<b>IDM</b>), a novel, efficient, end-to-end diffusion-based CS method. IDM repurposes a large-scale diffusion sampling process as a reconstruction model, and fine-tunes it end-to-end to recover original images directly from CS measurements, moving beyond the traditional paradigm of one-step noise estimation learning. To enable such memory-intensive end-to-end fine-tuning, we propose a novel two-level invertible design to transform both 1) the multi-step sampling process and 2) the noise estimation U-Net in each step into invertible networks. As a result, most intermediate features are cleared during training to reduce GPU memory usage by up to 93.8%. In addition, we develop a set of lightweight modules to inject measurements into the noise estimator to further facilitate reconstruction. Experiments demonstrate that IDM outperforms existing state-of-the-art CS networks by up to 2.64 dB in PSNR. 
Compared to the recent diffusion-based approach DDNM, our IDM achieves up to 10.09 dB PSNR gain and 14.54 times faster inference.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3992-4006"},"PeriodicalIF":0.0,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143191824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enneng Yang;Li Shen;Zhenyi Wang;Shiwei Liu;Guibing Guo;Xingwei Wang;Dacheng Tao
{"title":"Revisiting Flatness-Aware Optimization in Continual Learning With Orthogonal Gradient Projection","authors":"Enneng Yang;Li Shen;Zhenyi Wang;Shiwei Liu;Guibing Guo;Xingwei Wang;Dacheng Tao","doi":"10.1109/TPAMI.2025.3539019","DOIUrl":"10.1109/TPAMI.2025.3539019","url":null,"abstract":"The goal of continual learning (CL) is to learn from a series of continuously arriving new tasks without forgetting previously learned old tasks. To avoid catastrophic forgetting of old tasks, orthogonal gradient projection (OGP) based CL methods constrain the gradients of new tasks to be orthogonal to the space spanned by old tasks. This strict gradient constraint will limit the learning ability of new tasks, resulting in lower performance on new tasks. In this paper, we first establish a unified framework for OGP-based CL methods. We then revisit OGP-based CL methods from a new perspective on the loss landscape, where we find that when relaxing projection constraints to improve performance on new tasks, the unflatness of the loss landscape can lead to catastrophic forgetting of old tasks. Based on our findings, we propose a new Dual Flatness-aware OGD framework that optimizes the flatness of the loss landscape from both data and weight levels. Our framework consists of three modules: data and weight perturbation, flatness-aware optimization, and gradient projection. Specifically, we first perform perturbations on the task's data and current model weights to make the task's loss reach the worst-case. Next, we optimize the loss and loss landscape on the original data and the worst-case perturbed data to obtain a flatness-aware gradient. Finally, the flatness-aware gradient will update the network in directions orthogonal to the space spanned by the old tasks. 
Extensive experiments on four benchmark datasets show that the framework improves the flatness of the loss landscape and performance on new tasks, and achieves state-of-the-art (SOTA) performance on average accuracy across all tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3895-3907"},"PeriodicalIF":0.0,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143191826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
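The orthogonal gradient projection step common to the OGP-based methods discussed in the abstract above can be sketched in a few lines. This is a generic sketch, not the paper's dual flatness-aware framework; the basis construction from random "old-task" directions is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Orthonormal basis M for the subspace spanned by old-task directions
# (real OGP methods build this basis from stored old-task features or gradients).
A = rng.normal(size=(10, 3))
M, _ = np.linalg.qr(A)          # columns of M are orthonormal

def project_orthogonal(g, M):
    """Remove the component of the new-task gradient g lying in span(M),
    so the update does not interfere with the old-task subspace."""
    return g - M @ (M.T @ g)

g = rng.normal(size=10)
g_proj = project_orthogonal(g, M)
```

The projected gradient is exactly orthogonal to every old-task direction, which is the "strict gradient constraint" the paper argues limits new-task learning and then relaxes with flatness-aware optimization.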
{"title":"Fully-Connected Transformer for Multi-Source Image Fusion","authors":"Xiao Wu;Zi-Han Cao;Ting-Zhu Huang;Liang-Jian Deng;Jocelyn Chanussot;Gemine Vivone","doi":"10.1109/TPAMI.2024.3523364","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3523364","url":null,"abstract":"Multi-source image fusion combines the information from multiple images into a single output, thus improving imaging quality. This topic has attracted great interest in the community. How to integrate information from different sources is still a big challenge, although the existing self-attention based transformer methods can capture spatial and channel similarities. In this paper, we first discuss the mathematical concepts behind the proposed generalized self-attention mechanism, where the existing self-attentions are considered basic forms. The proposed mechanism employs multilinear algebra to drive the development of a novel fully-connected self-attention (FCSA) method to fully exploit local and non-local domain-specific correlations among multi-source images. Moreover, we propose a multi-source image representation embedding it into the FCSA framework as a non-local prior within an optimization problem. Several different fusion problems are unfolded into the proposed fully-connected transformer fusion network (FC-Former). More specifically, the concept of generalized self-attention can promote the potential development of self-attention. Hence, the FC-Former can be viewed as a network model unifying different fusion tasks. 
Compared with state-of-the-art methods, the proposed FC-Former method exhibits robust and superior performance, showing its capability of faithfully preserving information.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"2071-2088"},"PeriodicalIF":0.0,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143184170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fair Representation Learning for Continuous Sensitive Attributes Using Expectation of Integral Probability Metrics","authors":"Insung Kong;Kunwoong Kim;Yongdai Kim","doi":"10.1109/TPAMI.2025.3538915","DOIUrl":"10.1109/TPAMI.2025.3538915","url":null,"abstract":"AI fairness, also known as algorithmic fairness, aims to ensure that algorithms operate without bias or discrimination towards any individual or group. Among various AI algorithms, the Fair Representation Learning (FRL) approach has gained significant interest in recent years. However, existing FRL algorithms have a limitation: they are primarily designed for categorical sensitive attributes and thus cannot be applied to continuous sensitive attributes, such as age or income. In this paper, we propose an FRL algorithm for continuous sensitive attributes. First, we introduce a measure called the Expectation of Integral Probability Metrics (EIPM) to assess the fairness level of representation space for continuous sensitive attributes. We demonstrate that if the distribution of the representation has a low EIPM value, then any prediction head constructed on top of the representation becomes fair, regardless of the selection of the prediction head. Furthermore, EIPM possesses a distinct advantage in that it can be accurately estimated using our proposed estimator with finite samples. Based on these properties, we propose a new FRL algorithm called Fair Representation using EIPM with MMD (FREM). 
Experimental evidence shows that FREM outperforms other baseline methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3784-3795"},"PeriodicalIF":0.0,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143191825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
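FREM instantiates the integral probability metric with MMD. A standard biased MMD estimator with an RBF kernel (a textbook sketch, not the authors' EIPM estimator; `sigma` and the sample data are assumptions) looks like:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of squared MMD (an IPM over an RKHS unit ball) between X and Y."""
    return (rbf_kernel(X, X, sigma).mean()
            - 2.0 * rbf_kernel(X, Y, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
same = mmd2_biased(X, X)        # identical samples: squared MMD is 0
gap = mmd2_biased(X, X + 3.0)   # shifted samples: clearly positive
```

Driving such a discrepancy toward zero across sensitive-attribute values is the generic mechanism by which MMD-based FRL methods make downstream prediction heads insensitive to the attribute.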
Jan Bednarik;Noam Aigerman;Vladimir G. Kim;Siddhartha Chaudhuri;Shaifali Parashar;Mathieu Salzmann;Pascal Fua
{"title":"Temporally-Consistent Surface Reconstruction Using Metrically-Consistent Atlases","authors":"Jan Bednarik;Noam Aigerman;Vladimir G. Kim;Siddhartha Chaudhuri;Shaifali Parashar;Mathieu Salzmann;Pascal Fua","doi":"10.1109/TPAMI.2025.3538776","DOIUrl":"10.1109/TPAMI.2025.3538776","url":null,"abstract":"We propose a method for unsupervised reconstruction of a temporally-consistent sequence of surfaces from a sequence of time-evolving point clouds. It yields dense and semantically meaningful correspondences between frames. We represent the reconstructed surfaces as atlases computed by a neural network, which enables us to establish correspondences between frames. The key to making these correspondences semantically meaningful is to guarantee that the metric tensors computed at corresponding points are as similar as possible. We have devised an optimization strategy that makes our method robust to noise and global motions, without <i>a priori</i> correspondences or pre-alignment steps. As a result, our approach outperforms state-of-the-art ones on several challenging datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3718-3730"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GaussNav: Gaussian Splatting for Visual Navigation","authors":"Xiaohan Lei;Min Wang;Wengang Zhou;Houqiang Li","doi":"10.1109/TPAMI.2025.3538496","DOIUrl":"10.1109/TPAMI.2025.3538496","url":null,"abstract":"In embodied vision, Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment. The primary challenge of IIN arises from the need to recognize the target object across varying viewpoints while ignoring potential distractors. Existing map-based navigation methods typically use Bird’s Eye View (BEV) maps, which lack detailed texture representation of a scene. Consequently, while BEV maps are effective for semantic-level visual navigation, they struggle with instance-level tasks. To this end, we propose a new framework for IIN, Gaussian Splatting for Visual Navigation (GaussNav), which constructs a novel map representation based on 3D Gaussian Splatting (3DGS). The GaussNav framework enables the agent to memorize both the geometry and semantic information of the scene, as well as retain the textural features of objects. By matching renderings of similar objects with the target, the agent can accurately identify, ground, and navigate to the specified object. 
Our GaussNav framework demonstrates a significant performance improvement, with Success weighted by Path Length (SPL) increasing from 0.347 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"4108-4121"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-Time Object Detection","authors":"Yuming Chen;Xinbin Yuan;Jiabao Wang;Ruiqi Wu;Xiang Li;Qibin Hou;Ming-Ming Cheng","doi":"10.1109/TPAMI.2025.3538473","DOIUrl":"10.1109/TPAMI.2025.3538473","url":null,"abstract":"We aim at providing the object detection community with an efficient and performant object detector, termed YOLO-MS. The core design is based on a series of investigations on how multi-branch features of the basic block and convolutions with different kernel sizes affect the detection performance of objects at different scales. The outcome is a new strategy that can significantly enhance multi-scale feature representations of real-time object detectors. To verify the effectiveness of our work, we train our YOLO-MS on the MS COCO dataset from scratch without relying on any other large-scale datasets, like ImageNet or pre-trained weights. Without bells and whistles, our YOLO-MS outperforms the recent state-of-the-art real-time object detectors, including YOLO-v7, RTMDet, and YOLO-v8. Taking the XS version of YOLO-MS as an example, it can achieve an AP score of 42+% on MS COCO, which is about 2% higher than RTMDet with the same model size. Furthermore, our work can also serve as a plug-and-play module for other YOLO models. 
Typically, our method significantly advances the AP<sub>s</sub>, AP<sub>l</sub>, and AP of YOLOv8-N from 18%+, 52%+, and 37%+ to 20%+, 55%+, and 40%+, respectively, with even fewer parameters and MACs.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 6","pages":"4240-4252"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
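The multi-branch, multi-kernel-size idea that YOLO-MS investigates can be caricatured in one dimension. This is a toy sketch, not the YOLO-MS block: the averaging kernels and fusion-by-stacking are illustrative choices only.

```python
import numpy as np

def conv1d_same(x, k):
    """'Same'-padded 1-D correlation of signal x with kernel k (odd length)."""
    p = len(k) // 2
    xp = np.pad(x, p)
    return np.array([(xp[i:i + len(k)] * k).sum() for i in range(len(x))])

def multi_scale_block(x, kernel_sizes=(3, 5, 7)):
    """Run parallel branches with different kernel sizes (different receptive
    fields), then stack the branch outputs for a downstream fusion step."""
    branches = [conv1d_same(x, np.ones(k) / k) for k in kernel_sizes]
    return np.stack(branches)           # shape: (num_branches, len(x))

out = multi_scale_block(np.ones(8))
```

Each branch sees the input at a different spatial scale; in a real detector the branches would be learned convolutions and the fusion a learned combination rather than a plain stack.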
{"title":"Quasi-Metric Learning for Bilateral Person-Job Fit","authors":"Yingpeng Du;Hongzhi Liu;Hengshu Zhu;Yang Song;Zhi Zheng;Zhonghai Wu","doi":"10.1109/TPAMI.2025.3538774","DOIUrl":"10.1109/TPAMI.2025.3538774","url":null,"abstract":"Matching suitable jobs with qualified candidates is crucial for online recruitment. Typically, users (i.e., candidates and employers) have specific expectations in the recruitment market, making them prefer similar jobs or candidates. Metric learning technologies provide a promising way to capture the similarity propagation between candidates and jobs. However, they rely on symmetric distance measures, failing to model users' asymmetric relationships in two-way selection. Additionally, users' behaviors (e.g., candidates) are highly affected by the feedback from their counterparts (e.g., employers), which can hardly be captured by the existing person-job fit methods that primarily explore homogeneous and undirected graphs. To address these problems, we propose a quasi-metric learning framework to capture the similarity propagation between candidates and jobs while modeling their asymmetric relations for bilateral person-job fit. Specifically, we propose a quasi-metric space that not only satisfies the triangle inequality to capture the fine-grained similarity between candidates and jobs, but also incorporates a tailored asymmetric measure to model users' two-way selection process in online recruitment. More importantly, the proposed quasi-metric learning framework can theoretically model recruitment rules from <i>similarity</i> and <i>competitiveness</i> perspectives, making it seamlessly align with bilateral person-job fit scenarios. To explore the mutual effects of two-sided users, we first organize candidates, employers, and their differently typed interactions into a heterogeneous relation graph, and then propose a relation-aware graph convolution network to capture users' mutual effects through their bilateral behaviors. Extensive experiments on several real-world datasets demonstrate the effectiveness of the proposed methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3947-3960"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
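A quasi-metric in the sense used above is an asymmetric distance that still satisfies the triangle inequality. One simple construction (illustrative only, not the paper's tailored measure; `alpha` and the example vectors are assumptions) is:

```python
import numpy as np

def quasi_metric(x, y, alpha=1.0):
    """Symmetric L1 distance plus a one-sided penalty for coordinates where y exceeds x.
    d(x, x) = 0 and the triangle inequality holds, but d(x, y) != d(y, x) in general."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.abs(x - y).sum() + alpha * np.maximum(y - x, 0.0).sum()

d_xy = quasi_metric([0.0, 0.0], [1.0, 2.0])   # 3 + 3 = 6: moving "up" costs extra
d_yx = quasi_metric([1.0, 2.0], [0.0, 0.0])   # 3 + 0 = 3

# Spot-check the triangle inequality on random triples.
rng = np.random.default_rng(1)
triangle_ok = all(
    quasi_metric(x, z) <= quasi_metric(x, y) + quasi_metric(y, z) + 1e-9
    for x, y, z in (rng.normal(size=(3, 4)) for _ in range(200))
)
```

The asymmetry is the point: in the person-job setting, the "distance" from a candidate to a demanding job need not equal the distance from that job back to the candidate, while the triangle inequality still supports similarity propagation.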
{"title":"Transferable Unintentional Action Localization With Language-Guided Intention Translation","authors":"Jinglin Xu;Yongming Rao;Jie Zhou;Jiwen Lu","doi":"10.1109/TPAMI.2025.3538675","DOIUrl":"10.1109/TPAMI.2025.3538675","url":null,"abstract":"Unintentional action localization (UAL) is a challenging task that requires reasoning about action intention clues to detect the temporal locations of unintentional action occurrences in real-world videos. Previous efforts usually treated this task as a dense binary classification problem and did not fully explore the relationships between intention clues and unintentional actions, resulting in unsatisfactory performance on open-set scenarios during inference. In this paper, we propose a Transferable Unintentional Action Localization framework by introducing language-guided intention translation, which explicitly formulates unintentional action localization as an open-set localization problem. Our framework constructs a transferable reasoning model guided by natural languages to translate the action intention of the entire video, which generates natural and powerful supervision signals for reconstructing complete action intention clues to address the problem of unintentional action localization. Since a video of a failed action is composed of intentional and unintentional parts connected by a transient action transition, our transferable reasoning model employs a transformer architecture to transfer knowledge between intentional and unintentional parts for learning complementary semantic representations of these two parts, completing the action intention clue in an implicitly supervised manner. We also present a dense voting scheme for detecting the action transition from intentional to unintentional using discriminative representations incorporating action intention clues. 
Extensive experiments demonstrate that our framework outperforms representative unintentional action localization methods in a wide range of open-set scenarios. In addition, we create a new unintentional sports video dataset, FS-Falls, and extend our framework from in-the-wild scenarios to competitive sports to demonstrate better generalization ability. We hope this work will provide a new perspective on creating powerful representations with complete action intention priors, which will help us better understand human action and capture underlying intention clues in real-world videos.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3863-3877"},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}