Title: Revealing the Dark Side of Non-Local Attention in Single Image Super-Resolution
Authors: Jian-Nan Su; Guodong Fan; Min Gan; Guang-Yong Chen; Wenzhong Guo; C. L. Philip Chen
DOI: 10.1109/TPAMI.2024.3457790
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11476-11490. Published 2024-09-10.
Abstract: Single Image Super-Resolution (SISR) aims to reconstruct a high-resolution image from its corresponding low-resolution input. A common technique for enhancing reconstruction quality is Non-Local Attention (NLA), which leverages self-similar texture patterns in images. However, we have made a novel finding that challenges the prevailing wisdom: NLA can be detrimental to SISR and can even produce severely distorted textures. For example, when dealing with severely degraded textures, NLA may generate unrealistic results because the non-local texture patterns it matches are inconsistent. This problem is overlooked by existing works, which measure only the average reconstruction quality of the whole image without considering the potential risks of using NLA. To address this issue, we propose a new perspective for evaluating the reconstruction quality of NLA that focuses on the sub-pixel level, matching the pixel-wise fusion manner of NLA. From this perspective, we derive an approximate upper bound on the reconstruction performance of NLA, which guides us in designing a concise yet effective Texture-Fidelity Strategy (TFS) to mitigate the degradation caused by NLA. Moreover, the proposed TFS can be conveniently integrated into existing NLA-based SISR models as a general building block. Based on the TFS, we develop a Deep Texture-Fidelity Network (DTFN), which achieves state-of-the-art performance for SISR. Our code and a pre-trained DTFN are available on GitHub for verification.

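The "pixel-wise fusion manner" the abstract refers to is the defining operation of NLA: each output position is a softmax-weighted sum of features from every other position. The PyTorch sketch below shows a minimal, generic non-local attention block to make that fusion concrete; the channel sizes and embedded-Gaussian form are illustrative assumptions, and this is not the paper's TFS or DTFN.

```python
import torch
import torch.nn as nn

class NonLocalAttention(nn.Module):
    """Minimal non-local attention block: each output position is a
    softmax-weighted fusion of features from all positions in the map."""
    def __init__(self, channels, embed_channels=None):
        super().__init__()
        embed_channels = embed_channels or channels // 2
        self.theta = nn.Conv2d(channels, embed_channels, 1)  # query projection
        self.phi = nn.Conv2d(channels, embed_channels, 1)    # key projection
        self.g = nn.Conv2d(channels, embed_channels, 1)      # value projection
        self.out = nn.Conv2d(embed_channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (b, hw, e)
        k = self.phi(x).flatten(2)                    # (b, e, hw)
        v = self.g(x).flatten(2).transpose(1, 2)      # (b, hw, e)
        attn = torch.softmax(q @ k, dim=-1)           # pairwise pixel similarities
        fused = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(fused)                    # residual fusion

x = torch.randn(1, 64, 32, 32)
print(NonLocalAttention(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```

The paper's observation follows directly from this structure: if the attended positions carry inconsistent texture, the weighted sum blends them into the output pixel regardless, which is where distortion can enter.
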
Title: HiSC4D: Human-Centered Interaction and 4D Scene Capture in Large-Scale Space Using Wearable IMUs and LiDAR
Authors: Yudi Dai; Zhiyong Wang; Xiping Lin; Chenglu Wen; Lan Xu; Siqi Shen; Yuexin Ma; Cheng Wang
DOI: 10.1109/TPAMI.2024.3457229
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11236-11253. Published 2024-09-10.
Abstract: We introduce HiSC4D, a novel Human-centered interaction and 4D Scene Capture method aimed at accurately and efficiently creating a dynamic digital world containing large-scale indoor-outdoor scenes, diverse human motions, rich human-human interactions, and human-environment interactions. Using body-mounted IMUs and a head-mounted LiDAR, HiSC4D captures egocentric human motions in unconstrained space without external devices or pre-built maps, affording great flexibility and accessibility for human-centered interaction and 4D scene capture in various environments. Because IMUs capture spatially unrestricted human poses but are prone to drift over long periods of use, while LiDAR is stable for global localization but coarse for local positions and orientations, HiSC4D employs a joint optimization method that harmonizes all sensors and exploits environment cues, yielding promising results for long-term capture in large scenes. To promote research on egocentric human interaction in large scenes and facilitate downstream tasks, we also present a dataset containing 8 sequences in 4 large scenes (200 to 5,000 m²), providing 36k frames of accurate 4D human motions with SMPL annotations and dynamic scenes, 31k frames of cropped human point clouds, and scene meshes of the environments. A variety of scenarios, such as a basketball gym and a commercial street, alongside challenging human motions, such as daily greetings, one-on-one basketball, and tour guiding, demonstrate the effectiveness and generalization ability of HiSC4D. The dataset and code will be made publicly available for research purposes.

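The sensor complementarity described above (locally accurate but drift-prone IMU increments versus coarse but unbiased LiDAR localization) is the classic motivation for a joint optimization. The toy 1-D least-squares fusion below illustrates only that complementarity; the trajectory, noise levels, and weights are invented, and this is not HiSC4D's actual optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
true_pos = 0.1 * np.arange(1, T + 1)            # ground-truth 1-D trajectory

# IMU: high-rate relative increments corrupted by a constant drift bias.
imu_delta = np.diff(true_pos, prepend=0.0) + 0.002 + 0.001 * rng.standard_normal(T)
# LiDAR: sparse but unbiased absolute fixes every 10 steps.
lidar_idx = np.arange(0, T, 10)
lidar_pos = true_pos[lidar_idx] + 0.05 * rng.standard_normal(lidar_idx.size)

# Joint least squares over the whole trajectory x, residuals scaled by sigma:
#   x[t] - x[t-1] = imu_delta[t]   (tight: IMU is locally accurate)
#   x[lidar_idx]  = lidar_pos      (loose: LiDAR is coarse but drift-free)
A_imu = np.eye(T) - np.eye(T, k=-1)
A_lidar = np.eye(T)[lidar_idx]
A = np.vstack([A_imu / 0.005, A_lidar / 0.05])
b = np.concatenate([imu_delta / 0.005, lidar_pos / 0.05])
x = np.linalg.lstsq(A, b, rcond=None)[0]

print("final error, raw IMU integration:", abs(np.cumsum(imu_delta)[-1] - true_pos[-1]))
print("final error, fused estimate:     ", abs(x[-1] - true_pos[-1]))
```

Running this shows the integrated-IMU drift growing with time while the fused estimate stays bounded by the periodic LiDAR fixes, which is the effect the joint optimization exploits at scale.
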
Title: One-Stage Anchor-Free Online Multiple Target Tracking With Deformable Local Attention and Task-Aware Prediction
Authors: Weiming Hu; Shaoru Wang; Zongwei Zhou; Jin Gao; Yangxi Li; Stephen Maybank
DOI: 10.1109/TPAMI.2024.3457886
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11446-11463. Published 2024-09-10.
Abstract: The tracking-by-detection paradigm currently dominates multiple target tracking algorithms. It usually comprises three tasks: target detection, appearance feature embedding, and data association. Carrying out these three tasks successively usually lowers tracking efficiency. In this paper, we propose a one-stage anchor-free multi-task learning framework that carries out target detection and appearance feature embedding in parallel, substantially increasing tracking speed. Sharing a pyramid of feature maps, the framework simultaneously predicts a target detection and produces a feature embedding for each location. We propose a deformable local attention module that exploits the correlations between features at different locations within a target to obtain more discriminative features, and a task-aware prediction module that uses deformable convolutions to select the most suitable locations for the different tasks. At the selected locations, classification of samples into foreground or background, appearance feature embedding, and target box regression are carried out. Two effective training strategies, regression range overlapping and sample reweighting, are proposed to reduce missed detections in dense scenes. Ambiguous samples, whose identities are difficult to determine, are handled explicitly to obtain more accurate appearance embeddings. An appearance-enhanced non-maximum suppression is proposed to reduce over-suppression of true targets in crowded scenes. Based on the one-stage anchor-free network with the deformable local attention and task-aware prediction modules, we implement a new online multiple target tracker. Experimental results show that our tracker achieves a very fast speed while maintaining high tracking accuracy.

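The efficiency claim rests on replacing the sequential detect-then-embed pipeline with parallel heads over shared features. A minimal PyTorch sketch of that shared-backbone, parallel-head layout follows; the channel sizes and single-level "pyramid" are simplifying assumptions, and the paper's deformable local attention and task-aware prediction modules are omitted.

```python
import torch
import torch.nn as nn

class JointDetEmbedHead(nn.Module):
    """One-stage layout: detection and appearance embedding are predicted
    in parallel for every location of a shared feature map."""
    def __init__(self, in_ch=256, num_classes=1, embed_dim=128):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, num_classes, 3, padding=1)  # fg/bg score per location
        self.box = nn.Conv2d(in_ch, 4, 3, padding=1)            # anchor-free box offsets (l, t, r, b)
        self.emb = nn.Conv2d(in_ch, embed_dim, 3, padding=1)    # appearance embedding per location

    def forward(self, feat):
        return self.cls(feat), self.box(feat), self.emb(feat)

feat = torch.randn(1, 256, 76, 136)          # one level of a shared feature pyramid
cls, box, emb = JointDetEmbedHead()(feat)
print(cls.shape, box.shape, emb.shape)
# At inference, embeddings are simply gathered at the detected locations and
# passed to data association, avoiding a second per-box embedding network.
```
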
Title: R³LIVE++: A Robust, Real-Time, Radiance Reconstruction Package With a Tightly-Coupled LiDAR-Inertial-Visual State Estimator
Authors: Jiarong Lin; Fu Zhang
DOI: 10.1109/TPAMI.2024.3456473
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11168-11185. Published 2024-09-09.
Abstract: This work proposes a LiDAR-inertial-visual fusion framework, R³LIVE++, that achieves robust and accurate state estimation while simultaneously reconstructing the radiance map on the fly. R³LIVE++ consists of a LiDAR-inertial odometry (LIO) subsystem and a visual-inertial odometry (VIO) subsystem, both running in real time. The LIO subsystem uses LiDAR measurements to reconstruct the geometric structure, while the VIO subsystem simultaneously recovers the radiance information of that structure from the input images. R³LIVE++ builds on R³LIVE and further improves localization and mapping accuracy by accounting for camera photometric calibration and online estimation of camera exposure time. We conduct more extensive experiments on public and self-collected datasets to compare the proposed system against other state-of-the-art SLAM systems. Quantitative and qualitative results show that R³LIVE++ significantly improves on the others in both accuracy and robustness. Moreover, to demonstrate the extensibility of R³LIVE++, we developed several applications based on our reconstructed maps, such as high dynamic range (HDR) imaging, virtual environment exploration, and 3D video gaming.

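The online exposure-time estimation credited above can be pictured as a tiny least-squares problem: under a linear camera response, an observed pixel intensity is approximately exposure × scene radiance, and the radiance of revisited map points is already known. The numpy sketch below shows that scalar estimate under those assumptions; it is schematic and not R³LIVE++'s actual photometric model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Radiance of map points visible in the current frame (known from the map).
radiance = rng.uniform(0.1, 1.0, size=500)
true_exposure = 0.65

# Observed intensities in the new frame: exposure * radiance + sensor noise.
observed = true_exposure * radiance + 0.01 * rng.standard_normal(radiance.size)

# Closed-form least squares for a single scalar exposure:
#   argmin_e  sum_i (observed_i - e * radiance_i)^2   =>   e = <r, o> / <r, r>
exposure_hat = radiance @ observed / (radiance @ radiance)
print(f"estimated exposure: {exposure_hat:.4f} (true {true_exposure})")

# The frame's intensities can then be divided by exposure_hat before updating
# per-point radiance, keeping the map photometrically consistent across frames.
```
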
Title: Vision-Centric BEV Perception: A Survey
Authors: Yuexin Ma; Tai Wang; Xuyang Bai; Huitong Yang; Yuenan Hou; Yaming Wang; Yu Qiao; Ruigang Yang; Xinge Zhu
DOI: 10.1109/TPAMI.2024.3449912
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10978-10997. Published 2024-09-09.
Abstract: In recent years, vision-centric Bird's Eye View (BEV) perception has garnered significant interest from both industry and academia due to its inherent advantages, such as providing an intuitive representation of the world and being conducive to data fusion. The rapid advancements in deep learning have led to the proposal of numerous methods for addressing vision-centric BEV perception challenges. However, there has been no recent survey encompassing this novel and burgeoning research field. To catalyze future research, this paper presents a comprehensive survey of the latest developments in vision-centric BEV perception and its extensions. It compiles and organizes up-to-date knowledge, offering a systematic review and summary of prevalent algorithms. Additionally, the paper provides in-depth analyses and comparative results on various BEV perception tasks, facilitating the evaluation of future works and sparking new research directions. Furthermore, the paper discusses and shares valuable empirical implementation details to aid in the advancement of related algorithms.

Title: MINN: Learning the Dynamics of Differential-Algebraic Equations and Application to Battery Modeling
Authors: Yicun Huang; Changfu Zou; Yang Li; Torsten Wik
DOI: 10.1109/TPAMI.2024.3456475
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11331-11344. Published 2024-09-09.
Abstract: The concept of integrating physics-based and data-driven approaches has become popular for modeling sustainable energy systems. However, the existing literature mainly focuses on data-driven surrogates generated to replace physics-based models. These models often trade accuracy for speed but lack the generalizability, adaptability, and interpretability inherent in physics-based models, which are often indispensable when modeling real-world dynamic systems for optimization and control. We propose a novel machine learning architecture, termed model-integrated neural networks (MINN), that can learn the physics-based dynamics of general autonomous or non-autonomous systems consisting of partial differential-algebraic equations (PDAEs). The obtained architecture systematically solves an unsettled research problem in control-oriented modeling: how to obtain models that are simultaneously physically insightful, numerically accurate, and computationally tractable. We apply the proposed architecture to model the electrochemical dynamics of lithium-ion batteries and show that MINN is extremely data-efficient to train while generalizing well to previously unseen input data, owing to its underlying physical invariants. The MINN battery model matches the accuracy of the first-principles model in predicting both the system outputs and locally distributed electrochemical behaviors, while reducing the solution time by two orders of magnitude.

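To make the (P)DAE setting concrete: a semi-explicit DAE couples differential states x with algebraic states z through a constraint g(x, z) = 0 that must hold at every time step, which is what distinguishes it from a plain ODE. The toy integrator below embeds a stand-in "learned" dynamics term inside such a structure; the equations, the fixed one-layer network, and the explicit Euler stepping are all invented for illustration and say nothing about MINN's actual architecture.

```python
import numpy as np
from scipy.optimize import brentq

# Toy semi-explicit index-1 DAE:
#   dx/dt = f_theta(x, z)               (differential part, "learned" term)
#   0     = g(x, z) = z**3 + z - x      (algebraic constraint, monotone in z)
def f_theta(x, z, w=(-0.8, 0.3)):
    # Stand-in for a learned dynamics term: a fixed one-layer "network".
    return w[0] * np.tanh(x) + w[1] * z

def solve_z(x):
    # Enforce the algebraic constraint at each step by root-finding in z.
    return brentq(lambda z: z**3 + z - x, -10.0, 10.0)

# Explicit Euler stepping that re-solves the constraint at every step.
x, dt = 1.0, 0.01
for _ in range(500):
    z = solve_z(x)
    x = x + dt * f_theta(x, z)

z = solve_z(x)
print(f"x(5) = {x:.4f}, z(5) = {z:.4f}, constraint residual = {z**3 + z - x:+.1e}")
```

The design point this illustrates is the one the abstract emphasizes: the constraint is satisfied by construction at every step, rather than being learned approximately by an unconstrained surrogate.
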
Title: Diversifying Policies With Non-Markov Dispersion to Expand the Solution Space
Authors: Bohao Qu; Xiaofeng Cao; Yi Chang; Ivor W. Tsang; Yew-Soon Ong
DOI: 10.1109/TPAMI.2024.3455257
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11392-11408. Published 2024-09-06.
Abstract: Policy diversity, encompassing the variety of policies an agent can adopt, enhances reinforcement learning (RL) success by fostering more robust, adaptable, and innovative problem-solving in the environment. The environment in which standard RL operates is usually modeled with a Markov Decision Process (MDP) as the theoretical foundation. However, in many real-world scenarios the rewards depend on an agent's history of states and actions, leading to a non-MDP. Under the premise of policy diffusion initialization, non-MDPs may have an unstructured, expanding solution space due to varying historical information and temporal dependencies, so that solutions have non-equivalent closed forms. In this paper, we derive diverse solutions for non-MDPs by requiring policies to break through the boundaries of the current solution space via gradual dispersion; the goal is to expand the solution space and thereby obtain more diverse policies. Specifically, we first model sequences of states and actions with a transformer-based method to learn policy embeddings for dispersion in the solution space, since the transformer has advantages in handling sequential data and capturing the long-range dependencies characteristic of non-MDPs. We then stack the policy embeddings to construct a dispersion matrix as a policy diversity measure, which induces policy dispersion in the solution space and yields a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings effectively enlarge the disagreements across policies, yielding a diverse expression of the original policy embedding distribution. Experimental results in both non-MDP and MDP environments show that this dispersion scheme obtains more expressive, diverse policies by expanding the solution space and performs more robustly than recent learning baselines.

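The positive-definiteness condition can be checked directly: stack the policy embeddings, form their Gram matrix, and inspect its smallest eigenvalue; a zero eigenvalue means some policy embedding is a linear combination of the others, i.e., no dispersion in that direction. A minimal numpy sketch of that check follows, under assumed embedding shapes; the paper's actual dispersion-matrix construction may differ.

```python
import numpy as np

def dispersion_check(embeddings, eps=1e-8):
    """embeddings: (n_policies, d) matrix of policy embeddings.
    Returns the smallest Gram-matrix eigenvalue and whether the Gram
    matrix is (numerically) positive definite."""
    gram = embeddings @ embeddings.T          # pairwise inner products
    min_eig = np.linalg.eigvalsh(gram).min()  # symmetric eigenvalue solver
    return min_eig, min_eig > eps

rng = np.random.default_rng(0)
diverse = rng.standard_normal((4, 16))                      # 4 distinct policies
collapsed = np.tile(rng.standard_normal((1, 16)), (4, 1))   # all policies identical

print(dispersion_check(diverse)[1])    # True: embeddings span distinct directions
print(dispersion_check(collapsed)[1])  # False: zero dispersion, policies coincide
```

A diversity objective built on this quantity would push the smallest eigenvalue away from zero, which is one way to read the paper's claim that positive definiteness enlarges disagreements across policies.
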
Title: Integrating Neural Radiance Fields End-to-End for Cognitive Visuomotor Navigation
Authors: Qiming Liu; Haoran Xin; Zhe Liu; Hesheng Wang
DOI: 10.1109/TPAMI.2024.3455252
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11200-11215. Published 2024-09-06.
Abstract: We propose an end-to-end visuomotor navigation framework that leverages Neural Radiance Fields (NeRF) for spatial cognition. To the best of our knowledge, this is the first effort to integrate such an implicit spatial representation with an embodied policy, end-to-end, for cognitive decision-making. Consequently, our system requires neither modularized designs nor transformations into explicit scene representations for downstream control. The NeRF-based memory is constructed online during navigation, without relying on any environmental priors. To enhance the extraction of decision-critical historical insights from the rigid and implicit structure of NeRF, we introduce a spatial information extraction mechanism named Structural Radiance Attention (SRA). SRA empowers the agent to grasp complex scene structures and task objectives, paving the way for intelligent behavioral patterns. Comprehensive testing on image-goal navigation tasks demonstrates that our approach significantly outperforms existing navigation models. We show that SRA markedly improves the agent's understanding of both the scene and the task by retrieving historical information stored in the NeRF memory. The agent also learns exploratory awareness from our pipeline, better adapting to low signal-to-noise memory signals in unknown scenes. We deploy our navigation system on a mobile robot in real-world scenarios, where it exhibits evident cognitive capabilities while maintaining real-time performance.

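The abstract does not detail SRA's internals, but its role, reading decision-relevant context out of a spatial memory, can be pictured as cross-attention from the current observation into a bank of stored features. The PyTorch sketch below shows only that generic readout pattern; the shapes and the flat memory bank are assumptions, not SRA's design.

```python
import torch
import torch.nn as nn

class MemoryReadout(nn.Module):
    """Generic cross-attention readout: the agent's current observation
    queries a bank of spatial memory features (e.g., sampled from an
    implicit map) to pool historical context for the policy."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obs_feat, memory_feats):
        # obs_feat: (b, 1, dim) current observation embedding
        # memory_feats: (b, n, dim) features sampled from the spatial memory
        read, weights = self.attn(obs_feat, memory_feats, memory_feats)
        return read.squeeze(1), weights

obs = torch.randn(2, 1, 128)
mem = torch.randn(2, 256, 128)
context, w = MemoryReadout()(obs, mem)
print(context.shape, w.shape)  # torch.Size([2, 128]) torch.Size([2, 1, 256])
```
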
Title: Variational Label Enhancement for Instance-Dependent Partial Label Learning
Authors: Ning Xu; Congyu Qiao; Yuchen Zhao; Xin Geng; Min-Ling Zhang
DOI: 10.1109/TPAMI.2024.3455260
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11298-11313. Published 2024-09-06.
Abstract: Partial label learning (PLL) is a form of weakly supervised learning where each training example is linked to a set of candidate labels, among which only one label is correct. Most existing PLL approaches assume that the incorrect labels in each training example are picked at random as the candidate labels. In practice, however, this assumption may not hold, as the candidate labels are often instance-dependent. In this paper, we address the instance-dependent PLL problem and assume that each example is associated with a latent label distribution, in which an incorrect label with a high degree is more likely to be annotated as a candidate label. Motivated by this consideration, we propose two methods, VALEN and MILEN, which train the predictive model using the latent label distributions recovered by a label enhancement process. Specifically, VALEN recovers the latent label distributions by inferring the variational posterior density, parameterized by an inference model, with a deduced evidence lower bound. MILEN recovers the latent label distribution by adopting a variational approximation to bound the mutual information among the latent label distribution, the observed labels, and the augmented instances. Experiments on benchmark and real-world datasets validate the effectiveness of the proposed methods.

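At its simplest, label enhancement under a candidate set means recovering a distribution that puts mass only on the candidates while fitting the current model's scores. The sketch below shows that stripped-down recovery step (a softmax masked by the candidate set); it is illustrative only and omits VALEN's and MILEN's variational machinery, and the classifier logits here are random stand-ins.

```python
import numpy as np

def recover_label_distribution(logits, candidate_mask):
    """Recover a latent label distribution restricted to the candidate set.
    logits: (n_classes,) model scores; candidate_mask: (n_classes,) 0/1."""
    masked = np.where(candidate_mask == 1, logits, -np.inf)  # zero mass off-candidates
    masked = masked - masked.max()                            # numerical stability
    p = np.exp(masked)
    return p / p.sum()

rng = np.random.default_rng(0)
logits = rng.standard_normal(5)
candidates = np.array([1, 0, 1, 1, 0])       # labels 0, 2, 3 are candidates
dist = recover_label_distribution(logits, candidates)
print(dist.round(3), dist[candidates == 0])  # zero probability off the candidate set
# Training would then alternate: fit the classifier against this soft
# distribution, and re-estimate the distribution from the updated classifier.
```
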
Title: TagCLIP: Improving Discrimination Ability of Zero-Shot Semantic Segmentation
Authors: Jingyao Li; Pengguang Chen; Shengju Qian; Shu Liu; Jiaya Jia
DOI: 10.1109/TPAMI.2024.3454647
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11287-11297. Published 2024-09-04.
Abstract: Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks. However, existing approaches that use CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes, confusing novel classes with semantically similar ones. In this work, we propose a novel approach, TagCLIP (Trusty-aware guided CLIP), to address this issue. We disentangle the ill-posed optimization problem into two parallel processes: semantic matching, performed individually, and a reliability judgment that improves discrimination ability. Building on the idea of special tokens in language modeling that represent sentence-level embeddings, we introduce a trusty token that enables distinguishing novel classes from known ones in prediction. We evaluate our approach on two benchmark datasets, PASCAL VOC 2012 and COCO-Stuff 164K. Our results show that TagCLIP improves the Intersection over Union (IoU) of unseen classes by 7.4% and 1.7%, respectively, with negligible overhead. The code is publicly available.

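The trusty-token idea can be pictured as a per-pixel reliability score that gates how much the model trusts matches to unseen classes on top of ordinary patch-text similarity. The numpy sketch below is schematic: the gating rule, shapes, and the reliability map are illustrative assumptions, not TagCLIP's exact formulation.

```python
import numpy as np

def gated_segmentation_scores(patch_feats, text_feats, trusty, unseen_mask):
    """patch_feats: (hw, d); text_feats: (c, d); trusty: (hw,) in [0, 1],
    high where the pixel plausibly belongs to a novel class;
    unseen_mask: (c,) 1 for unseen classes, 0 for seen ones."""
    # Cosine similarity between every patch and every class prompt.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sim = p @ t.T                                        # (hw, c)
    # Gate: down-weight unseen-class scores where reliability is low,
    # so ambiguous pixels fall back to semantically similar seen classes.
    gate = np.where(unseen_mask == 1, trusty[:, None], 1.0)
    return sim * gate

rng = np.random.default_rng(0)
scores = gated_segmentation_scores(
    rng.standard_normal((64, 512)), rng.standard_normal((8, 512)),
    rng.uniform(size=64), np.array([0] * 6 + [1] * 2))
print(scores.shape, scores.argmax(axis=1)[:8])           # per-pixel class prediction
```
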