{"title":"FSD V2: Improving Fully Sparse 3D Object Detection With Virtual Voxels","authors":"Lue Fan;Feng Wang;Naiyan Wang;Zhaoxiang Zhang","doi":"10.1109/TPAMI.2024.3502456","DOIUrl":"10.1109/TPAMI.2024.3502456","url":null,"abstract":"LiDAR-based fully sparse architecture has gained increasing attention. FSDv1 stands out as a representative work, achieving impressive efficacy and efficiency, albeit with intricate structures and handcrafted designs. In this paper, we present FSDv2, an evolution that aims to simplify the previous FSDv1 and eliminate the ad-hoc heuristics in its handcrafted instance-level representation, thus promoting better universality. To this end, we introduce <i>virtual voxels</i>, taking over the clustering-based instance segmentation in FSDv1. Virtual voxels not only address the notorious issue of the Center Feature Missing in fully sparse detectors but also endow the framework with a more elegant and streamlined approach. Besides, we develop a suite of components to complement the virtual voxel mechanism, including a virtual voxel encoder, a virtual voxel mixer, and a virtual voxel assignment strategy. We conduct experiments on three large-scale datasets: <i>Waymo Open Dataset</i>, <i>Argoverse 2</i> dataset, and <i>nuScenes</i> dataset. Our results showcase state-of-the-art performance on all three datasets, highlighting the superiority of FSDv2 in long-range scenarios and its universality in achieving competitive performance across diverse scenarios. Moreover, we provide comprehensive experimental analysis to understand the workings of FSDv2.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 2","pages":"1279-1292"},"PeriodicalIF":0.0,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142673351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
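The FSDv2 abstract builds on sparse voxel representations of LiDAR point clouds. As a generic illustration of the underlying voxelization step (not the paper's virtual-voxel mechanism itself, and with all names illustrative), a minimal sparse voxelizer with mean-pooled features can be sketched as:

```python
import numpy as np

def voxelize(points, voxel_size):
    # Assign each 3D point to an integer voxel index, then mean-pool the
    # point coordinates per occupied voxel as a trivial voxel feature.
    # This is only the generic sparse-voxelization step for illustration.
    idx = np.floor(points / voxel_size).astype(int)
    uniq, inv = np.unique(idx, axis=0, return_inverse=True)
    inv = inv.ravel()
    feats = np.zeros((len(uniq), points.shape[1]))
    np.add.at(feats, inv, points)          # scatter-add points into voxels
    counts = np.bincount(inv, minlength=len(uniq))[:, None]
    return uniq, feats / counts
```

Only occupied voxels are materialized, which is what makes such architectures "fully sparse": memory scales with the number of points, not the scene volume.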
{"title":"Event-Enhanced Snapshot Compressive Videography at 10K FPS","authors":"Bo Zhang;Jinli Suo;Qionghai Dai","doi":"10.1109/TPAMI.2024.3496788","DOIUrl":"10.1109/TPAMI.2024.3496788","url":null,"abstract":"Video snapshot compressive imaging (SCI) encodes the target dynamic scene compactly into a snapshot and reconstructs its high-speed frame sequence afterward, greatly reducing the required data footprint and transmission bandwidth as well as enabling high-speed imaging with a low frame rate intensity camera. In implementation, high-speed dynamics are encoded via temporally varying patterns, and only frames at corresponding temporal intervals can be reconstructed, while the dynamics occurring between consecutive frames are lost. To unlock the potential of conventional snapshot compressive videography, we propose a novel hybrid “intensity + event” imaging scheme by incorporating an event camera into a video SCI setup. Our proposed system consists of a dual-path optical setup to record the coded intensity measurement and intermediate event signals simultaneously, which is compact and photon-efficient by collecting the half photons discarded in conventional video SCI. Correspondingly, we developed a dual-branch Transformer utilizing the reciprocal relationship between two data modes to decode dense video frames. Extensive experiments on both simulated and real-captured data demonstrate our superiority to state-of-the-art video SCI and video frame interpolation (VFI) methods. Benefiting from the new hybrid design leveraging both intrinsic redundancy in videos and the unique feature of event cameras, we achieve high-quality videography at 0.1 ms time intervals with a low-cost CMOS image sensor working at 24 FPS.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 2","pages":"1266-1278"},"PeriodicalIF":0.0,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142599240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
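The standard video SCI forward model compresses B high-speed frames into a single coded snapshot, y = Σ_t C_t ⊙ x_t. A minimal sketch of that measurement process (random binary masks and the masked-average initialization are illustrative, not the paper's design):

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, W = 8, 16, 16                       # frames per snapshot, image size
frames = rng.random((B, H, W))            # high-speed scene x_t
masks = rng.integers(0, 2, (B, H, W)).astype(float)   # coding patterns C_t

# one coded snapshot: y = sum_t C_t * x_t
y = (masks * frames).sum(axis=0)

# crude initialization many classical decoders start from:
# each frame gets the mask-normalized measurement where its mask is on
denom = np.maximum(masks.sum(axis=0), 1.0)
x0 = masks * (y / denom)[None]
```

Reconstruction then amounts to inverting this heavily underdetermined map, which is where learned priors (and, in this paper, the auxiliary event stream) come in.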
{"title":"Online Learning Under a Separable Stochastic Approximation Framework","authors":"Min Gan;Xiang-xiang Su;Guang-yong Chen;Jing Chen;C. L. Philip Chen","doi":"10.1109/TPAMI.2024.3495783","DOIUrl":"10.1109/TPAMI.2024.3495783","url":null,"abstract":"We propose an online learning algorithm tailored for a class of machine learning models within a separable stochastic approximation framework. The central idea of our approach is to exploit the inherent separability in many models, recognizing that certain parameters are easier to optimize than others. This paper focuses on models where some parameters exhibit linear characteristics, which are common in machine learning applications. In our proposed algorithm, the linear parameters are updated using the recursive least squares (RLS) algorithm, akin to a stochastic Newton method. Subsequently, based on these updated linear parameters, the nonlinear parameters are adjusted using the stochastic gradient method (SGD). This dual-update mechanism can be viewed as a stochastic approximation variant of block coordinate gradient descent, where one subset of parameters is optimized using a second-order method while the other is handled with a first-order approach. We establish the global convergence of our online algorithm for non-convex cases in terms of the expected violation of first-order optimality conditions. Numerical experiments demonstrate that our method achieves significantly faster initial convergence and produces more robust performance compared to other popular learning algorithms. Additionally, our algorithm exhibits reduced sensitivity to learning rates and outperforms the recently proposed <monospace>slimTrain</monospace> algorithm (Newman et al. 2022). For validation, the code has been made available on GitHub.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 2","pages":"1317-1330"},"PeriodicalIF":0.0,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142599238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
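The dual-update scheme described in the abstract can be sketched on a toy separable model y ≈ w·tanh(Θx): the linear weights w get a recursive least squares (RLS) update, and the nonlinear parameters Θ a plain SGD step against the refreshed w. Everything concrete here (the tanh feature map, the toy target, the names theta/w/P/lr) is illustrative, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat = 3, 4
theta = rng.normal(size=(d_feat, d_in)) * 0.3   # nonlinear parameters
w = np.zeros(d_feat)                            # linear parameters
P = np.eye(d_feat) * 100.0                      # RLS inverse-covariance estimate
lr = 0.01                                       # SGD step size

errors = []
for _ in range(2000):
    x = rng.normal(size=d_in)
    y = x.sum()                       # toy regression target
    f = np.tanh(theta @ x)            # nonlinear feature map
    err = y - f @ w
    errors.append(err ** 2)
    # second-order (stochastic Newton-like) RLS update of the linear part
    k = P @ f / (1.0 + f @ P @ f)     # RLS gain vector
    w = w + k * err
    P = P - np.outer(k, f @ P)
    # first-order SGD update of the nonlinear part, given the new w
    resid = y - np.tanh(theta @ x) @ w
    grad = -resid * np.outer(w * (1.0 - np.tanh(theta @ x) ** 2), x)
    theta = theta - lr * grad
```

The RLS branch makes the easy (linear) subproblem converge in a handful of samples, which is the source of the fast initial convergence the abstract claims.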
{"title":"Estimating Information Theoretic Measures via Multidimensional Gaussianization","authors":"Valero Laparra;Juan Emmanuel Johnson;Gustau Camps-Valls;Raúl Santos-Rodríguez;Jesús Malo","doi":"10.1109/TPAMI.2024.3495827","DOIUrl":"10.1109/TPAMI.2024.3495827","url":null,"abstract":"Information theory is an outstanding framework for measuring uncertainty, dependence, and relevance in data and systems. It has several desirable properties for real-world applications: it naturally deals with multivariate data, handles heterogeneous data, and its measures can be interpreted. However, it has not been adopted by a wider audience because obtaining information from multidimensional data is a challenging problem due to the curse of dimensionality. We propose an indirect way of estimating information based on a multivariate iterative Gaussianization transform. The proposed method has a multivariate-to-univariate property: it reduces the <i>challenging</i> estimation of multivariate measures to a composition of <i>marginal</i> operations applied in each iteration of the Gaussianization. Therefore, the convergence of the resulting estimates depends on the convergence of well-understood univariate entropy estimates, and the global error linearly depends on the number of times the marginal estimator is invoked. We introduce Gaussianization-based estimates for Total Correlation, Entropy, Mutual Information, and Kullback-Leibler Divergence. Results on artificial data show that our approach is superior to previous estimators, particularly in high-dimensional scenarios. We also illustrate the method's performance in different fields to obtain interesting insights. We make the tools and datasets publicly available to provide a test bed for analyzing future methodologies.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 2","pages":"1293-1308"},"PeriodicalIF":0.0,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142599243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
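The multivariate-to-univariate idea can be illustrated with a small iterative-Gaussianization estimator of Total Correlation: each iteration Gaussianizes the marginals, accounts for the Gaussian (covariance) part of the dependence, rotates with PCA, and measures the marginal negentropies the rotation re-exposes. This is a simplified sketch under my own assumptions (rank-based marginal Gaussianization, histogram negentropy, PCA rotations), not the authors' released code:

```python
import numpy as np
from statistics import NormalDist

def gaussianize_marginals(x):
    # rank-based transform mapping each column to exact N(0,1) marginals
    nd = NormalDist()
    n = x.shape[0]
    out = np.empty_like(x, dtype=float)
    for j in range(x.shape[1]):
        ranks = np.argsort(np.argsort(x[:, j])) + 1
        out[:, j] = [nd.inv_cdf(r / (n + 1)) for r in ranks]
    return out

def marginal_negentropy(col, bins=50):
    # histogram entropy (Miller-Madow corrected) vs. Gaussian entropy
    n = col.size
    hist, edges = np.histogram(col, bins=bins)
    widths = np.diff(edges)
    p = hist / n
    nz = p > 0
    h = -np.sum(p[nz] * np.log(p[nz] / widths[nz])) + (nz.sum() - 1) / (2 * n)
    h_gauss = 0.5 * np.log(2 * np.pi * np.e * col.var())
    return max(h_gauss - h, 0.0)

def total_correlation(x, n_iter=8):
    # per iteration: Gaussianize marginals, add the Gaussian dependence term
    # (-0.5 log det of the correlation matrix), PCA-rotate, then add the
    # marginal negentropies the rotation re-exposes; the sum telescopes to
    # the total correlation of the input
    t = 0.0
    for _ in range(n_iter):
        y = gaussianize_marginals(x)
        c = np.corrcoef(y.T)
        t += -0.5 * np.log(np.linalg.det(c))
        _, vecs = np.linalg.eigh(c)
        z = y @ vecs
        t += sum(marginal_negentropy(z[:, j]) for j in range(z.shape[1]))
        x = z
    return t
```

Note that only univariate operations (rank transforms, 1-D histograms) touch the data; the multivariate structure enters solely through the correlation matrix, which is the point of the multivariate-to-univariate property.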
{"title":"Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics","authors":"Alessandra Carbone;Aurélien Decelle;Lorenzo Rosset;Beatriz Seoane","doi":"10.1109/TPAMI.2024.3495999","DOIUrl":"10.1109/TPAMI.2024.3495999","url":null,"abstract":"In this study, we address the challenge of using energy-based models to produce high-quality, label-specific data in complex structured datasets, such as population genetics, RNA, or protein sequence data. Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing, which affects the diversity of synthetic data and increases generation times. To address these issues, we use a novel training algorithm that exploits non-equilibrium effects. This approach, applied to the Restricted Boltzmann Machine, improves the model's ability to correctly classify samples and generate high-quality synthetic data in only a few sampling steps. The effectiveness of this method is demonstrated by its successful application to five different types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, homologous RNA sequences from specific taxonomies, and real classical piano pieces classified by their composer.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 2","pages":"1309-1316"},"PeriodicalIF":0.0,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142599242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
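The "few sampling steps" claim refers to generating from a Restricted Boltzmann Machine with short alternating Gibbs chains. As background (this is the standard Bernoulli RBM sampler, not the paper's non-equilibrium training algorithm), k Gibbs steps look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_steps(v, W, b_h, b_v, k, rng):
    # k alternating Gibbs steps of a Bernoulli RBM with weights W and
    # biases b_h, b_v; the paper's point is that a non-equilibrium-trained
    # model already yields usable samples at very small k
    for _ in range(k):
        p_h = sigmoid(v @ W + b_h)                     # hidden given visible
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(h @ W.T + b_v)                   # visible given hidden
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v
```

Equilibrium sampling would require k large enough for the chain to mix; training the model so that small-k, out-of-equilibrium chains already produce diverse samples is what removes the mixing bottleneck.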
{"title":"Minimum Latency Deep Online Video Stabilization and Its Extensions","authors":"Shuaicheng Liu;Zhuofan Zhang;Zhen Liu;Ping Tan;Bing Zeng","doi":"10.1109/TPAMI.2024.3493175","DOIUrl":"10.1109/TPAMI.2024.3493175","url":null,"abstract":"We present a novel deep camera path optimization framework for minimum latency online video stabilization. Typically, a stabilization pipeline consists of three steps: motion estimation, path smoothing, and novel view synthesis. Most previous methods concentrate on motion estimation while path optimization receives less attention, particularly in the crucial online setting where future frames are inaccessible. In this work, we adopt off-the-shelf high-quality deep motion models for motion estimation and focus only on the path optimization. Specifically, our camera path smoothing network takes a short 2D camera path in a sliding window as input and outputs the stabilizing warp field of the last frame, which warps the coming frame to its stabilized position. We explore three motion densities: a global single camera path, local mesh-based bundled paths, and dense flow paths. A hybrid loss and an efficient motion smoothing attention (EMSA) module are proposed for spatially and temporally consistent path smoothing. Moreover, we build a motion dataset that contains stable and unstable motion pairs for training. Extensive experiments demonstrate that our method surpasses state-of-the-art online stabilization methods and rivals the performance of offline methods, offering compelling advancements in the field of video stabilization.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 2","pages":"1238-1249"},"PeriodicalIF":0.0,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142596563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
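The online interface the abstract describes (sliding window of past path positions in, warp for the newest frame out) can be mimicked with a classical baseline: fit a smooth trend over the window and warp only the last frame toward it. The linear fit, window size, and 1D path here are all illustrative stand-ins for the paper's learned network:

```python
import numpy as np

def stabilize_step(window):
    # window: the most recent camera path positions, oldest first.
    # Fit a linear trend over the window and return the warp offset that
    # moves the NEWEST frame onto the smoothed path (minimum-latency:
    # no future frames are used).
    w = len(window)
    t = np.arange(w)
    slope, intercept = np.polyfit(t, window, 1)
    smoothed_last = slope * (w - 1) + intercept
    return smoothed_last - window[-1]

# toy 1D camera path: steady motion plus jitter
rng = np.random.default_rng(0)
path = np.cumsum(0.5 + rng.normal(0.0, 2.0, 200))
window_size = 15
warps = [stabilize_step(path[max(0, i - window_size + 1):i + 1])
         for i in range(window_size - 1, 200)]
stabilized = path[window_size - 1:] + np.array(warps)
```

The latency is exactly zero frames: each output depends only on the current and past path samples, which is the constraint that makes online stabilization harder than offline smoothing over the whole path.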
{"title":"Language-Inspired Relation Transfer for Few-Shot Class-Incremental Learning","authors":"Yifan Zhao;Jia Li;Zeyin Song;Yonghong Tian","doi":"10.1109/TPAMI.2024.3492328","DOIUrl":"10.1109/TPAMI.2024.3492328","url":null,"abstract":"Depicting novel classes with language descriptions by observing few-shot samples is inherent in human-learning systems. This lifelong learning capability helps to distinguish new knowledge from old ones through the increase of open-world learning, namely Few-Shot Class-Incremental Learning (FSCIL). Existing works to solve this problem mainly rely on the careful tuning of visual encoders, which shows an evident trade-off between the base knowledge and incremental ones. Motivated by human learning systems, we propose a new Language-inspired Relation Transfer (LRT) paradigm to understand objects by joint visual clues and text depictions, composed of two major steps. We first transfer the pretrained text knowledge to the visual domains by proposing a graph relation transformation module and then fuse the visual and language embedding by a text-vision prototypical fusion module. Second, to mitigate the domain gap caused by visual finetuning, we propose context prompt learning for fast domain alignment and imagined contrastive learning to alleviate the insufficient text data during alignment. With collaborative learning of domain alignments and text-image transfer, our proposed LRT outperforms the state-of-the-art models by over 13% and 7% on the final session of <i>mini</i>ImageNet and CIFAR-100 FSCIL benchmarks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 2","pages":"1089-1102"},"PeriodicalIF":0.0,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142591827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
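The text-vision prototypical fusion idea (build a per-class prototype from both a visual class mean and a text embedding, then classify by nearest prototype) can be sketched minimally. The convex-mix fusion, alpha, and cosine classifier below are illustrative assumptions, not the paper's exact module:

```python
import numpy as np

def fused_prototypes(vis_feats, txt_feats, labels, alpha=0.5):
    # per-class prototype: convex mix of the visual class mean and that
    # class's text embedding (fusion form and alpha are illustrative)
    protos = []
    for c in range(txt_feats.shape[0]):
        v = vis_feats[labels == c].mean(axis=0)
        protos.append(alpha * v + (1.0 - alpha) * txt_feats[c])
    return np.stack(protos)

def classify(x, protos):
    # nearest prototype under cosine similarity
    x = x / np.linalg.norm(x)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return int(np.argmax(p @ x))
```

For few-shot incremental classes, the text term anchors the prototype even when the visual mean is estimated from only a handful of samples.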
{"title":"Multi-Modality Multi-Attribute Contrastive Pre-Training for Image Aesthetics Computing","authors":"Yipo Huang;Leida Li;Pengfei Chen;Haoning Wu;Weisi Lin;Guangming Shi","doi":"10.1109/TPAMI.2024.3492259","DOIUrl":"10.1109/TPAMI.2024.3492259","url":null,"abstract":"In the Image Aesthetics Computing (IAC) field, most prior methods leveraged the off-the-shelf backbones pre-trained on the large-scale ImageNet database. While these pre-trained backbones have achieved notable success, they often overemphasize object-level semantics and fail to capture the high-level concepts of image aesthetics, which may only achieve suboptimal performances. To tackle this long-neglected problem, we propose a multi-modality multi-attribute contrastive pre-training framework, targeting at constructing an alternative to ImageNet-based pre-training for IAC. Specifically, the proposed framework consists of two main aspects. 1) We build a multi-attribute image description database with human feedback, leveraging the competent image understanding capability of the multi-modality large language model to generate rich aesthetic descriptions. 2) To better adapt models to aesthetic computing tasks, we integrate the image-based visual features with the attribute-based text features, and map the integrated features into different embedding spaces, based on which the multi-attribute contrastive learning is proposed for obtaining more comprehensive aesthetic representation. To alleviate the distribution shift encountered when transitioning from the general visual domain to the aesthetic domain, we further propose a semantic affinity loss to restrain the content information and enhance model generalization. Extensive experiments demonstrate that the proposed framework sets a new state of the art for IAC tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 2","pages":"1205-1218"},"PeriodicalIF":0.0,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142591892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
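The contrastive objective underlying such image-text pre-training is typically a symmetric InfoNCE loss over matched image/description pairs. As generic background (standard CLIP-style loss, not the paper's multi-attribute variant):

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    # symmetric contrastive loss: row i of img_emb is the positive match
    # for row i of txt_emb; all other rows in the batch are negatives
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def xent_diag(l):
        # cross-entropy with the matched pair on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

The multi-attribute extension in the paper applies this kind of objective per aesthetic attribute in separate embedding spaces rather than once over a single global embedding.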
{"title":"Anti-Forgetting Adaptation for Unsupervised Person Re-Identification","authors":"Hao Chen;Francois Bremond;Nicu Sebe;Shiliang Zhang","doi":"10.1109/TPAMI.2024.3490777","DOIUrl":"10.1109/TPAMI.2024.3490777","url":null,"abstract":"Regular unsupervised domain adaptive person re-identification (ReID) focuses on adapting a model from a source domain to a fixed target domain. However, an adapted ReID model can hardly retain previously-acquired knowledge and generalize to unseen data. In this paper, we propose a Dual-level Joint Adaptation and Anti-forgetting (DJAA) framework, which incrementally adapts a model to new domains without forgetting source domain and each adapted target domain. We explore the possibility of using prototype and instance-level consistency to mitigate the forgetting during the adaptation. Specifically, we store a small number of representative image samples and corresponding cluster prototypes in a memory buffer, which is updated at each adaptation step. With the buffered images and prototypes, we regularize the image-to-image similarity and image-to-prototype similarity to rehearse old knowledge. After the multi-step adaptation, the model is tested on all seen domains and several unseen domains to validate the generalization ability of our method. Extensive experiments demonstrate that our proposed method significantly improves the anti-forgetting, generalization and backward-compatible ability of an unsupervised person ReID model.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 2","pages":"1056-1072"},"PeriodicalIF":0.0,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142577347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
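The rehearsal mechanism (a small buffer of exemplar features and cluster prototypes whose similarities are regularized during adaptation) can be sketched as follows; the FIFO eviction, cosine penalty, and all names are illustrative assumptions, not DJAA's exact losses:

```python
import numpy as np

class RehearsalBuffer:
    # fixed-capacity store of exemplar features and their cluster
    # prototypes, used to rehearse old knowledge during adaptation
    def __init__(self, capacity):
        self.capacity = capacity
        self.feats, self.protos = [], []

    def add(self, feat, proto):
        if len(self.feats) >= self.capacity:
            self.feats.pop(0)          # simple FIFO eviction (illustrative)
            self.protos.pop(0)
        self.feats.append(feat)
        self.protos.append(proto)

    def rehearsal_loss(self, model_feats):
        # penalize drift of re-extracted exemplar features away from
        # their stored prototypes (cosine distance)
        loss = 0.0
        for f, p in zip(model_feats, self.protos):
            f = f / np.linalg.norm(f)
            p = p / np.linalg.norm(p)
            loss += 1.0 - f @ p
        return loss / max(len(self.protos), 1)
```

During each adaptation step, the current model re-extracts features for the buffered images and this loss is added to the adaptation objective, pulling the representation back toward what earlier domains learned.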
{"title":"Evolved Hierarchical Masking for Self-Supervised Learning","authors":"Zhanzhou Feng;Shiliang Zhang","doi":"10.1109/TPAMI.2024.3490776","DOIUrl":"10.1109/TPAMI.2024.3490776","url":null,"abstract":"Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to a limited vision cues modeling capability. This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is hence adopted to generate masks accordingly. The accuracy of hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks including partial-duplicate image retrieval relying on low-level details, as well as image classification and semantic segmentation that require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1% in ImageNet-1K classification and 1.4% in ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantic demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 2","pages":"1013-1027"},"PeriodicalIF":0.0,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142577352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
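The evolving-difficulty idea (masks driven by the model's own per-patch scores, shifting from low-level to semantic cues over training) can be caricatured with a tiny scheduler. This is a deliberately coarse sketch under my own assumptions (a scalar per-patch score, a hard low/high switch at mid-training), not the paper's hierarchy-parsing procedure:

```python
import numpy as np

def evolved_mask(scores, mask_ratio, progress):
    # scores: per-patch saliency produced by the model being trained
    # progress: training progress in [0, 1]; early training masks patches
    # with low scores (textures), late training masks high-score patches
    # (object semantics), so mask difficulty evolves with the model
    n = len(scores)
    k = int(n * mask_ratio)
    order = np.argsort(scores)           # low score -> high score
    chosen = order[:k] if progress < 0.5 else order[-k:]
    mask = np.zeros(n, dtype=bool)
    mask[chosen] = True
    return mask
```

The key property this preserves from the abstract is that no extra pre-trained model is needed: the scores come from the network being trained, so mask difficulty automatically tracks its current capability.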