Haeyong Kang; Jaehong Yoon; Sung Ju Hwang; Chang D. Yoo, "Continual Learning: Forget-Free Winning Subnetworks for Video Representations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 2140-2156. DOI: 10.1109/TPAMI.2024.3518588. Published online 16 December 2024.
Abstract: Inspired by the Lottery Ticket Hypothesis (LTH), which highlights the existence of efficient subnetworks within larger, dense networks, we consider a high-performing Winning Subnetwork (WSN) under appropriate sparsity conditions for various continual learning tasks. WSN leverages pre-existing weights from dense networks to achieve efficient learning in Task Incremental Learning (TIL) and Task-agnostic Incremental Learning (TaIL) scenarios. In Few-Shot Class Incremental Learning (FSCIL), a variant of WSN referred to as the Soft subnetwork (SoftNet) is designed to prevent overfitting when data samples are scarce. Furthermore, the sparse reuse of WSN weights is considered for Video Incremental Learning (VIL). We also consider the use of a Fourier Subneural Operator (FSO) within WSN, which enables compact encoding of videos and identifies reusable subnetworks across varying bandwidths. We integrate FSO into different architectural frameworks for continual learning, including VIL, TIL, and FSCIL. Comprehensive experiments demonstrate FSO's effectiveness, significantly improving task performance at various convolutional representational levels: FSO enhances higher-layer performance in TIL and FSCIL, and lower-layer performance in VIL.

Ziyu Guan; Wanqing Zhao; Hongmin Liu; Yuta Nakashima; Noboru Babaguchi; Xiaofei He, "Cross-Modal Guided Visual Representation Learning for Social Image Retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 2186-2198. DOI: 10.1109/TPAMI.2024.3519112. Published online 16 December 2024.
Abstract: Social images are often associated with rich but noisy tags from community contributions. Although social tags can potentially provide valuable semantic training information for image retrieval, existing studies fail to effectively filter noise by exploiting the cross-modal correlation between image content and tags. Current cross-modal vision-and-language representation learning methods, which selectively attend to the relevant parts of the image and text, show a promising direction. However, they are not suitable for social image retrieval because (1) they deal with natural text sequences, where the relationships between words can be easily captured by language models for cross-modal relevance estimation, whereas tags are isolated and noisy; and (2) they take an (image, text) pair as input and consequently cannot be employed directly for unimodal social image retrieval. This paper tackles the challenge of utilizing cross-modal interactions to learn precise representations for unimodal retrieval. The proposed framework, dubbed CGVR (Cross-modal Guided Visual Representation), extracts accurate semantic representations of images from noisy tags and transfers this ability to an image-only hashing subnetwork through a carefully designed training scheme. To capture correlated semantics and filter noise, it embeds a priori common-sense relationships among tags into the attention computation for joint awareness of textual and visual context. Experiments show that CGVR achieves improvements of approximately 8.82 and 5.45 points in MAP over the state of the art on two widely used social image benchmarks. CGVR can serve as a new baseline for the image retrieval community.

{"title":"Saliency-Free and Aesthetic-Aware Panoramic Video Navigation","authors":"Chenglizhao Chen;Guangxiao Ma;Wenfeng Song;Shuai Li;Aimin Hao;Hong Qin","doi":"10.1109/TPAMI.2024.3516874","DOIUrl":"10.1109/TPAMI.2024.3516874","url":null,"abstract":"Most of the existing panoramic video navigation approaches are saliency-driven, whereby off-the-shelf saliency detection tools are directly employed to aid the navigation approaches in localizing video content that should be incorporated into the navigation path. In view of the dilemma faced by our research community, we rethink if the “saliency clues” are really appropriate to serve the panoramic video navigation task. According to our in-depth investigation, we argue that using “saliency clues” cannot generate a satisfying navigation path, failing to well represent the given panoramic video, and the views in the navigation path are also low aesthetics. In this paper, we present a brand-new navigation paradigm. Although our model is still trained on eye-fixations, our methodology can additionally enable the trained model to perceive the “meaningful” degree of the given panoramic video content. Outwardly, the proposed new approach is saliency-free, but inwardly, it is developed from saliency but biasing more to be “meaningful-driven”; thus, it can generate a navigation path with more appropriate content coverage. Besides, this paper is the first attempt to devise an unsupervised learning scheme to ensure all localized meaningful views in the navigation path have high aesthetics. Thus, the navigation path generated by our approach can also bring users an enjoyable watching experience. As a new topic in its infancy, we have devised a series of quantitative evaluation schemes, including objective verifications and subjective user studies. All these innovative attempts would have great potential to inspire and promote this research field in the near future.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"2037-2054"},"PeriodicalIF":0.0,"publicationDate":"2024-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142821002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lei Sun; Daniel Gehrig; Christos Sakaridis; Mathias Gehrig; Jingyun Liang; Peng Sun; Zhijie Xu; Kaiwei Wang; Luc Van Gool; Davide Scaramuzza, "A Unified Framework for Event-Based Frame Interpolation With Ad-Hoc Deblurring in the Wild," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2265-2279. DOI: 10.1109/TPAMI.2024.3510690. Published online 11 December 2024.
Abstract: Effective video frame interpolation hinges on adept handling of motion in the input scene. Prior work exploits asynchronous event information for this purpose, but often overlooks whether motion induces blur in the video, limiting its scope to sharp-frame interpolation. We instead propose a unified framework for event-based frame interpolation that performs deblurring ad-hoc and thus works on both sharp and blurry input videos. Our model consists of a bidirectional recurrent network that incorporates the temporal dimension of interpolation and fuses information from the input frames and the events adaptively based on their temporal proximity. To enhance generalization from synthetic data to real event cameras, we integrate a self-supervised framework with the proposed model, improving performance on real-world datasets in the wild. At the dataset level, we introduce HighREV, a novel real-world high-resolution dataset with events and color videos, which provides a challenging evaluation setting for the examined task. Extensive experiments show that our network consistently outperforms previous state-of-the-art methods on frame interpolation, single-image deblurring, and the joint task of both. Experiments on domain transfer reveal that self-supervised training effectively mitigates the performance degradation observed when transitioning from synthetic to real-world data. Code and datasets are available at https://github.com/AHupuJR/REFID.

{"title":"MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis","authors":"Dewei Zhou;You Li;Fan Ma;Zongxin Yang;Yi Yang","doi":"10.1109/TPAMI.2024.3510752","DOIUrl":"10.1109/TPAMI.2024.3510752","url":null,"abstract":"We introduce the Multi-Instance Generation (MIG) task, which focuses on generating multiple instances within a single image, each accurately placed at predefined positions with attributes such as category, color, and shape, strictly following user specifications. MIG faces three main challenges: avoiding attribute leakage between instances, supporting diverse instance descriptions, and maintaining consistency in iterative generation. To address attribute leakage, we propose the Multi-Instance Generation Controller (MIGC). MIGC generates multiple instances through a divide-and-conquer strategy, breaking down multi-instance shading into single-instance tasks with singular attributes, later integrated. To provide more types of instance descriptions, we developed MIGC++. MIGC++ allows attribute control through text & images and position control through boxes & masks. Lastly, we introduced the Consistent-MIG algorithm to enhance the iterative MIG ability of MIGC and MIGC++. This algorithm ensures consistency in unmodified regions during the addition, deletion, or modification of instances, and preserves the identity of instances when their attributes are changed. We introduce the COCO-MIG and Multimodal-MIG benchmarks to evaluate these methods. Extensive experiments on these benchmarks, along with the COCO-Position benchmark and DrawBench, demonstrate that our methods substantially outperform existing techniques, maintaining precise control over aspects including position, attribute, and quantity.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"1714-1728"},"PeriodicalIF":0.0,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142809161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhiqi Li; Wenhai Wang; Hongyang Li; Enze Xie; Chonghao Sima; Tong Lu; Qiao Yu; Jifeng Dai, "BEVFormer: Learning Bird’s-Eye-View Representation From LiDAR-Camera via Spatiotemporal Transformers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 2020-2036. DOI: 10.1109/TPAMI.2024.3515454. Published online 11 December 2024.
Abstract: Multi-modality fusion is currently the de facto most competitive strategy for 3D perception tasks. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations from multi-modality data with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from both the point cloud and camera inputs, completing multi-modality information fusion in BEV space. For temporal information, we propose temporal self-attention to recurrently fuse historical BEV information. Compared with other fusion paradigms, the fusion method proposed in this work is both succinct and effective. Our approach achieves a new state of the art of 74.1% NDS on the nuScenes test set. In addition, we extend BEVFormer to a wide range of autonomous driving tasks, including object tracking, vectorized mapping, occupancy prediction, and end-to-end autonomous driving, achieving outstanding results across these tasks.

{"title":"Scale Propagation Network for Generalizable Depth Completion","authors":"Haotian Wang;Meng Yang;Xinhu Zheng;Gang Hua","doi":"10.1109/TPAMI.2024.3513440","DOIUrl":"10.1109/TPAMI.2024.3513440","url":null,"abstract":"Depth completion, inferring dense depth maps from sparse measurements, is crucial for robust 3D perception. Although deep learning based methods have made tremendous progress in this problem, these models cannot generalize well across different scenes that are unobserved in training, posing a fundamental limitation that yet to be overcome. A careful analysis of existing deep neural network architectures for depth completion, which are largely borrowing from successful backbones for image analysis tasks, reveals that a key design bottleneck actually resides in the conventional normalization layers. These normalization layers are designed, on one hand, to make training more stable, on the other hand, to build more visual invariance across scene scales. However, in depth completion, the scale is actually what we want to robustly estimate in order to better generalize to unseen scenes. To mitigate, we propose a novel scale propagation normalization (SP-Norm) method to propagate scales from input to output, and simultaneously preserve the normalization operator for easy convergence. More specifically, we rescale the input using learned features of a single-layer perceptron from the normalized input, rather than directly normalizing the input as conventional normalization layers. We then develop a new network architecture based on SP-Norm and the ConvNeXt V2 backbone. We explore the composition of various basic blocks and architectures to achieve superior performance and efficient inference for generalizable depth completion. Extensive experiments are conducted on six unseen datasets with various types of sparse depth maps, i.e., randomly sampled 0.1%/1%/10% valid pixels, 4/8/16/32/64-line LiDAR points, and holes from Structured-Light. Our model consistently achieves the best accuracy with faster speed and lower memory when compared to state-of-the-art methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"1908-1922"},"PeriodicalIF":0.0,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhile Yang; Shangqi Guo; Ying Fang; Zhaofei Yu; Jian K. Liu, "Spiking Variational Policy Gradient for Brain Inspired Reinforcement Learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 1975-1990. DOI: 10.1109/TPAMI.2024.3511936. Published online 9 December 2024.
Abstract: Recent studies in reinforcement learning have explored brain-inspired function approximators and learning algorithms to simulate brain intelligence and adapt to neuromorphic hardware. Among these approaches, reward-modulated spike-timing-dependent plasticity (R-STDP) is biologically plausible and energy-efficient, but it suffers from a gap between its local learning rules and the global learning objective, which limits its performance and applicability. In this paper, we design a recurrent winner-take-all network and propose spiking variational policy gradient (SVPG), a new R-STDP learning method derived theoretically from the global policy gradient. Specifically, policy inference is derived from an energy-based policy function using mean-field inference, and policy optimization is based on a last-step approximation of the global policy gradient. Together, these close the gap between the local learning rules and the global objective. In experiments including a challenging ViZDoom vision-based navigation task and two realistic robot control tasks, SVPG successfully solves all the tasks. In addition, SVPG exhibits better inherent robustness to various kinds of input, network-parameter, and environmental perturbations than the compared methods.

{"title":"Iteratively Capped Reweighting Norm Minimization With Global Convergence Guarantee for Low-Rank Matrix Learning","authors":"Zhi Wang;Dong Hu;Zhuo Liu;Chao Gao;Zhen Wang","doi":"10.1109/TPAMI.2024.3512458","DOIUrl":"10.1109/TPAMI.2024.3512458","url":null,"abstract":"In recent years, a large number of studies have shown that low rank matrix learning (LRML) has become a popular approach in machine learning and computer vision with many important applications, such as image inpainting, subspace clustering, and recommendation system. The latest LRML methods resort to using some surrogate functions as convex or nonconvex relaxation of the rank function. However, most of these methods ignore the difference between different rank components and can only yield suboptimal solutions. To alleviate this problem, in this paper we propose a novel nonconvex regularizer called capped reweighting norm minimization (CRNM), which not only considers the different contributions of different rank components, but also adaptively truncates sequential singular values. With it, a general LRML model is obtained. Meanwhile, under some mild conditions, the global optimum of CRNM regularized least squares subproblem can be easily obtained in closed-form. Through the analysis of the theoretical properties of CRNM, we develop a high computational efficiency optimization method with convergence guarantee to solve the general LRML model. More importantly, by using the Kurdyka-Łojasiewicz (KŁ) inequality, its local and global convergence properties are established. Finally, we show that the proposed nonconvex regularizer, as well as the optimization approach are suitable for different low rank tasks, such as matrix completion and subspace clustering. Extensive experimental results demonstrate that the constructed models and methods provide significant advantages over several state-of-the-art low rank matrix leaning models and methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"1923-1940"},"PeriodicalIF":0.0,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition","authors":"Yulin Wang;Haoji Zhang;Yang Yue;Shiji Song;Chao Deng;Junlan Feng;Gao Huang","doi":"10.1109/TPAMI.2024.3514654","DOIUrl":"10.1109/TPAMI.2024.3514654","url":null,"abstract":"This paper presents a comprehensive exploration of the phenomenon of data redundancy in video understanding, with the aim to improve computational efficiency. Our investigation commences with an examination of <italic>spatial redundancy</i>, which refers to the observation that the most informative region in each video frame usually corresponds to a small image patch, whose shape, size and location shift smoothly across frames. Motivated by this phenomenon, we formulate the patch localization problem as a dynamic decision task, and introduce a spatially adaptive video recognition approach, termed AdaFocus. In specific, a lightweight encoder is first employed to quickly process the full video sequence, whose features are then utilized by a policy network to identify the most task-relevant regions. Subsequently, the selected patches are inferred by a high-capacity deep network for the final prediction. The complete model can be trained conveniently in an end-to-end manner. During inference, once the informative patch sequence has been generated, the bulk of computation can be executed in parallel, rendering it efficient on modern GPU devices. Furthermore, we demonstrate that AdaFocus can be easily extended by further considering the <italic>temporal</i> and <italic>sample-wise</i> redundancies, i.e., allocating the majority of computation to the most task-relevant video frames, and minimizing the computation spent on relatively “easier” videos. Our resulting algorithm, Uni-AdaFocus, establishes a comprehensive framework that seamlessly integrates spatial, temporal, and sample-wise dynamic computation, while it preserves the merits of AdaFocus in terms of efficient end-to-end training and hardware friendliness. In addition, Uni-AdaFocus is general and flexible as it is compatible with off-the-shelf backbone models (e.g., TSM and X3D), which can be readily deployed as our feature extractor, yielding a significantly improved computational efficiency. Empirically, extensive experiments based on seven widely-used benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, Jester, and Kinetics-400) and three real-world application scenarios (i.e., fine-grained diving action classification, Alzheimer’s and Parkinson’s diseases diagnosis with brain magnetic resonance images (MRI), and violence recognition for online videos) substantiate that Uni-AdaFocus is considerably more efficient than the competitive baselines.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"1782-1799"},"PeriodicalIF":0.0,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}