Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need
Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu
International Journal of Computer Vision, published 2024-08-31. DOI: 10.1007/s11263-024-02218-0

Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones. Traditional CIL models are trained from scratch to continually acquire knowledge as data evolves. Recently, pre-training has achieved substantial progress, making vast pre-trained models (PTMs) accessible for CIL. In contrast to traditional methods, PTMs possess generalizable embeddings that can be readily transferred for CIL. In this work, we revisit CIL with PTMs and argue that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transfer. (1) We first reveal that a frozen PTM can already provide generalizable embeddings for CIL. Surprisingly, a simple baseline (SimpleCIL), which continually sets the classifiers of the PTM to prototype features, can beat the state of the art even without training on the downstream task. (2) Due to the distribution gap between pre-trained and downstream datasets, the PTM can be further cultivated with adaptivity via model adaptation. We propose AdaPt and mERge (Aper), which aggregates the embeddings of the PTM and the adapted model for classifier construction. Aper is a general framework that can be orthogonally combined with any parameter-efficient tuning method, retaining the PTM's generalizability and the adapted model's adaptivity. (3) Additionally, since previous ImageNet-based benchmarks are unsuitable in the era of PTMs due to data overlap, we propose four new benchmarks for assessment, namely ImageNet-A, ObjectNet, OmniBenchmark, and VTAB. Extensive experiments validate the effectiveness of Aper within a unified and concise framework. Code is available at https://github.com/zhoudw-zdw/RevisitingCIL.

Systematic Evaluation of Uncertainty Calibration in Pretrained Object Detectors
Denis Huseljic, Marek Herde, Paul Hahn, Mehmet Müjde, Bernhard Sick
International Journal of Computer Vision, published 2024-08-31. DOI: 10.1007/s11263-024-02219-z

In the field of deep learning-based computer vision, the development of deep object detection has led to unique paradigms (e.g., two-stage or set-based) and architectures (e.g., Faster-RCNN or DETR) that enable outstanding performance on challenging benchmark datasets. Despite this, trained object detectors typically do not reliably assess uncertainty regarding their own knowledge, and the quality of their probabilistic predictions is usually poor. As these predictions are often used to make subsequent decisions, such inaccuracies must be avoided. In this work, we investigate the uncertainty calibration properties of different pretrained object detection architectures in a multi-class setting. We propose a framework to ensure a fair, unbiased, and repeatable evaluation and conduct detailed analyses assessing calibration under distributional changes (e.g., distributional shift and application to out-of-distribution data). Furthermore, by investigating the influence of different detector paradigms, post-processing steps, and suitable choices of metrics, we deliver novel insights into why poor detector calibration emerges. Based on these insights, we are able to improve the calibration of a detector by simply finetuning its last layer.

Lightweight High-Speed Photography Built on Coded Exposure and Implicit Neural Representation of Videos
Zhihong Zhang, Runzhao Yang, Jinli Suo, Yuxiao Cheng, Qionghai Dai
International Journal of Computer Vision, published 2024-08-30. DOI: 10.1007/s11263-024-02198-1

The demand for compact cameras capable of recording high-speed scenes at high resolution is steadily increasing. However, achieving such capabilities often entails high bandwidth requirements, resulting in bulky, heavy systems unsuitable for low-capacity platforms. To address this challenge, leveraging a coded exposure setup to encode a frame sequence into a blurry snapshot and subsequently retrieve the latent sharp video presents a lightweight solution. Nevertheless, restoring motion from blur remains a formidable challenge due to the inherent ill-posedness of motion blur decomposition, the intrinsic ambiguity in motion direction, and the diverse motions present in natural videos. In this study, we propose a novel approach to address these challenges by combining the classical coded exposure imaging technique with the emerging implicit neural representation for videos. We strategically embed motion direction cues into the blurry image during the imaging process. Additionally, we develop a novel implicit neural representation-based blur decomposition network to sequentially extract the latent video frames from the blurry image, leveraging the embedded motion direction cues. To validate the effectiveness and efficiency of our proposed framework, we conduct extensive experiments using benchmark datasets and real-captured blurry images. The results demonstrate that our approach significantly outperforms existing methods in terms of both quality and flexibility. The code for our work is available at https://github.com/zhihongz/BDINR.

Learning General and Specific Embedding with Transformer for Few-Shot Object Detection
Xu Zhang, Zhe Chen, Jing Zhang, Tongliang Liu, Dacheng Tao
International Journal of Computer Vision, published 2024-08-28. DOI: 10.1007/s11263-024-02199-0

Few-shot object detection (FSOD) studies how to effectively detect novel objects from only a few annotated examples. Recently, it has been demonstrated that decent feature embeddings, including general feature embeddings that are more invariant to visual changes and specific feature embeddings that are more discriminative across object classes, are both important for FSOD. However, current methods lack appropriate mechanisms to combine both types of feature embeddings according to their importance for detecting objects of novel classes, which may result in sub-optimal performance. In this paper, to achieve more effective FSOD, we explicitly encode both general and specific feature embeddings using learnable tensors and apply a Transformer to incorporate them into FSOD according to their relations to the input object features. We thus propose a Transformer-based general and specific embedding learning (T-GSEL) method for FSOD. In T-GSEL, learnable tensors are employed in a three-stage pipeline, encoding feature embeddings at the general, intermediate, and specific levels, respectively. In each stage, we apply a Transformer to first model the relations of the corresponding embedding to the input object features and then use the estimated relations to refine the input features. Meanwhile, we further introduce cross-stage connections between embeddings of different stages so that they complement and cooperate with each other, delivering general, intermediate, and specific feature embeddings stage by stage and utilizing them together for feature refinement in FSOD. In practice, a T-GSEL module is easy to inject. Extensive empirical results further show that our proposed T-GSEL method achieves compelling FSOD performance on both PASCAL VOC and MS COCO datasets compared with other state-of-the-art approaches.
{"title":"Learning Box Regression and Mask Segmentation Under Long-Tailed Distribution with Gradient Transfusing","authors":"Tao Wang, Li Yuan, Xinchao Wang, Jiashi Feng","doi":"10.1007/s11263-024-02104-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02104-9","url":null,"abstract":"<p>Learning object detectors under long-tailed data distribution is challenging and has been widely studied recently, the prior works mainly focus on balancing the learning signal of classification task such that samples from tail object classes are effectively recognized. However, the learning difficulty of other class-wise tasks including bounding box regression and mask segmentation are not explored before. In this work, we investigate how long-tailed distribution affects the optimization of box regression and mask segmentation tasks. We find that although the standard class-wise box regression and mask segmentation offer strong class-specific prediction, they suffer from limited training signal and instability on the tail object classes. Aiming to address the limitation, our insight is that the knowledge of box regression and object segmentation is naturally shared across classes. We thus develop a cross class gradient transfusing (CRAT) approach to transfer the abundant training signal from head classes to help the training of sample-scarce tail classes. The transferring process is guided by the Fisher information to aggregate useful signals. CRAT can be seamlessly integrated into existing end-to-end or decoupled long-tailed object detection pipelines to robustly learn class-wise box regression and mask segmentation under long-tailed distribution. Our method improves the state-of-the-art long-tailed object detection and instance segmentation models with an average of 3.0 tail AP on the LVIS benchmark. The code implementation will be available at https://github.com/twangnh/CRAT</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"5 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142085526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AROID: Improving Adversarial Robustness Through Online Instance-Wise Data Augmentation","authors":"Lin Li, Jianing Qiu, Michael Spratling","doi":"10.1007/s11263-024-02206-4","DOIUrl":"https://doi.org/10.1007/s11263-024-02206-4","url":null,"abstract":"<p>Deep neural networks are vulnerable to adversarial examples. Adversarial training (AT) is an effective defense against adversarial examples. However, AT is prone to overfitting which degrades robustness substantially. Recently, data augmentation (DA) was shown to be effective in mitigating robust overfitting if appropriately designed and optimized for AT. This work proposes a new method to automatically learn online, instance-wise, DA policies to improve robust generalization for AT. This is the first automated DA method specific for robustness. A novel policy learning objective, consisting of Vulnerability, Affinity and Diversity, is proposed and shown to be sufficiently effective and efficient to be practical for automatic DA generation during AT. Importantly, our method dramatically reduces the cost of policy search from the 5000 h of AutoAugment and the 412 h of IDBH to 9 h, making automated DA more practical to use for adversarial robustness. This allows our method to efficiently explore a large search space for a more effective DA policy and evolve the policy as training progresses. Empirically, our method is shown to outperform all competitive DA methods across various model architectures and datasets. Our DA policy reinforced vanilla AT to surpass several state-of-the-art AT methods regarding both accuracy and robustness. It can also be combined with those advanced AT methods to further boost robustness. Code and pre-trained models are available at: https://github.com/TreeLLi/AROID.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"25 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142085114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

R²S100K: Road-Region Segmentation Dataset for Semi-supervised Autonomous Driving in the Wild
Muhammad Atif Butt, Hassan Ali, Adnan Qayyum, Waqas Sultani, Ala Al-Fuqaha, Junaid Qadir
International Journal of Computer Vision, published 2024-08-23. DOI: 10.1007/s11263-024-02207-3

Semantic understanding of roadways is a key enabling factor for safe autonomous driving. However, existing autonomous driving datasets provide well-structured urban roads while ignoring unstructured roadways containing distress, potholes, water puddles, and various kinds of road patches (e.g., earthen, gravel). To this end, we introduce the Road Region Segmentation dataset (R²S100K), a large-scale dataset and benchmark for training and evaluating road segmentation on the aforementioned challenging unstructured roadways. R²S100K comprises 100K images extracted from a large and diverse set of video sequences covering more than 1000 km of roadways. Of these 100K privacy-respecting images, 14,000 have fine pixel-level labeling of road regions, and the remaining 86,000 unlabeled images can be leveraged through semi-supervised learning methods. Alongside, we present an Efficient Data Sampling based self-training framework that improves learning by leveraging unlabeled data. Our experimental results demonstrate that the proposed method significantly improves the generalizability of learning methods and reduces the labeling cost for semantic segmentation tasks. Our benchmark will be publicly available to facilitate future research at https://r2s100k.github.io/.

IMC-Det: Intra–Inter Modality Contrastive Learning for Video Object Detection
Qiang Qi, Zhenyu Qiu, Yan Yan, Yang Lu, Hanzi Wang
International Journal of Computer Vision, published 2024-08-23. DOI: 10.1007/s11263-024-02201-9

Video object detection is an important yet challenging task in the computer vision field. One limitation of off-the-shelf video object detection methods is that they only explore information from the visual modality, without considering the semantic knowledge of the textual modality due to the large inter-modality discrepancies, resulting in limited detection performance. In this paper, we propose a novel intra–inter modality contrastive learning network for high-performance video object detection (IMC-Det), which includes three substantial improvements over existing methods. First, we design an intra-modality contrastive learning module to pull similar features close while pushing dissimilar ones apart, enabling our IMC-Det to learn more discriminative feature representations. Second, we develop a graph relational feature aggregation module to effectively model the structural relations between features by leveraging cross-graph learning and residual graph convolution, which is conducive to more effective feature aggregation in the spatio-temporal domain. Third, we present an inter-modality contrastive learning module to enforce visual features belonging to the same classes to gather compactly around the corresponding textual semantic representations, endowing our IMC-Det with better object classification capability. We conduct extensive experiments on the challenging ImageNet VID dataset, and the experimental results demonstrate that our IMC-Det performs favorably against existing state-of-the-art methods. More remarkably, our IMC-Det achieves 85.5% mAP and 86.7% mAP with ResNet-101 and ResNeXt-101, respectively.
{"title":"Toward Accurate and Robust Pedestrian Detection via Variational Inference","authors":"Huanyu He, Weiyao Lin, Yuang Zhang, Tianyao He, Yuxi Li, Jianguo Li","doi":"10.1007/s11263-024-02216-2","DOIUrl":"https://doi.org/10.1007/s11263-024-02216-2","url":null,"abstract":"<p>Pedestrian detection is notoriously considered a challenging task due to the frequent occlusion between humans. Unlike generic object detection, pedestrian detection involves a single category but dense instances, making it crucial to achieve accurate and robust object localization. By analogizing instance-level localization to a variational autoencoder and regarding the dense proposals as the latent variables, we establish a unique perspective of formulating pedestrian detection as a variational inference problem. From this vantage, we propose the Variational Pedestrian Detector (VPD), which uses a probabilistic model to estimate the true posterior of inferred proposals and applies a reparameterization trick to approximate the expected detection likelihood. In order to adapt the variational inference problem to the case of pedestrian detection, we propose a series of customized designs to cope with the issue of occlusion and spatial vibration. Specifically, we propose the Normal Gaussian and its variant of the Mixture model to parameterize the posterior in complicated scenarios. The inferred posterior is regularized by a conditional prior related to the ground-truth distribution, thus directly coupling the latent variables to specific target objects. Based on the posterior distribution, maximum detection likelihood estimation is applied to optimize the pedestrian detector, where a lightweight statistic decoder is designed to cast the detection likelihood into a parameterized form and enhance the confidence score estimation. With this variational inference process, VPD endows each proposal with the discriminative ability from its adjacent distractor due to the disentangling nature of the latent variable in variational inference, achieving accurate and robust detection in crowded scenes. Experiments conducted on CrowdHuman, CityPersons, and MS COCO demonstrate that our method is not only plug-and-play for numerous popular single-stage methods and two-stage methods but also can achieve a remarkable performance gain in highly occluded scenarios. The code for this project can be found at https://github.com/hhy-ee/VPD.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"5 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142022050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Rank Transformer for High-Resolution Hyperspectral Computational Imaging","authors":"Yuanye Liu, Renwei Dian, Shutao Li","doi":"10.1007/s11263-024-02203-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02203-7","url":null,"abstract":"<p>Spatial-spectral fusion aims to obtain high-resolution hyperspectral image (HR-HSI) by fusing low-resolution hyperspectral image (LR-HSI) and high-resolution multispectral image (MSI). Recently, many convolutional neural network (CNN)-based methods have achieved excellent results. However, these methods only consider local contextual information, which limits the fusion performance. Although some Transformer-based methods overcome this problem, they ignore some intrinsic characteristics of HR-HSI, such as spatial low-rank characteristics, resulting in large parameters and high computational cost. To address this problem, we propose a low-rank Transformer network (LRTN) for spatial-spectral fusion. LRTN can make full use of the spatial prior of MSI and the spectral prior of LR-HSI, thereby achieving outstanding fusion performance. Specifically, in the feature extraction stage, we utilize the cross-attention mechanism to force the model to focus on spatial information that is not available in LR-HSI and spectral information that is not available in MSI. In the feature fusion stage, we carefully design a self-attention mechanism guided by spatial and spectral priors to improve spatial and spectral fidelity. Moreover, we present a novel spatial low-rank cross-attention module, which can better capture global spatial information compared to other Transformer structures. In this module, we combine the matrix factorization theorem to fully exploit the spatial low-rank characteristics of HSI, which reduces parameters and computational cost while ensuring fusion quality. Experiments on several datasets demonstrate that our method outperforms the current state-of-the-art spatial-spectral fusion methods.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"144 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142007487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}