{"title":"Slimmable Networks for Contrastive Self-supervised Learning","authors":"Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang","doi":"10.1007/s11263-024-02211-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02211-7","url":null,"abstract":"<p>Self-supervised learning makes significant progress in pre-training large models, but struggles with small models. Mainstream solutions to this problem rely mainly on knowledge distillation, which involves a two-stage procedure: first training a large teacher model and then distilling it to improve the generalization ability of smaller ones. In this work, we introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers, namely, slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks, including small ones with low computation costs. However, interference between weight-sharing networks leads to severe performance degradation in self-supervised cases, as evidenced by <i>gradient magnitude imbalance</i> and <i>gradient direction divergence</i>. The former indicates that a small proportion of parameters produce dominant gradients during backpropagation, while the main parameters may not be fully optimized. The latter shows that the gradient direction is disordered, and the optimization process is unstable. To address these issues, we introduce three techniques to make the main parameters produce dominant gradients and sub-networks have consistent outputs. These techniques include slow start training of sub-networks, online distillation, and loss re-weighting according to model sizes. Furthermore, theoretical results are presented to demonstrate that a single slimmable linear layer is sub-optimal during linear evaluation. Thus a switchable linear probe layer is applied during linear evaluation. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than previous arts with fewer parameters and FLOPs. The code is available at https://github.com/mzhaoshuai/SlimCLR.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"11 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mutual Prompt Leaning for Vision Language Models","authors":"Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu, Jingyuan Feng, Shengsheng Wang, Jingdong Wang","doi":"10.1007/s11263-024-02243-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02243-z","url":null,"abstract":"<p>Large pre-trained vision language models (VLMs) have demonstrated impressive representation learning capabilities, but their transferability across various downstream tasks heavily relies on prompt learning. Since VLMs consist of text and visual sub-branches, existing prompt approaches are mainly divided into text and visual prompts. Recent text prompt methods have achieved great performance by designing input-condition prompts that encompass both text and image domain knowledge. However, roughly incorporating the same image feature into each learnable text token may be unjustifiable, as it could result in learnable text prompts being concentrated on one or a subset of characteristics. In light of this, we propose a fine-grained text prompt (FTP) that decomposes the single global image features into several finer-grained semantics and incorporates them into corresponding text prompt tokens. On the other hand, current methods neglect valuable text semantic information when building the visual prompt. Furthermore, text information contains redundant and negative category semantics. To address this, we propose a text-reorganized visual prompt (TVP) that reorganizes the text descriptions of the current image to construct the visual prompt, guiding the image branch to attend to class-related representations. By leveraging both FTP and TVP, we enable mutual prompting between the text and visual modalities, unleashing their potential to tap into the representation capabilities of VLMs. Extensive experiments on 11 classification benchmarks show that our method surpasses existing methods by a large margin. In particular, our approach improves recent state-of-the-art CoCoOp by 4.79% on new classes and 3.88% on harmonic mean over eleven classification benchmarks.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"24 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang
{"title":"Robust Deep Object Tracking against Adversarial Attacks","authors":"Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang","doi":"10.1007/s11263-024-02226-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02226-0","url":null,"abstract":"<p>Addressing the vulnerability of deep neural networks (DNNs) has attracted significant attention in recent years. While recent studies on adversarial attack and defense mainly reside in a single image, few efforts have been made to perform temporal attacks against video sequences. As the temporal consistency between frames is not considered, existing adversarial attack approaches designed for static images do not perform well for deep object tracking. In this work, we generate adversarial examples on top of video sequences to improve the tracking robustness against adversarial attacks under white-box and black-box settings. To this end, we consider motion signals when generating lightweight perturbations over the estimated tracking results frame-by-frame. For the white-box attack, we generate temporal perturbations via known trackers to degrade significantly the tracking performance. We transfer the generated perturbations into unknown targeted trackers for the black-box attack to achieve transferring attacks. Furthermore, we train universal adversarial perturbations and directly add them into all frames of videos, improving the attack effectiveness with minor computational costs. On the other hand, we sequentially learn to estimate and remove the perturbations from input sequences to restore the tracking performance. We apply the proposed adversarial attack and defense approaches to state-of-the-art tracking algorithms. Extensive evaluations on large-scale benchmark datasets, including OTB, VOT, UAV123, and LaSOT, demonstrate that our attack method degrades the tracking performance significantly with favorable transferability to other backbones and trackers. Notably, the proposed defense method restores the original tracking performance to some extent and achieves additional performance gains when not under adversarial attacks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"2 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142321564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Breaking the Limits of Reliable Prediction via Generated Data","authors":"Zhen Cheng, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu","doi":"10.1007/s11263-024-02221-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02221-5","url":null,"abstract":"<p>In open-world recognition of safety-critical applications, providing reliable prediction for deep neural networks has become a critical requirement. Many methods have been proposed for reliable prediction related tasks such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective methods for improving reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of the breakthrough of generative models, we investigate whether and how expanding the training set using generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a notably extensive volume of generated data, can sometimes exhibit performance competitive to pre-training ViT with a substantial real dataset.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"18 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han
{"title":"FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention","authors":"Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han","doi":"10.1007/s11263-024-02227-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02227-z","url":null,"abstract":"<p>Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend identity among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions <i>with only forward passes</i>. To address the identity blending problem in the multi-subject generation, FastComposer proposes <i>cross-attention localization</i> supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes <i>delayed subject conditioning</i> in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300<span>(times )</span>–2500<span>(times )</span> speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available here (https://github.com/mit-han-lab/fastcomposer).</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"25 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep
{"title":"Lidar Panoptic Segmentation in an Open World","authors":"Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep","doi":"10.1007/s11263-024-02166-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02166-9","url":null,"abstract":"<p>Addressing Lidar Panoptic Segmentation (<i>LPS</i>) is crucial for safe deployment of autnomous vehicles. <i>LPS</i> aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including <span>thing</span> classes of countable objects (e.g., pedestrians and vehicles) and <span>stuff</span> classes of amorphous regions (e.g., vegetation and road). Importantly, <i>LPS</i> requires segmenting individual <span>thing</span> instances (<i>e.g</i>., every single vehicle). Current <i>LPS</i> methods make an unrealistic assumption that the semantic class vocabulary is <i>fixed</i> in the real open world, but in fact, class ontologies usually evolve over time as robots encounter instances of <i>novel</i> classes that are considered to be unknowns w.r.t. thepre-defined class vocabulary. To address this unrealistic assumption, we study <i>LPS</i> in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of <span>thing</span> and <span>stuff</span> classes can appear. This experimental setting leads to interesting conclusions. While prior art train class-specific instance segmentation methods and obtain state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (<i>i.e</i>., unknown classes). Unfortunately, these methods do not perform on-par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to Region Proposal Network (Ren et al. NeurIPS, 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"13 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical Active Learning for Low-Altitude Drone-View Object Detection","authors":"Haohao Hu, Tianyu Han, Yuerong Wang, Wanjun Zhong, Jingwei Yue, Peng Zan","doi":"10.1007/s11263-024-02228-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02228-y","url":null,"abstract":"<p>Various object detection techniques are employed on drone platforms. However, the task of annotating drone-view samples is both time-consuming and laborious. This is primarily due to the presence of numerous small-sized instances to be labeled in the drone-view image. To tackle this issue, we propose HALD, a hierarchical active learning approach for low-altitude drone-view object detection. HALD extracts unlabeled image information sequentially from different levels, including point, box, image, and class, aiming to obtain a reliable indicator of image information. The point-level module is utilized to ascertain the valid count and location of instances, while the box-level module screens out reliable predictions. The image-level module selects candidate samples by calculating the consistency of valid boxes within an image, and the class-level module selects the final selected samples based on the distribution of candidate and labeled samples across different classes. Extensive experiments conducted on the VisDrone and CityPersons datasets demonstrate that HALD outperforms several other baseline methods. Additionally, we provide an in-depth analysis of each proposed module. The results show that the performance of evaluating the informativeness of samples can be effectively improved by the four hierarchical levels.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142233294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zixin Wang, Yadan Luo, Liang Zheng, Zhuoxiao Chen, Sen Wang, Zi Huang
{"title":"In Search of Lost Online Test-Time Adaptation: A Survey","authors":"Zixin Wang, Yadan Luo, Liang Zheng, Zhuoxiao Chen, Sen Wang, Zi Huang","doi":"10.1007/s11263-024-02213-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02213-5","url":null,"abstract":"<p>This article presents a comprehensive survey of online test-time adaptation (OTTA), focusing on effectively adapting machine learning models to distributionally different target data upon batch arrival. Despite the recent proliferation of OTTA methods, conclusions from previous studies are inconsistent due to ambiguous settings, outdated backbones, and inconsistent hyperparameter tuning, which obscure core challenges and hinder reproducibility. To enhance clarity and enable rigorous comparison, we classify OTTA techniques into three primary categories and benchmark them using a modern backbone, the Vision Transformer. Our benchmarks cover conventional corrupted datasets such as CIFAR-10/100-C and ImageNet-C, as well as real-world shifts represented by CIFAR-10.1, OfficeHome, and CIFAR-10-Warehouse. The CIFAR-10-Warehouse dataset includes a variety of variations from different search engines and synthesized data generated through diffusion models. To measure efficiency in online scenarios, we introduce novel evaluation metrics, including GFLOPs, wall clock time, and GPU memory usage, providing a clearer picture of the trade-offs between adaptation accuracy and computational overhead. Our findings diverge from existing literature, revealing that (1) transformers demonstrate heightened resilience to diverse domain shifts, (2) the efficacy of many OTTA methods relies on large batch sizes, and (3) stability in optimization and resistance to perturbations are crucial during adaptation, particularly when the batch size is 1. Based on these insights, we highlight promising directions for future research. Our benchmarking toolkit and source code are available at https://github.com/Jo-wang/OTTA_ViT_survey.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"64 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142233295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lianghui Zhu, Xinggang Wang, Jiapei Feng, Tianheng Cheng, Yingyue Li, Bo Jiang, Dingwen Zhang, Junwei Han
{"title":"WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation","authors":"Lianghui Zhu, Xinggang Wang, Jiapei Feng, Tianheng Cheng, Yingyue Li, Bo Jiang, Dingwen Zhang, Junwei Han","doi":"10.1007/s11263-024-02224-2","DOIUrl":"https://doi.org/10.1007/s11263-024-02224-2","url":null,"abstract":"<p>Contrastive language and image pre-training (CLIP) achieves great success in various computer vision tasks and also presents an opportune avenue for enhancing weakly-supervised image understanding with its large-scale pre-trained knowledge. As an effective way to reduce the reliance on pixel-level human-annotated labels, weakly-supervised semantic segmentation (WSSS) aims to refine the class activation map (CAM) and produce high-quality pseudo masks. Weakly-supervised semantic segmentation (WSSS) aims to refine the class activation map (CAM) as pseudo masks, but heavily relies on inductive biases like hand-crafted priors and digital image processing methods. For the vision-language pre-trained model, i.e. CLIP, we propose a novel text-to-pixel matching paradigm for WSSS. However, directly applying CLIP to WSSS is challenging due to three critical problems: (1) the task gap between contrastive pre-training and WSSS CAM refinement, (2) lacking text-to-pixel modeling to fully utilize the pre-trained knowledge, and (3) the insufficient details owning to the <span>(frac{1}{16})</span> down-sampling resolution of ViT. Thus, we propose WeakCLIP to address the problems and leverage the pre-trained knowledge from CLIP to WSSS. Specifically, we first address the task gap by proposing a pyramid adapter and learnable prompts to extract WSSS-specific representation. We then design a co-attention matching module to model text-to-pixel relationships. Finally, the pyramid adapter and text-guided decoder are introduced to gather multi-level information and integrate it with text guidance hierarchically. WeakCLIP provides an effective and parameter-efficient way to transfer CLIP knowledge to refine CAM. Extensive experiments demonstrate that WeakCLIP achieves the state-of-the-art WSSS performance on standard benchmarks, i.e., 74.0% mIoU on the <i>val</i> set of PASCAL VOC 2012 and 46.1% mIoU on the <i>val</i> set of COCO 2014. The source code and model checkpoints are released at https://github.com/hustvl/WeakCLIP.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"21 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142138035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji
{"title":"Continual Face Forgery Detection via Historical Distribution Preserving","authors":"Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji","doi":"10.1007/s11263-024-02160-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02160-1","url":null,"abstract":"<p>Face forgery techniques have advanced rapidly and pose serious security threats. Existing face forgery detection methods try to learn generalizable features, but they still fall short of practical application. Additionally, finetuning these methods on historical training data is resource-intensive in terms of time and storage. In this paper, we focus on a novel and challenging problem: Continual Face Forgery Detection (CFFD), which aims to efficiently learn from new forgery attacks without forgetting previous ones. Specifically, we propose a Historical Distribution Preserving (HDP) framework that reserves and preserves the distributions of historical faces. To achieve this, we use universal adversarial perturbation (UAP) to simulate historical forgery distribution, and knowledge distillation to maintain the distribution variation of real faces across different models. We also construct a new benchmark for CFFD with three evaluation protocols. Our extensive experiments on the benchmarks show that our method outperforms the state-of-the-art competitors. Our code is available at https://github.com/skJack/HDP.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142131051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}