Slimmable Networks for Contrastive Self-supervised Learning

IF 11.6 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2024-09-26 DOI:10.1007/s11263-024-02211-7

Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang

{"title":"Slimmable Networks for Contrastive Self-supervised Learning","authors":"Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang","doi":"10.1007/s11263-024-02211-7","DOIUrl":null,"url":null,"abstract":"Self-supervised learning makes significant progress in pre-training large models, but struggles with small models. Mainstream solutions to this problem rely mainly on knowledge distillation, which involves a two-stage procedure: first training a large teacher model and then distilling it to improve the generalization ability of smaller ones. In this work, we introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers, namely, slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks, including small ones with low computation costs. However, interference between weight-sharing networks leads to severe performance degradation in self-supervised cases, as evidenced by gradient magnitude imbalance and gradient direction divergence. The former indicates that a small proportion of parameters produce dominant gradients during backpropagation, while the main parameters may not be fully optimized. The latter shows that the gradient direction is disordered, and the optimization process is unstable. To address these issues, we introduce three techniques to make the main parameters produce dominant gradients and sub-networks have consistent outputs. These techniques include slow start training of sub-networks, online distillation, and loss re-weighting according to model sizes. Furthermore, theoretical results are presented to demonstrate that a single slimmable linear layer is sub-optimal during linear evaluation. Thus a switchable linear probe layer is applied during linear evaluation. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than previous arts with fewer parameters and FLOPs. The code is available at https://github.com/mzhaoshuai/SlimCLR.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"11 1","pages":""},"PeriodicalIF":11.6000,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-024-02211-7","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Self-supervised learning makes significant progress in pre-training large models, but struggles with small models. Mainstream solutions to this problem rely mainly on knowledge distillation, which involves a two-stage procedure: first training a large teacher model and then distilling it to improve the generalization ability of smaller ones. In this work, we introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers, namely, slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks, including small ones with low computation costs. However, interference between weight-sharing networks leads to severe performance degradation in self-supervised cases, as evidenced by gradient magnitude imbalance and gradient direction divergence. The former indicates that a small proportion of parameters produce dominant gradients during backpropagation, while the main parameters may not be fully optimized. The latter shows that the gradient direction is disordered, and the optimization process is unstable. To address these issues, we introduce three techniques to make the main parameters produce dominant gradients and sub-networks have consistent outputs. These techniques include slow start training of sub-networks, online distillation, and loss re-weighting according to model sizes. Furthermore, theoretical results are presented to demonstrate that a single slimmable linear layer is sub-optimal during linear evaluation. Thus a switchable linear probe layer is applied during linear evaluation. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than previous arts with fewer parameters and FLOPs. The code is available at https://github.com/mzhaoshuai/SlimCLR.

Abstract Image

查看原文本刊更多论文

用于对比式自我监督学习的可瘦网络

自我监督学习在预训练大型模型方面取得了重大进展，但在小型模型方面却举步维艰。解决这一问题的主流方案主要依赖于知识提炼，这涉及一个两阶段的过程：首先训练一个大型教师模型，然后对其进行提炼，以提高小型模型的泛化能力。在这项工作中，我们引入了另一种单阶段解决方案，即用于对比性自监督学习的可纤化网络（SlimCLR），无需额外的教师，即可获得预先训练好的小型模型。Slimmable 网络由一个完整网络和多个权重共享子网络组成，这些子网络可以通过一次预训练获得各种网络，包括计算成本较低的小型网络。然而，在自我监督情况下，权重共享网络之间的干扰会导致性能严重下降，具体表现为梯度幅度不平衡和梯度方向发散。前者表明，在反向传播过程中，一小部分参数产生了主导梯度，而主要参数可能没有完全优化。后者表明梯度方向紊乱，优化过程不稳定。为了解决这些问题，我们引入了三种技术，使主参数产生主导梯度，子网络具有一致的输出。这些技术包括子网络的慢速启动训练、在线蒸馏和根据模型大小进行损失再加权。此外，理论结果表明，在线性评估过程中，单个可纤细线性层是次优的。因此，在线性评估过程中应用了可切换线性探测层。我们将 SlimCLR 与典型的对比学习框架进行了实例化，并在参数和 FLOP 更少的情况下取得了比以前的技术更好的性能。代码见 https://github.com/mzhaoshuai/SlimCLR。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

International Journal of Computer Vision 工程技术-计算机：人工智能

CiteScore

29.80

自引率

2.10%

发文量

163

审稿时长

6 months

期刊介绍： The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal encompasses various types of articles to cater to different research outputs. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. These articles showcase substantial progress in computer vision. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. They provide a quicker means for sharing new findings with the computer vision community. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or offer tutorial presentations of relevant topics. These articles provide comprehensive and insightful overviews of specific subject areas. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. These contributions serve to complement the technical content and provide valuable perspectives. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software. This additional material enhances the understanding and reproducibility of the published research. Overall, the International Journal of Computer Vision is a comprehensive publication that caters to researchers in this rapidly growing field. It covers a range of article types, offers additional online resources, and facilitates the dissemination of impactful research.