EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training.

Yulin Wang, Yang Yue, Rui Lu, Yizeng Han, Shiji Song, Gao Huang
{"title":"EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training.","authors":"Yulin Wang, Yang Yue, Rui Lu, Yizeng Han, Shiji Song, Gao Huang","doi":"10.1109/TPAMI.2024.3401036","DOIUrl":null,"url":null,"abstract":"<p><p>The superior performance of modern computer vision backbones (e.g., vision Transformers learned on ImageNet-1 K/22 K) usually comes with a costly training procedure. This study contributes to this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models using easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize some 'easier-to-learn' discriminative patterns in the data. These patterns, when observed through frequency and spatial domains, incorporate lower-frequency components, and the natural image contents without distortion or data augmentation. Motivated by these findings, we propose a curriculum where the model always leverages all the training data at every learning stage, yet the exposure to the 'easier-to-learn' patterns of each example is initiated first, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. Then we show that exposing the contents of natural images can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these two aspects and design curriculum learning schedules by proposing tailored searching algorithms. 
Moreover, we present useful techniques for deploying our approach efficiently in challenging practical scenarios, such as large-scale parallel training, and limited input/output or data pre-processing speed. The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. As an off-the-shelf approach, it reduces the training time of various popular models (e.g., ResNet, ConvNeXt, DeiT, PVT, Swin, CSWin, and CAFormer) by [Formula: see text] on ImageNet-1 K/22 K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE).</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2024.3401036","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/11/6 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The superior performance of modern computer vision backbones (e.g., vision Transformers trained on ImageNet-1K/22K) usually comes with a costly training procedure. This study addresses this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models on easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize certain 'easier-to-learn' discriminative patterns in the data. Viewed in the frequency and spatial domains, these patterns comprise the lower-frequency components and the natural image contents free of distortion or data augmentation. Motivated by these findings, we propose a curriculum where the model always leverages all the training data at every learning stage, yet exposure to the 'easier-to-learn' patterns of each example comes first, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. We then show that exposing the natural contents of images can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these two aspects and design curriculum learning schedules by proposing tailored searching algorithms.
Moreover, we present useful techniques for deploying our approach efficiently in challenging practical scenarios, such as large-scale parallel training and limited input/output or data pre-processing speed. The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. As an off-the-shelf approach, it reduces the training time of various popular models (e.g., ResNet, ConvNeXt, DeiT, PVT, Swin, CSWin, and CAFormer) by [Formula: see text] on ImageNet-1K/22K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE). Code is available at https://github.com/LeapLabTHU/EfficientTrain.
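The Fourier-spectrum cropping described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it keeps only the central (lowest-frequency) `band` × `band` window of the shifted 2D spectrum and transforms back, yielding a smaller image that retains the low-frequency content. The `band` parameter is a hypothetical name for the curriculum's bandwidth, which would grow toward the full resolution as training progresses.

```python
import numpy as np

def low_frequency_crop(image: np.ndarray, band: int) -> np.ndarray:
    """Keep only the central band x band region of the 2D Fourier
    spectrum of `image` (shape (H, W) or (H, W, C)), i.e., its
    lowest-frequency components, and return the corresponding
    band x band spatial-domain image."""
    h, w = image.shape[:2]
    # Move to the frequency domain, with the zero frequency centered.
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    # Crop the central band x band window (the lowest frequencies).
    top, left = (h - band) // 2, (w - band) // 2
    cropped = spectrum[top:top + band, left:left + band]
    # Undo the shift and return to the spatial domain.
    out = np.fft.ifft2(np.fft.ifftshift(cropped, axes=(0, 1)), axes=(0, 1)).real
    # Rescale: fft2 sums over h*w samples while ifft2 divides by band^2,
    # so this factor keeps pixel intensities on the original scale.
    return out * (band * band) / (h * w)
```

Because the output is a genuinely smaller image, each training step on it is cheaper, which is what makes the curriculum save wall-clock time rather than merely filtering frequencies at full resolution.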
