ScaMP: Scalable Meta-Parallelism for Deep Learning Search

Quentin G. Anthony, Lang Xu, A. Shafi, H. Subramoni, Dhabaleswar K. Panda
{"title":"ScaMP: Scalable Meta-Parallelism for Deep Learning Search","authors":"Quentin G. Anthony, Lang Xu, A. Shafi, H. Subramoni, Dhabaleswar K. Panda","doi":"10.1109/CCGrid57682.2023.00044","DOIUrl":null,"url":null,"abstract":"Deep Learning (DL) models are growing exponentially and require increasingly powerful High Performance Computing (HPC) systems to train them. Achieving state-of-the-art results requires carefully tuning the DL model architecture and training settings, which is a time-consuming process commonly relegated to distributed search frameworks and trial-and-error. However, search frameworks don't provide a flexible parallelism scheme within and among the chosen DL framework for modern out-of-core DL models. In this paper, we propose Scalable Meta-Parallelism for Deep Learning Search (ScaMP): a distributed Hyperparameter Optimization (HPO) and Neural Architecture Search (NAS) framework that supports out-of-core models with flexible parallelism schemes. SCaMP is integrated into the modern DL ecosystem, and enables both efficient parallel training of concurrent candidate architectures and aggregate device memory saturation via a powerful load balancing engine. SCaMP estimates the memory requirements of each candidate architecture and automatically applies the appropriate model-parallel degree and maximum batch size supported for the given candidate. Further, HPO and NAS with SCaMP are highly customizable via flexible configuration options. We evaluate the benefits of our designs on synthetic training benchmarks and in training a state-of-the-art vision transformer model. We select transformers as a candidate DL model type and demonstrate a 29% improvement in end-to-end HPO time on 32 V100 GPUs on the Lassen and ThetaGPU HPC systems. Further, we demonstrate a reduction in the proportion of NAS time spent in communication from 28% to 15%. Finally, we thoroughly verify the correctness of SCaMP by training a state-of-the-art SwinIR model.","PeriodicalId":363806,"journal":{"name":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid57682.2023.00044","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Deep Learning (DL) models are growing exponentially and require increasingly powerful High Performance Computing (HPC) systems to train them. Achieving state-of-the-art results requires carefully tuning the DL model architecture and training settings, a time-consuming process commonly relegated to distributed search frameworks and trial-and-error. However, existing search frameworks do not provide a flexible parallelism scheme within and across the chosen DL framework for modern out-of-core DL models. In this paper, we propose Scalable Meta-Parallelism for Deep Learning Search (ScaMP): a distributed Hyperparameter Optimization (HPO) and Neural Architecture Search (NAS) framework that supports out-of-core models with flexible parallelism schemes. ScaMP is integrated into the modern DL ecosystem and enables both efficient parallel training of concurrent candidate architectures and saturation of aggregate device memory via a powerful load-balancing engine. ScaMP estimates the memory requirements of each candidate architecture and automatically applies the appropriate model-parallel degree and the maximum batch size supported for that candidate. Further, HPO and NAS with ScaMP are highly customizable via flexible configuration options. We evaluate the benefits of our designs on synthetic training benchmarks and by training a state-of-the-art vision transformer model. Selecting transformers as the candidate DL model type, we demonstrate a 29% improvement in end-to-end HPO time on 32 V100 GPUs on the Lassen and ThetaGPU HPC systems, and a reduction in the proportion of NAS time spent in communication from 28% to 15%. Finally, we verify the correctness of ScaMP by training a state-of-the-art SwinIR model.
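The memory-aware planning described in the abstract (estimate a candidate's memory footprint, then pick a model-parallel degree and maximum batch size that fit on the available GPUs) can be illustrated with a minimal sketch. The sketch below is hypothetical: the class and function names, the per-parameter byte budget, and the transformer sizing heuristics are assumptions made for illustration, not ScaMP's actual implementation.

```python
# Hypothetical sketch of memory-aware candidate planning in the spirit of ScaMP.
# All names and constants below are illustrative assumptions, not ScaMP's code.
from dataclasses import dataclass

# Assumed bytes per parameter under mixed-precision Adam:
# fp16 weights + fp16 grads + fp32 master copy + fp32 momentum + fp32 variance.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

@dataclass
class Candidate:
    hidden_size: int
    num_layers: int
    seq_len: int
    vocab_size: int

def param_count(c: Candidate) -> int:
    """Rough transformer parameter count: ~12*h^2 per layer plus embeddings."""
    return c.num_layers * 12 * c.hidden_size ** 2 + c.vocab_size * c.hidden_size

def activation_bytes_per_sample(c: Candidate) -> int:
    """Crude per-sample fp16 activation estimate, ignoring recomputation."""
    return c.num_layers * c.seq_len * c.hidden_size * 16 * 2

def plan(c: Candidate, gpu_mem_bytes: int, max_mp: int = 8) -> tuple[int, int]:
    """Return (model_parallel_degree, max_batch_size) for one candidate."""
    static = param_count(c) * BYTES_PER_PARAM
    for mp in (d for d in (1, 2, 4, 8) if d <= max_mp):
        per_gpu_static = static // mp
        free = gpu_mem_bytes - per_gpu_static
        if free <= 0:
            continue  # model states alone overflow at this degree; try a larger one
        batch = free // (activation_bytes_per_sample(c) // mp)
        if batch >= 1:
            return mp, int(batch)
    raise ValueError("candidate does not fit even at the maximum model-parallel degree")

if __name__ == "__main__":
    cand = Candidate(hidden_size=4096, num_layers=32, seq_len=2048, vocab_size=50257)
    mp, bs = plan(cand, gpu_mem_bytes=16 * 2**30)  # e.g. a 16 GB V100
    print(f"model-parallel degree = {mp}, max per-GPU batch size = {bs}")
```

Under these assumed constants, a ~6.6B-parameter candidate would be assigned a model-parallel degree of 8 with a small per-GPU batch size, while smaller candidates would run data-parallel at degree 1 with a larger batch, which is the kind of per-candidate decision the paper attributes to ScaMP's load-balancing engine.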