MCR-DL: Mix-and-Match Communication Runtime for Deep Learning

Quentin G. Anthony, A. Awan, Jeff Rasley, Yuxiong He, A. Shafi, M. Abduljabbar, H. Subramoni, D. Panda

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), March 15, 2023. DOI: 10.1109/IPDPS54959.2023.00103
In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, and necessitated distribution among processors. Training such massive models necessitates advanced parallelism strategies [1], [2] to maintain efficiency. However, such distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales. Examples of models using advanced parallelism strategies include Deep Learning Recommendation Models (DLRM) [3] and Mixture-of-Experts (MoE) [4], [5]. Communication libraries’ performance varies wildly across different communication operations, scales, and message sizes. We propose MCR-DL: an extensible DL communication framework that supports all point-to-point and collective operations while enabling users to dynamically mix-and-match communication backends for a given operation without deadlocks. MCR-DL also comes packaged with a tuning suite for dynamically selecting the best communication backend for a given input tensor. We select DeepSpeed-MoE and DLRM as candidate DL models and demonstrate a 31% improvement in DS-MoE throughput on 256 V100 GPUs on the Lassen HPC system. Further, we achieve a 20% throughput improvement in a dense Megatron-DeepSpeed model and a 25% throughput improvement in DLRM on 32 A100 GPUs with the Theta-GPU HPC system.
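To make the mix-and-match idea concrete, below is a minimal, self-contained Python sketch of per-operation backend dispatch driven by a tuned message-size table. All names here (TUNED_TABLE, BACKENDS, dispatch) are illustrative assumptions, not the actual MCR-DL implementation or API.

# Hypothetical sketch of the "mix-and-match" dispatch idea from the
# abstract: route each communication operation to whichever backend a
# tuning pass found fastest for that operation and message size.
# TUNED_TABLE, BACKENDS, and dispatch are assumed names for
# illustration, not the real MCR-DL API.

TUNED_TABLE = {
    # op -> sorted list of (max message size in bytes, backend name).
    # A real table would be populated per machine by an offline
    # tuning sweep such as the tuning suite the paper describes.
    "allreduce": [(65536, "mpi"), (float("inf"), "nccl")],
    "send":      [(4096, "mpi"), (float("inf"), "nccl")],
}

BACKENDS = {
    # Stand-ins for real backends (e.g. MPI via mpi4py, NCCL via
    # torch.distributed); here they just report what they would do.
    "mpi":  lambda op, buf: print(f"[mpi]  {op}: {len(buf)} bytes"),
    "nccl": lambda op, buf: print(f"[nccl] {op}: {len(buf)} bytes"),
}

def select_backend(op, nbytes):
    """Return the first backend whose size bracket covers nbytes."""
    for max_size, backend in TUNED_TABLE[op]:
        if nbytes <= max_size:
            return backend
    raise ValueError(f"no backend tuned for {op} at {nbytes} bytes")

def dispatch(op, buf):
    """Route one communication call through the tuned backend."""
    BACKENDS[select_backend(op, len(buf))](op, buf)

if __name__ == "__main__":
    dispatch("allreduce", b"x" * 512)        # small message -> mpi
    dispatch("allreduce", b"x" * 1_000_000)  # large message -> nccl

In a real system the backends would wrap actual communication libraries (for example, NCCL through torch.distributed and MPI through mpi4py), and mixing them per operation without deadlocks is the harder problem the framework addresses; the sketch only shows the selection logic.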