Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication

Qinghua Zhou, Quentin G. Anthony, Lang Xu, A. Shafi, M. Abduljabbar, H. Subramoni, Dhabaleswar K. Panda
Published in: 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2023-05-01
DOI: 10.1109/IPDPS54959.2023.00023 (https://doi.org/10.1109/IPDPS54959.2023.00023)
Citations: 2

Abstract

Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out data-parallel training of Deep Learning (DL) models. It shards the model parameters, gradients, and optimizer states among multiple GPUs. Consequently, it requires data-intensive Allgather and Reduce-Scatter communication to share the model parameters, which becomes a bottleneck. Existing schemes that use GPU-aware MPI libraries are highly prone to saturating the interconnect bandwidth. Integrating GPU-based compression into MPI libraries has therefore proven effective at reducing training time. In this paper, we propose an optimized Ring algorithm for the Allgather and Reduce-Scatter collectives that incorporates an efficient collective-level online compression scheme. At the microbenchmark level, Allgather achieves benefits of up to 83.6% and 30.3% compared to the baseline and to the existing point-to-point-based compression in a state-of-the-art MPI library, respectively, on modern GPU clusters. Reduce-Scatter achieves benefits of 88.1% and 40.6% compared to the baseline and point-to-point compression, respectively. For distributed DL training with PyTorch-FSDP, our approach yields 31.7% faster training than the baseline and up to 12.5% faster training than the existing point-to-point-based compression, while maintaining similar accuracy.
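
To make the communication pattern concrete, the following is a minimal single-process sketch of Ring Allgather and Ring Reduce-Scatter with a compression step. It is an illustration under stated assumptions, not the paper's implementation: the ranks are simulated as Python lists, and `compress`/`decompress` are hypothetical stand-ins that pack values as IEEE half-precision floats rather than using the GPU compressor from the paper. The sketch also assumes one plausible reading of "collective-level" compression: in Allgather each rank compresses its chunk once and forwards it unchanged around the ring, whereas in Reduce-Scatter each hop must decompress, reduce, and recompress.

```python
import struct


def compress(values):
    """Hypothetical stand-in compressor: pack floats as 16-bit half precision (lossy)."""
    return struct.pack(f"<{len(values)}e", *values)


def decompress(blob):
    """Inverse of the stand-in compressor: 2 bytes per value."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))


def ring_allgather(chunks):
    """chunks[r] is the list of floats owned by simulated rank r.

    Returns, for every rank, the list of all p chunks in rank order.
    """
    p = len(chunks)
    # Assumption: each rank compresses its own chunk exactly once, up front.
    held = [{r: compress(chunks[r])} for r in range(p)]
    for step in range(p - 1):
        # All ranks send "simultaneously": rank r forwards chunk (r - step) mod p,
        # still compressed, to its right neighbour.
        sends = [((r - step) % p, held[r][(r - step) % p]) for r in range(p)]
        for r, (idx, blob) in enumerate(sends):
            held[(r + 1) % p][idx] = blob
    # Decompression happens only after the ring has completed.
    return [[decompress(held[r][i]) for i in range(p)] for r in range(p)]


def ring_reduce_scatter(vectors):
    """vectors[r][c] is chunk c of simulated rank r's input (a list of floats).

    Returns, for every rank r, the element-wise sum of chunk r over all ranks.
    """
    p = len(vectors)
    acc = [[list(chunk) for chunk in vectors[r]] for r in range(p)]
    for step in range(p - 1):
        # Rank r sends its partial sum of chunk (r - 1 - step) mod p.
        sends = []
        for r in range(p):
            idx = (r - 1 - step) % p
            sends.append((idx, compress(acc[r][idx])))
        for r, (idx, blob) in enumerate(sends):
            dst = (r + 1) % p
            # A reduction needs real values, so every hop decompresses first.
            received = decompress(blob)
            acc[dst][idx] = [a + b for a, b in zip(acc[dst][idx], received)]
    return [acc[r][r] for r in range(p)]
```

Each collective takes p - 1 steps for p ranks, and every rank sends one chunk to its right neighbour per step, so compressing the chunk reduces the bytes on every link. With the half-precision stand-in, inputs that are not exactly representable in 16 bits come back rounded, which mirrors the accuracy trade-off of lossy compression in general.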