EmbRace: Accelerating Sparse Communication for Distributed Training of Deep Neural Networks

Proceedings of the 51st International Conference on Parallel Processing. Pub Date: 2021-10-18. DOI: 10.1145/3545008.3545011

Shengwei Li, Zhiquan Lai, Dongsheng Li, Xiangyu Ye, Yabo Duan


Citations: 3

Abstract

Distributed data-parallel training has been widely adopted for deep neural network (DNN) models. Although current deep learning (DL) frameworks scale well for dense models such as image classification models, we find that they scale relatively poorly for sparse models such as natural language processing (NLP) models, which have highly sparse embedding tables. Most existing works overlook the sparsity of model parameters and thus suffer from significant but unnecessary communication overhead. In this paper, we propose EmbRace, an efficient communication framework that accelerates communication in distributed training of sparse models. EmbRace introduces Sparsity-aware Hybrid Communication, which integrates AlltoAll and model parallelism into data-parallel training to reduce the communication overhead of highly sparse parameters. To effectively overlap sparse communication with both backward and forward computation, EmbRace further designs a 2D Communication Scheduling approach, which optimizes the model computation procedure, relaxes the dependency of embeddings, and schedules the sparse communications of each embedding row with a priority queue. We have implemented a prototype of EmbRace based on PyTorch and Horovod and conducted comprehensive evaluations with four representative NLP models. Experimental results show that EmbRace achieves up to 2.41× speedup over state-of-the-art distributed training baselines.
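
To make the communication argument concrete, the following is a minimal pure-Python simulation of an AlltoAll-style exchange of embedding-row gradients. It is an illustrative sketch only, not EmbRace's implementation: the sharding rule `owner_of` (row `r` lives on worker `r % world_size`), the helper names, and the toy batches are all assumptions introduced here. The dense figure is a rough row count that ignores how a real allreduce splits its traffic.

```python
from collections import defaultdict


def owner_of(row: int, world_size: int) -> int:
    """Hypothetical sharding rule (an assumption): row r lives on worker r % world_size."""
    return row % world_size


def route_sparse_grads(batches, world_size):
    """Simulate an AlltoAll-style exchange of embedding-row gradients.

    batches[w] is the list of token ids that worker w saw in this step.
    Returns inbox, where inbox[dst][row] counts the workers that
    contributed a gradient for that row to its owner dst.
    """
    inbox = [defaultdict(int) for _ in range(world_size)]
    for tokens in batches:
        # A worker produces one gradient per distinct row it touched.
        for row in set(tokens):
            inbox[owner_of(row, world_size)][row] += 1
    return inbox


def rows_sent(batches, world_size):
    """Count gradient rows that cross the network; rows a worker owns stay local."""
    return sum(
        1
        for src, tokens in enumerate(batches)
        for row in set(tokens)
        if owner_of(row, world_size) != src
    )


vocab_size, world_size = 10_000, 4
batches = [[1, 5, 5, 9], [2, 5, 8], [1, 3], [4, 7, 7]]  # toy token ids per worker

inbox = route_sparse_grads(batches, world_size)
sparse_rows = rows_sent(batches, world_size)
# Rough dense baseline: every worker contributes a full-table gradient.
dense_rows = vocab_size * world_size

print(f"sparse exchange: {sparse_rows} rows, dense baseline: {dense_rows} rows")
```

In this toy step only a handful of rows are touched, so routing each touched row to its owner moves far fewer rows than densifying the gradient of the whole table on every worker, which is the kind of unnecessary overhead the abstract refers to.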
