FastHorovod: Expediting Parallel Message-Passing Schedule for Distributed DNN Training

2021 IEEE Symposium on Computers and Communications (ISCC). Publication date: 2021-09-05. DOI: 10.1109/ISCC53001.2021.9631443

Yanghai Wang, Dezun Dong, Yemao Xu, Shuo Ouyang, Xiangke Liao


Abstract

Large-scale deep neural network (DNN) training is widely deployed on dense-GPU public-cloud clusters. The intensive communication and synchronization cost of exchanging gradients and parameters is becoming the bottleneck of distributed deep learning training. Horovod is one of the most popular distributed communication frameworks for scaling out deep learning training on GPU clusters. Existing public-cloud GPU datacenters, such as Amazon EC2 and Alibaba GPU cloud, are usually equipped with commodity high-speed Ethernet and TCP networking. In current vanilla Horovod, however, we observe that each GPU device is associated with at most one proxy communication process, which handles all the parameter all-reduce communication operations for one or multiple GPUs. With this configuration, the TCP-based communication interface suffers from limited network goodput and incurs training performance penalties. In this paper, we make the first attempt to improve the message-passing interface of Horovod and to address the mismatch between computation and communication capability when Horovod is deployed in TCP-based public-cloud GPU clusters. We propose FastHorovod, which exploits more cost-efficient auxiliary communication processes on the CPU to expedite the parallel message-passing schedule for the GPU. We conduct extensive experiments against state-of-the-art Horovod. The experimental results show that our design can significantly accelerate distributed training communication on TCP-based public-cloud GPU clusters, and that FastHorovod improves the training speed of the AlexNet and VGG16 models by 64.5% and 72.6%, respectively.
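The general idea of spreading one all-reduce across several communication helpers can be illustrated with a toy sketch. This is not the authors' implementation and does not use the Horovod API: it assumes the gradient buffer can be partitioned into contiguous chunks, uses threads in one process as a stand-in for auxiliary CPU communication processes, and uses a plain element-wise sum as a stand-in for all-reduce over TCP. The names `split_chunks`, `reduce_chunk`, `parallel_allreduce`, and `num_comm_procs` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_chunks(n, parts):
    """Partition range(n) into `parts` contiguous (start, end) ranges of near-equal size."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional element.
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def reduce_chunk(worker_grads, start, end):
    """Element-wise sum of one chunk across all workers.

    Stand-in for the all-reduce that a single communication helper would
    perform over the network for its share of the buffer.
    """
    return [sum(g[i] for g in worker_grads) for i in range(start, end)]


def parallel_allreduce(worker_grads, num_comm_procs):
    """Sum gradients across workers, one chunk per (simulated) helper.

    Assumption: `num_comm_procs` >= 1 and all gradient lists have equal length.
    """
    n = len(worker_grads[0])
    bounds = split_chunks(n, num_comm_procs)
    with ThreadPoolExecutor(max_workers=num_comm_procs) as pool:
        # Each helper reduces its own chunk; map() preserves chunk order.
        parts = list(pool.map(lambda b: reduce_chunk(worker_grads, *b), bounds))
    # Concatenate the reduced chunks back into one buffer.
    return [x for part in parts for x in part]


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]]  # two simulated workers
    print(parallel_allreduce(grads, num_comm_procs=2))  # [11, 22, 33, 44, 55]
```

With `num_comm_procs=1` the sketch degenerates to the single-proxy arrangement described in the abstract. With larger values each helper handles a smaller chunk, which in a real TCP deployment would correspond to several concurrent connections rather than one.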
