Is Network the Bottleneck of Distributed Training?

Zhen Zhang, Chaokun Chang, Haibin Lin, Yida Wang, R. Arora, Xin Jin
{"title":"网络是分布式训练的瓶颈吗?","authors":"Zhen Zhang, Chaokun Chang, Haibin Lin, Yida Wang, R. Arora, Xin Jin","doi":"10.1145/3405671.3405810","DOIUrl":null,"url":null,"abstract":"Recently there has been a surge of research on improving the communication efficiency of distributed training. However, little work has been done to systematically understand whether the network is the bottleneck and to what extent. In this paper, we take a first-principles approach to measure and analyze the network performance of distributed training. As expected, our measurement confirms that communication is the component that blocks distributed training from linear scale-out. However, contrary to the common belief, we find that the network is running at low utilization and that if the network can be fully utilized, distributed training can achieve a scaling factor of close to one. Moreover, while many recent proposals on gradient compression advocate over 100x compression ratio, we show that under full network utilization, there is no need for gradient compression in 100 Gbps network. On the other hand, a lower speed network like 10 Gbps requires only 2x-5x gradients compression ratio to achieve almost linear scale-out. Compared to application-level techniques like gradient compression, network-level optimizations do not require changes to applications and do not hurt the performance of trained models. As such, we advocate that the real challenge of distributed training is for the network community to develop high-performance network transport to fully utilize the network capacity and achieve linear scale-out.","PeriodicalId":254313,"journal":{"name":"Proceedings of the Workshop on Network Meets AI & ML","volume":"58 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"50","resultStr":"{\"title\":\"Is Network the Bottleneck of Distributed Training?\",\"authors\":\"Zhen Zhang, Chaokun Chang, Haibin Lin, Yida Wang, R. Arora, Xin Jin\",\"doi\":\"10.1145/3405671.3405810\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently there has been a surge of research on improving the communication efficiency of distributed training. However, little work has been done to systematically understand whether the network is the bottleneck and to what extent. In this paper, we take a first-principles approach to measure and analyze the network performance of distributed training. As expected, our measurement confirms that communication is the component that blocks distributed training from linear scale-out. However, contrary to the common belief, we find that the network is running at low utilization and that if the network can be fully utilized, distributed training can achieve a scaling factor of close to one. Moreover, while many recent proposals on gradient compression advocate over 100x compression ratio, we show that under full network utilization, there is no need for gradient compression in 100 Gbps network. On the other hand, a lower speed network like 10 Gbps requires only 2x-5x gradients compression ratio to achieve almost linear scale-out. Compared to application-level techniques like gradient compression, network-level optimizations do not require changes to applications and do not hurt the performance of trained models. 
As such, we advocate that the real challenge of distributed training is for the network community to develop high-performance network transport to fully utilize the network capacity and achieve linear scale-out.\",\"PeriodicalId\":254313,\"journal\":{\"name\":\"Proceedings of the Workshop on Network Meets AI & ML\",\"volume\":\"58 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-06-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"50\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Workshop on Network Meets AI & ML\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3405671.3405810\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Workshop on Network Meets AI & ML","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3405671.3405810","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 50

Abstract

Recently there has been a surge of research on improving the communication efficiency of distributed training. However, little work has been done to systematically understand whether the network is the bottleneck and to what extent. In this paper, we take a first-principles approach to measure and analyze the network performance of distributed training. As expected, our measurement confirms that communication is the component that blocks distributed training from linear scale-out. However, contrary to common belief, we find that the network is running at low utilization and that, if the network could be fully utilized, distributed training would achieve a scaling factor close to one. Moreover, while many recent proposals on gradient compression advocate over 100x compression ratios, we show that under full network utilization there is no need for gradient compression in a 100 Gbps network. On the other hand, a lower-speed network like 10 Gbps requires only a 2x-5x gradient compression ratio to achieve almost linear scale-out. Compared to application-level techniques like gradient compression, network-level optimizations do not require changes to applications and do not hurt the performance of trained models. As such, we advocate that the real challenge of distributed training is for the network community to develop high-performance network transport that fully utilizes the network capacity and achieves linear scale-out.
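
As a rough illustration of the bandwidth arithmetic behind these claims, the Python sketch below estimates the compression ratio needed for gradient exchange to fit within per-step compute time under a simple ring all-reduce cost model. The model size, step time, worker count, and cost model are illustrative assumptions, not measurements from the paper.

# Back-of-envelope estimate of the gradient compression needed for near-linear
# scale-out. All numbers below (model size, step time, worker count) are
# illustrative assumptions, not measurements from the paper.

def required_compression(grad_bytes, step_time_s, bandwidth_gbps, workers):
    """Compression ratio needed so one ring all-reduce of the gradients
    takes no longer than the per-step compute time."""
    # Ring all-reduce sends and receives about 2 * (N - 1) / N of the
    # gradient bytes per worker.
    traffic_bytes = 2.0 * (workers - 1) / workers * grad_bytes
    link_bytes_per_s = bandwidth_gbps * 1e9 / 8.0
    comm_time_s = traffic_bytes / link_bytes_per_s
    # A ratio of 1.0 means communication already fits without compression.
    return max(1.0, comm_time_s / step_time_s)

if __name__ == "__main__":
    grad_bytes = 100e6 * 4   # assumed 100M-parameter model, fp32 gradients
    step_time_s = 0.2        # assumed per-step compute time in seconds
    for gbps in (10, 25, 100):
        ratio = required_compression(grad_bytes, step_time_s, gbps, workers=8)
        print(f"{gbps:>3} Gbps: ~{ratio:.1f}x compression needed")

With these assumed numbers the sketch lands near the abstract's conclusions: a few times compression at 10 Gbps and none at 100 Gbps, provided the link is actually driven at full utilization.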