Aperiodic Local SGD: Beyond Local SGD

Hao Zhang, Tingting Wu, Siyao Cheng, Jie Liu
{"title":"Aperiodic Local SGD: Beyond Local SGD","authors":"Hao Zhang, Tingting Wu, Siyao Cheng, Jie Liu","doi":"10.1145/3545008.3545013","DOIUrl":null,"url":null,"abstract":"Variations of stochastic gradient decedent (SGD) methods are at the core of training deep neural network models. However, in distributed deep learning, where multiple computing devices and data segments are employed in the training process, the performance of SGD can be significantly limited by the overhead of gradient communication. Local SGD methods are designed to overcome this bottleneck by averaging individual gradients trained over parallel workers after multiple local iterations. Currently, both for theoretical analyses and for practical applications, most studies employ periodic synchronization scheme by default, while few of them focus on the aperiodic schemes to obtain better performance models with limited computation and communication overhead. In this paper, we investigate local SGD with an arbitrary synchronization scheme to answer two questions: (1) Is the periodic synchronization scheme best? (2) If not, what is the optimal one? First, for any synchronization scheme, we derive the performance boundary with fixed overhead, and formulate the performance optimization under given computation and communication constraints. Then we find a succinct property of the optimal scheme that the local iteration number decreases as training continues, which indicates the periodic one is suboptimal. Furthermore, with some reasonable approximations, we obtain an explicit form of the optimal scheme and propose Aperiodic Local SGD (ALSGD) as an improved substitute for local SGD without any overhead increment. Our experiments also confirm that with the same computation and communication overhead, ALSGD outperforms local SGD in performance, especially for heterogeneous data.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Variations of stochastic gradient descent (SGD) methods are at the core of training deep neural network models. However, in distributed deep learning, where multiple computing devices and data segments are employed in the training process, the performance of SGD can be significantly limited by the overhead of gradient communication. Local SGD methods are designed to overcome this bottleneck by averaging the individual updates from parallel workers only after multiple local iterations. Currently, both in theoretical analyses and in practical applications, most studies adopt a periodic synchronization scheme by default, while few focus on aperiodic schemes that could yield better-performing models under limited computation and communication overhead. In this paper, we investigate local SGD with an arbitrary synchronization scheme to answer two questions: (1) Is the periodic synchronization scheme the best? (2) If not, what is the optimal one? First, for any synchronization scheme, we derive a performance bound under fixed overhead and formulate the performance optimization problem under given computation and communication constraints. We then identify a succinct property of the optimal scheme: the number of local iterations decreases as training proceeds, which indicates that the periodic scheme is suboptimal. Furthermore, with some reasonable approximations, we obtain an explicit form of the optimal scheme and propose Aperiodic Local SGD (ALSGD) as an improved substitute for local SGD with no increase in overhead. Our experiments confirm that, with the same computation and communication overhead, ALSGD outperforms local SGD, especially on heterogeneous data.
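
To make the comparison concrete, the sketch below simulates local SGD on a toy least-squares problem with heterogeneous worker shards and contrasts a periodic synchronization schedule with a decreasing (aperiodic) one of equal total computation and communication. This is a minimal NumPy illustration, not the paper's implementation; the particular decreasing schedule is a hypothetical example, since the explicit optimal form derived in the paper is not reproduced here.

```python
# Minimal sketch of local SGD with an arbitrary synchronization schedule.
# schedule[r] gives the number of local iterations before the r-th averaging.
import numpy as np

rng = np.random.default_rng(0)
NUM_WORKERS, DIM = 4, 10

# Each worker holds its own (heterogeneous) least-squares shard:
# f_k(x) = 0.5 * mean((A_k x - b_k)^2).
A = [rng.normal(size=(50, DIM)) for _ in range(NUM_WORKERS)]
b = [rng.normal(size=50) for _ in range(NUM_WORKERS)]

def local_grad(k, x, batch=8):
    """Mini-batch stochastic gradient of worker k's local loss."""
    idx = rng.integers(0, A[k].shape[0], size=batch)
    Ak, bk = A[k][idx], b[k][idx]
    return Ak.T @ (Ak @ x - bk) / batch

def global_loss(x):
    """Average loss over all workers' shards (evaluation only)."""
    return sum(0.5 * np.mean((A[k] @ x - b[k]) ** 2)
               for k in range(NUM_WORKERS)) / NUM_WORKERS

def local_sgd(schedule, lr=0.01):
    """Run local SGD: workers take schedule[r] local steps, then their
    models are averaged (one communication round per entry)."""
    x_global = np.zeros(DIM)
    for local_steps in schedule:
        x_local = [x_global.copy() for _ in range(NUM_WORKERS)]
        for _ in range(local_steps):
            for k in range(NUM_WORKERS):
                x_local[k] -= lr * local_grad(k, x_local[k])
        x_global = np.mean(x_local, axis=0)   # synchronization
    return global_loss(x_global)

# Identical overhead in both runs: 64 local iterations, 4 synchronizations.
periodic  = [16, 16, 16, 16]   # standard (periodic) local SGD
aperiodic = [28, 18, 12, 6]    # hypothetical decreasing schedule
print("periodic :", local_sgd(periodic))
print("aperiodic:", local_sgd(aperiodic))
```

Both schedules spend exactly 64 local iterations and 4 averaging rounds, so any difference in the final loss comes solely from how the synchronization points are placed, which is the comparison the paper makes.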