Balancing Stragglers Against Staleness in Distributed Deep Learning

Saurav Basu, Vaibhav Saxena, Rintu Panja, Ashish Verma
{"title":"分布式深度学习中如何平衡掉队者和过时者","authors":"Saurav Basu, Vaibhav Saxena, Rintu Panja, Ashish Verma","doi":"10.1109/HiPC.2018.00011","DOIUrl":null,"url":null,"abstract":"Synchronous SGD is frequently the algorithm of choice for training deep learning models on compute clusters within reasonable time frames. However, even if a large number of workers (CPUs or GPUs) are at disposal for training, hetero-geneity of compute nodes and unreliability of the interconnecting network frequently pose a bottleneck to the training speed. Since the workers have to wait for each other at every model update step, even a single straggler/slow worker can derail the whole training performance. In this paper, we propose a novel approach to mitigate the straggler problem in large compute clusters. We cluster the compute nodes into multiple groups where each group updates the model synchronously stored in its own parameter server. The parameter servers of the different groups update the model in a central parameter server in an asynchronous manner. Few stragglers in the same group (or even separate groups) have little effect on the computational performance. The staleness of the asynchronous updates can be controlled by limiting the number of groups. Our method, in essence, provides a mechanism to move seamlessly between a pure synchronous and a pure asynchronous setting, thereby balancing between the computational overhead of synchronous SGD and the accuracy degradation of a pure asynchronous SGD. We empirically show that with increasing delay from straggler nodes (more than 300% delay in a node), progressive grouping of available workers still finishes the training within 20% of the no-delay case, with the limit to the number of groups governed by the permissible degradation in accuracy (≤ 2.5% compared to the no-delay case).","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Balancing Stragglers Against Staleness in Distributed Deep Learning\",\"authors\":\"Saurav Basu, Vaibhav Saxena, Rintu Panja, Ashish Verma\",\"doi\":\"10.1109/HiPC.2018.00011\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Synchronous SGD is frequently the algorithm of choice for training deep learning models on compute clusters within reasonable time frames. However, even if a large number of workers (CPUs or GPUs) are at disposal for training, hetero-geneity of compute nodes and unreliability of the interconnecting network frequently pose a bottleneck to the training speed. Since the workers have to wait for each other at every model update step, even a single straggler/slow worker can derail the whole training performance. In this paper, we propose a novel approach to mitigate the straggler problem in large compute clusters. We cluster the compute nodes into multiple groups where each group updates the model synchronously stored in its own parameter server. The parameter servers of the different groups update the model in a central parameter server in an asynchronous manner. Few stragglers in the same group (or even separate groups) have little effect on the computational performance. The staleness of the asynchronous updates can be controlled by limiting the number of groups. 
Our method, in essence, provides a mechanism to move seamlessly between a pure synchronous and a pure asynchronous setting, thereby balancing between the computational overhead of synchronous SGD and the accuracy degradation of a pure asynchronous SGD. We empirically show that with increasing delay from straggler nodes (more than 300% delay in a node), progressive grouping of available workers still finishes the training within 20% of the no-delay case, with the limit to the number of groups governed by the permissible degradation in accuracy (≤ 2.5% compared to the no-delay case).\",\"PeriodicalId\":113335,\"journal\":{\"name\":\"2018 IEEE 25th International Conference on High Performance Computing (HiPC)\",\"volume\":\"45 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE 25th International Conference on High Performance Computing (HiPC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HiPC.2018.00011\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC.2018.00011","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

Synchronous SGD is frequently the algorithm of choice for training deep learning models on compute clusters within reasonable time frames. However, even when a large number of workers (CPUs or GPUs) are available for training, heterogeneity of the compute nodes and unreliability of the interconnecting network frequently bottleneck the training speed. Since the workers have to wait for each other at every model-update step, even a single straggler (slow worker) can derail the whole training performance. In this paper, we propose a novel approach to mitigate the straggler problem in large compute clusters. We cluster the compute nodes into multiple groups, where each group synchronously updates a model stored in its own parameter server. The parameter servers of the different groups then update the model in a central parameter server asynchronously. A few stragglers in the same group (or even in separate groups) therefore have little effect on computational performance, and the staleness of the asynchronous updates can be controlled by limiting the number of groups. Our method, in essence, provides a mechanism to move seamlessly between a purely synchronous and a purely asynchronous setting, thereby balancing the computational overhead of synchronous SGD against the accuracy degradation of purely asynchronous SGD. We show empirically that even with large delays from straggler nodes (more than 300% delay in a node), progressive grouping of the available workers still finishes training within 20% of the no-delay time, with the number of groups limited by the permissible degradation in accuracy (≤ 2.5% relative to the no-delay case).
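
The abstract describes a two-level scheme: workers inside a group average their gradients synchronously at the group's parameter server, while the group parameter servers push their aggregated updates to a central parameter server asynchronously, with the number of groups acting as the knob between purely synchronous and purely asynchronous SGD. The following Python/NumPy sketch is a hypothetical simulation of that idea, not the authors' implementation; the function names (e.g. group_sync_step) and the random stand-in "gradients" are assumptions made only to illustrate how intra-group synchrony and inter-group asynchrony combine.

```python
# Hypothetical sketch of the grouped parameter-server scheme described in the
# abstract (assumed structure, not the paper's code).
import numpy as np

def worker_gradient(params, rng):
    # Stand-in for one worker's mini-batch gradient; random noise here because
    # no real model or data is involved in this sketch.
    return rng.normal(scale=0.01, size=params.shape)

def group_sync_step(group_params, group_size, lr, rng):
    # Synchronous SGD inside one group: all workers' gradients are averaged
    # before the group's parameter server emits a single update, so a slow
    # worker only stalls its own group.
    grads = [worker_gradient(group_params, rng) for _ in range(group_size)]
    return lr * np.mean(grads, axis=0)

def train(num_groups=4, group_size=8, steps=100, lr=0.1, dim=10, seed=0):
    rng = np.random.default_rng(seed)
    central = np.zeros(dim)  # state held by the central parameter server
    for _ in range(steps):
        # Each group reads a (possibly stale) snapshot of the central model...
        snapshot = central.copy()
        # ...and the group updates arrive at the central server in an arbitrary
        # order, mimicking asynchronous pushes with no cross-group barrier.
        for _group in rng.permutation(num_groups):
            central -= group_sync_step(snapshot, group_size, lr, rng)
    return central

if __name__ == "__main__":
    print(train())
```

In this sketch, setting num_groups to 1 recovers fully synchronous SGD, while giving every worker its own group approaches fully asynchronous SGD; the paper's trade-off is choosing a group count small enough to bound staleness yet large enough that stragglers do not block the whole cluster.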