SMEGA2: Distributed Asynchronous Deep Neural Network Training With a Single Momentum Buffer

Refael Cohen, Ido Hakimi, A. Schuster
{"title":"SMEGA2:单动量缓冲的分布式异步深度神经网络训练","authors":"Refael Cohen, Ido Hakimi, A. Schuster","doi":"10.1145/3545008.3545010","DOIUrl":null,"url":null,"abstract":"As the field of deep learning progresses, and neural networks become larger, training them has become a demanding and time consuming task. To tackle this problem, distributed deep learning must be used to scale the training of deep neural networks to many workers. Synchronous algorithms, commonly used for distributing the training, are susceptible to faulty or straggling workers. Asynchronous algorithms do not suffer from the problems of synchronization, but introduce a new problem known as staleness. Staleness is caused by applying out-of-date gradients, and it can greatly hinder the convergence process. Furthermore, asynchronous algorithms that incorporate momentum often require keeping a separate momentum buffer for each worker, which cost additional memory proportional to the number of workers. We introduce a new asynchronous method, SMEGA2, which requires a single momentum buffer regardless of the number of workers. Our method works in a way that lets us estimate the future position of the parameters, thereby minimizing the staleness effect. We evaluate our method on the CIFAR and ImageNet datasets, and show that SMEGA2 outperforms existing methods in terms of final test accuracy while scaling up to as much as 64 asynchronous workers. Open-Source Code: https://github.com/rafi-cohen/SMEGA2","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SMEGA2: Distributed Asynchronous Deep Neural Network Training With a Single Momentum Buffer\",\"authors\":\"Refael Cohen, Ido Hakimi, A. Schuster\",\"doi\":\"10.1145/3545008.3545010\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As the field of deep learning progresses, and neural networks become larger, training them has become a demanding and time consuming task. To tackle this problem, distributed deep learning must be used to scale the training of deep neural networks to many workers. Synchronous algorithms, commonly used for distributing the training, are susceptible to faulty or straggling workers. Asynchronous algorithms do not suffer from the problems of synchronization, but introduce a new problem known as staleness. Staleness is caused by applying out-of-date gradients, and it can greatly hinder the convergence process. Furthermore, asynchronous algorithms that incorporate momentum often require keeping a separate momentum buffer for each worker, which cost additional memory proportional to the number of workers. We introduce a new asynchronous method, SMEGA2, which requires a single momentum buffer regardless of the number of workers. Our method works in a way that lets us estimate the future position of the parameters, thereby minimizing the staleness effect. We evaluate our method on the CIFAR and ImageNet datasets, and show that SMEGA2 outperforms existing methods in terms of final test accuracy while scaling up to as much as 64 asynchronous workers. 
Open-Source Code: https://github.com/rafi-cohen/SMEGA2\",\"PeriodicalId\":360504,\"journal\":{\"name\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"volume\":\"19 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3545008.3545010\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

As the field of deep learning progresses, and neural networks become larger, training them has become a demanding and time consuming task. To tackle this problem, distributed deep learning must be used to scale the training of deep neural networks to many workers. Synchronous algorithms, commonly used for distributing the training, are susceptible to faulty or straggling workers. Asynchronous algorithms do not suffer from the problems of synchronization, but introduce a new problem known as staleness. Staleness is caused by applying out-of-date gradients, and it can greatly hinder the convergence process. Furthermore, asynchronous algorithms that incorporate momentum often require keeping a separate momentum buffer for each worker, which cost additional memory proportional to the number of workers. We introduce a new asynchronous method, SMEGA2, which requires a single momentum buffer regardless of the number of workers. Our method works in a way that lets us estimate the future position of the parameters, thereby minimizing the staleness effect. We evaluate our method on the CIFAR and ImageNet datasets, and show that SMEGA2 outperforms existing methods in terms of final test accuracy while scaling up to as much as 64 asynchronous workers. Open-Source Code: https://github.com/rafi-cohen/SMEGA2
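To make the core idea more concrete, below is a minimal, hypothetical sketch of an asynchronous parameter-server update loop that keeps a single shared momentum buffer and hands each worker an extrapolated "future position" of the parameters, which is the general behavior the abstract describes. This is not the authors' reference implementation (see the linked repository for that); the class name, hyperparameters, and the geometric extrapolation rule are illustrative assumptions.

```python
# Hypothetical sketch (NOT the authors' implementation): a parameter-server-style
# update with ONE momentum buffer shared by all workers, plus a future-position
# estimate intended to reduce the effect of gradient staleness.
import numpy as np

class SingleBufferServer:
    def __init__(self, dim, num_workers, lr=0.1, momentum=0.9):
        self.params = np.zeros(dim)      # shared model parameters
        self.velocity = np.zeros(dim)    # the single momentum buffer
        self.num_workers = num_workers
        self.lr = lr
        self.momentum = momentum

    def apply_gradient(self, grad):
        """Apply one (possibly stale) worker gradient with heavy-ball momentum."""
        self.velocity = self.momentum * self.velocity + grad
        self.params -= self.lr * self.velocity

    def future_estimate(self):
        """Estimate where the parameters will be once the other workers' pending
        updates have been applied, by extrapolating the current momentum over
        num_workers - 1 steps. A worker evaluates its next gradient here."""
        steps = self.num_workers - 1
        # Geometric sum of the decaying momentum contribution over pending steps.
        decay_sum = sum(self.momentum ** k for k in range(1, steps + 1))
        return self.params - self.lr * decay_sum * self.velocity

# Toy usage: minimize f(x) = 0.5 * ||x - target||^2 with 4 asynchronous workers.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    server = SingleBufferServer(dim=8, num_workers=4)
    target = np.ones(8)
    for step in range(200):
        # In a real system each worker holds a stale copy; here we emulate a
        # worker that computed its gradient at the server's future estimate.
        worker_view = server.future_estimate()
        grad = worker_view - target + 0.01 * rng.standard_normal(8)
        server.apply_gradient(grad)
    print("final distance to optimum:", np.linalg.norm(server.params - target))
```

The point of the sketch is the memory argument: the server stores one velocity vector regardless of how many workers submit gradients, whereas per-worker momentum schemes would store one buffer per worker.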