{"title":"SMEGA2:单动量缓冲的分布式异步深度神经网络训练","authors":"Refael Cohen, Ido Hakimi, A. Schuster","doi":"10.1145/3545008.3545010","DOIUrl":null,"url":null,"abstract":"As the field of deep learning progresses, and neural networks become larger, training them has become a demanding and time consuming task. To tackle this problem, distributed deep learning must be used to scale the training of deep neural networks to many workers. Synchronous algorithms, commonly used for distributing the training, are susceptible to faulty or straggling workers. Asynchronous algorithms do not suffer from the problems of synchronization, but introduce a new problem known as staleness. Staleness is caused by applying out-of-date gradients, and it can greatly hinder the convergence process. Furthermore, asynchronous algorithms that incorporate momentum often require keeping a separate momentum buffer for each worker, which cost additional memory proportional to the number of workers. We introduce a new asynchronous method, SMEGA2, which requires a single momentum buffer regardless of the number of workers. Our method works in a way that lets us estimate the future position of the parameters, thereby minimizing the staleness effect. We evaluate our method on the CIFAR and ImageNet datasets, and show that SMEGA2 outperforms existing methods in terms of final test accuracy while scaling up to as much as 64 asynchronous workers. Open-Source Code: https://github.com/rafi-cohen/SMEGA2","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SMEGA2: Distributed Asynchronous Deep Neural Network Training With a Single Momentum Buffer\",\"authors\":\"Refael Cohen, Ido Hakimi, A. Schuster\",\"doi\":\"10.1145/3545008.3545010\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As the field of deep learning progresses, and neural networks become larger, training them has become a demanding and time consuming task. To tackle this problem, distributed deep learning must be used to scale the training of deep neural networks to many workers. Synchronous algorithms, commonly used for distributing the training, are susceptible to faulty or straggling workers. Asynchronous algorithms do not suffer from the problems of synchronization, but introduce a new problem known as staleness. Staleness is caused by applying out-of-date gradients, and it can greatly hinder the convergence process. Furthermore, asynchronous algorithms that incorporate momentum often require keeping a separate momentum buffer for each worker, which cost additional memory proportional to the number of workers. We introduce a new asynchronous method, SMEGA2, which requires a single momentum buffer regardless of the number of workers. Our method works in a way that lets us estimate the future position of the parameters, thereby minimizing the staleness effect. We evaluate our method on the CIFAR and ImageNet datasets, and show that SMEGA2 outperforms existing methods in terms of final test accuracy while scaling up to as much as 64 asynchronous workers. 
Open-Source Code: https://github.com/rafi-cohen/SMEGA2\",\"PeriodicalId\":360504,\"journal\":{\"name\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"volume\":\"19 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3545008.3545010\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
SMEGA2: Distributed Asynchronous Deep Neural Network Training With a Single Momentum Buffer
As the field of deep learning progresses and neural networks become larger, training them has become a demanding and time-consuming task. To tackle this problem, distributed deep learning must be used to scale the training of deep neural networks to many workers. Synchronous algorithms, commonly used for distributing the training, are susceptible to faulty or straggling workers. Asynchronous algorithms do not suffer from the problems of synchronization, but introduce a new problem known as staleness. Staleness is caused by applying out-of-date gradients, and it can greatly hinder the convergence process. Furthermore, asynchronous algorithms that incorporate momentum often require keeping a separate momentum buffer for each worker, which costs additional memory proportional to the number of workers. We introduce a new asynchronous method, SMEGA2, which requires a single momentum buffer regardless of the number of workers. Our method estimates the future position of the parameters, thereby minimizing the staleness effect. We evaluate our method on the CIFAR and ImageNet datasets, and show that SMEGA2 outperforms existing methods in terms of final test accuracy while scaling up to as many as 64 asynchronous workers.
Open-Source Code: https://github.com/rafi-cohen/SMEGA2
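To make the two ideas in the abstract concrete, the sketch below simulates an asynchronous parameter server that keeps a single momentum buffer shared by all workers and hands each worker a "future position" estimate of the parameters, obtained by extrapolating the current velocity over the expected gradient delay. This is a minimal illustration under stated assumptions, not the authors' exact SMEGA2 update rule (see the linked repository for that); names such as `future_estimate` and `lookahead_steps`, the toy quadratic objective, and the geometric-decay extrapolation are illustrative choices made here.

```python
# Hypothetical sketch of asynchronous momentum SGD with ONE shared momentum
# buffer and a lookahead ("future position") parameter estimate. Not the
# paper's algorithm; a toy simulation of the general idea.
import numpy as np

rng = np.random.default_rng(0)
dim, n_workers, lr, beta = 10, 8, 0.01, 0.9

# Toy objective: f(w) = 0.5 * ||A w - b||^2, so grad f(w) = A^T (A w - b).
A = rng.normal(size=(20, dim))
b = rng.normal(size=20)
grad_fn = lambda w: A.T @ (A @ w - b)

params = rng.normal(size=dim)      # shared model parameters on the server
momentum = np.zeros(dim)           # single momentum buffer for ALL workers

def future_estimate(params, momentum, lookahead_steps):
    """Extrapolate where the parameters will be after `lookahead_steps`
    momentum-SGD updates, assuming the new gradient contributions are
    negligible and the velocity just decays geometrically by beta."""
    scale = beta * (1 - beta ** lookahead_steps) / (1 - beta)
    return params - lr * scale * momentum

# Each worker computes its gradient at the server's future estimate instead
# of the (soon to be stale) current parameters. We assume the expected delay
# is roughly one update per outstanding worker.
inflight = [future_estimate(params, momentum, n_workers) for _ in range(n_workers)]

for step in range(2000):
    w_id = rng.integers(n_workers)      # an arbitrary worker finishes first
    g = grad_fn(inflight[w_id])         # gradient computed at its snapshot
    momentum = beta * momentum + g      # update the single shared buffer
    params = params - lr * momentum     # apply a momentum-SGD step on the server
    # Hand the worker a fresh future-position estimate for its next batch.
    inflight[w_id] = future_estimate(params, momentum, n_workers)

print("final loss:", 0.5 * np.sum((A @ params - b) ** 2))
```

The memory point from the abstract shows up directly in the sketch: the server stores one `momentum` vector no matter how large `n_workers` is, whereas per-worker-buffer schemes would store `n_workers` such vectors. The lookahead estimate is one plausible way to compensate for the fact that each incoming gradient was computed against parameters that are several updates old.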