{"title":"SMEGA2:单动量缓冲的分布式异步深度神经网络训练","authors":"Refael Cohen, Ido Hakimi, A. Schuster","doi":"10.1145/3545008.3545010","DOIUrl":null,"url":null,"abstract":"As the field of deep learning progresses, and neural networks become larger, training them has become a demanding and time consuming task. To tackle this problem, distributed deep learning must be used to scale the training of deep neural networks to many workers. Synchronous algorithms, commonly used for distributing the training, are susceptible to faulty or straggling workers. Asynchronous algorithms do not suffer from the problems of synchronization, but introduce a new problem known as staleness. Staleness is caused by applying out-of-date gradients, and it can greatly hinder the convergence process. Furthermore, asynchronous algorithms that incorporate momentum often require keeping a separate momentum buffer for each worker, which cost additional memory proportional to the number of workers. We introduce a new asynchronous method, SMEGA2, which requires a single momentum buffer regardless of the number of workers. Our method works in a way that lets us estimate the future position of the parameters, thereby minimizing the staleness effect. We evaluate our method on the CIFAR and ImageNet datasets, and show that SMEGA2 outperforms existing methods in terms of final test accuracy while scaling up to as much as 64 asynchronous workers. Open-Source Code: https://github.com/rafi-cohen/SMEGA2","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SMEGA2: Distributed Asynchronous Deep Neural Network Training With a Single Momentum Buffer\",\"authors\":\"Refael Cohen, Ido Hakimi, A. Schuster\",\"doi\":\"10.1145/3545008.3545010\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As the field of deep learning progresses, and neural networks become larger, training them has become a demanding and time consuming task. To tackle this problem, distributed deep learning must be used to scale the training of deep neural networks to many workers. Synchronous algorithms, commonly used for distributing the training, are susceptible to faulty or straggling workers. Asynchronous algorithms do not suffer from the problems of synchronization, but introduce a new problem known as staleness. Staleness is caused by applying out-of-date gradients, and it can greatly hinder the convergence process. Furthermore, asynchronous algorithms that incorporate momentum often require keeping a separate momentum buffer for each worker, which cost additional memory proportional to the number of workers. We introduce a new asynchronous method, SMEGA2, which requires a single momentum buffer regardless of the number of workers. Our method works in a way that lets us estimate the future position of the parameters, thereby minimizing the staleness effect. We evaluate our method on the CIFAR and ImageNet datasets, and show that SMEGA2 outperforms existing methods in terms of final test accuracy while scaling up to as much as 64 asynchronous workers. 
Open-Source Code: https://github.com/rafi-cohen/SMEGA2\",\"PeriodicalId\":360504,\"journal\":{\"name\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"volume\":\"19 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3545008.3545010\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
SMEGA2: Distributed Asynchronous Deep Neural Network Training With a Single Momentum Buffer
As the field of deep learning progresses and neural networks become larger, training them has become a demanding and time-consuming task. To tackle this problem, distributed deep learning must be used to scale the training of deep neural networks to many workers. Synchronous algorithms, commonly used for distributing the training, are susceptible to faulty or straggling workers. Asynchronous algorithms do not suffer from the problems of synchronization, but introduce a new problem known as staleness. Staleness is caused by applying out-of-date gradients, and it can greatly hinder the convergence process. Furthermore, asynchronous algorithms that incorporate momentum often require keeping a separate momentum buffer for each worker, which costs additional memory proportional to the number of workers. We introduce a new asynchronous method, SMEGA2, which requires a single momentum buffer regardless of the number of workers. Our method estimates the future position of the parameters, thereby minimizing the staleness effect. We evaluate our method on the CIFAR and ImageNet datasets, and show that SMEGA2 outperforms existing methods in terms of final test accuracy while scaling up to as many as 64 asynchronous workers.
Open-Source Code: https://github.com/rafi-cohen/SMEGA2
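To make the two ideas in the abstract concrete, the sketch below simulates an asynchronous parameter server that keeps a single momentum buffer shared by all workers and hands each worker a "future position" estimate of the parameters, obtained by extrapolating the current velocity over the expected gradient delay. This is a minimal illustration under stated assumptions, not the authors' exact SMEGA2 update rule (see the linked repository for that); names such as `future_estimate` and `lookahead_steps`, the toy quadratic objective, and the geometric-decay extrapolation are illustrative choices made here.

```python
# Hypothetical sketch of asynchronous momentum SGD with ONE shared momentum
# buffer and a lookahead ("future position") parameter estimate. Not the
# paper's algorithm; a toy simulation of the general idea.
import numpy as np

rng = np.random.default_rng(0)
dim, n_workers, lr, beta = 10, 8, 0.01, 0.9

# Toy objective: f(w) = 0.5 * ||A w - b||^2, so grad f(w) = A^T (A w - b).
A = rng.normal(size=(20, dim))
b = rng.normal(size=20)
grad_fn = lambda w: A.T @ (A @ w - b)

params = rng.normal(size=dim)      # shared model parameters on the server
momentum = np.zeros(dim)           # single momentum buffer for ALL workers

def future_estimate(params, momentum, lookahead_steps):
    """Extrapolate where the parameters will be after `lookahead_steps`
    momentum-SGD updates, assuming the new gradient contributions are
    negligible and the velocity just decays geometrically by beta."""
    scale = beta * (1 - beta ** lookahead_steps) / (1 - beta)
    return params - lr * scale * momentum

# Each worker computes its gradient at the server's future estimate instead
# of the (soon to be stale) current parameters. We assume the expected delay
# is roughly one update per outstanding worker.
inflight = [future_estimate(params, momentum, n_workers) for _ in range(n_workers)]

for step in range(2000):
    w_id = rng.integers(n_workers)      # an arbitrary worker finishes first
    g = grad_fn(inflight[w_id])         # gradient computed at its snapshot
    momentum = beta * momentum + g      # update the single shared buffer
    params = params - lr * momentum     # apply a momentum-SGD step on the server
    # Hand the worker a fresh future-position estimate for its next batch.
    inflight[w_id] = future_estimate(params, momentum, n_workers)

print("final loss:", 0.5 * np.sum((A @ params - b) ** 2))
```

The memory point from the abstract shows up directly in the sketch: the server stores one `momentum` vector no matter how large `n_workers` is, whereas per-worker-buffer schemes would store `n_workers` such vectors. The lookahead estimate is one plausible way to compensate for the fact that each incoming gradient was computed against parameters that are several updates old.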