Rethinking the Value of Asynchronous Solvers for Distributed Deep Learning

Arissa Wongpanich, Yang You, J. Demmel
{"title":"Rethinking the Value of Asynchronous Solvers for Distributed Deep Learning","authors":"Arissa Wongpanich, Yang You, J. Demmel","doi":"10.1145/3368474.3368498","DOIUrl":null,"url":null,"abstract":"In recent years, the field of machine learning has seen significant advances as data becomes more abundant and deep learning models become larger and more complex. However, these improvements in accuracy [2] have come at the cost of longer training time. As a result, state-of-the-art models like OpenAI's GPT-2 [18] or AlphaZero [20] require the use of distributed systems or clusters in order to speed up training. Currently, there exist both asynchronous and synchronous solvers for distributed training. In this paper, we implement state-of-the-art asynchronous and synchronous solvers, then conduct a comparison between them to help readers pick the most appropriate solver for their own applications. We address three main challenges: (1) implementing asynchronous solvers that can outperform six common algorithm variants, (2) achieving state-of-the-art distributed performance for various applications with different computational patterns, and (3) maintaining accuracy for large-batch asynchronous training. For asynchronous algorithms, we implement an algorithm called EA-wild, which combines the idea of non-locking wild updates from Hogwild! [19] with EASGD. Our implementation is able to scale to 217,600 cores and finish 90 epochs of training the ResNet-50 model on ImageNet in 15 minutes (the baseline takes 29 hours on eight NVIDIA P100 GPUs). We conclude that more complex models (e.g., ResNet-50) favor synchronous methods, while our asynchronous solver outperforms the synchronous solver for models with a low computation-communication ratio. The results are documented in this paper; for more results, readers can refer to our supplemental website 1.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3368474.3368498","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

In recent years, the field of machine learning has seen significant advances as data becomes more abundant and deep learning models become larger and more complex. However, these improvements in accuracy [2] have come at the cost of longer training time. As a result, state-of-the-art models like OpenAI's GPT-2 [18] or AlphaZero [20] require distributed systems or clusters to speed up training. Both asynchronous and synchronous solvers currently exist for distributed training. In this paper, we implement state-of-the-art asynchronous and synchronous solvers and compare them to help readers pick the most appropriate solver for their own applications. We address three main challenges: (1) implementing asynchronous solvers that can outperform six common algorithm variants, (2) achieving state-of-the-art distributed performance for various applications with different computational patterns, and (3) maintaining accuracy for large-batch asynchronous training. For the asynchronous algorithms, we implement an algorithm called EA-wild, which combines the idea of non-locking (wild) updates from Hogwild! [19] with EASGD. Our implementation scales to 217,600 cores and finishes 90 epochs of training the ResNet-50 model on ImageNet in 15 minutes (the baseline takes 29 hours on eight NVIDIA P100 GPUs). We conclude that more complex models (e.g., ResNet-50) favor synchronous methods, while our asynchronous solver outperforms the synchronous solver for models with a low computation-to-communication ratio. The results are documented in this paper; for more results, readers can refer to our supplemental website.
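The core of EA-wild is easy to state: each worker takes an EASGD step, pulling its local parameters toward a shared center variable, but the center itself is updated Hogwild!-style, with no lock. Below is a minimal single-machine sketch of that update rule, assuming a toy least-squares objective; the `sample` and `grad` helpers, the thread-based workers, and all hyperparameter values are illustrative placeholders rather than the paper's implementation.

```python
# Minimal sketch of an EA-wild-style update loop: EASGD's elastic term
# plus Hogwild!-style lock-free writes to the shared center variable.
# All names and hyperparameters here are illustrative placeholders
# (a toy least-squares problem), not the paper's code.
import threading
import numpy as np

DIM = 10          # parameter dimensionality (toy setting)
ETA = 0.01        # learning rate
RHO = 0.1         # elastic penalty linking workers to the center
STEPS = 1000      # local steps per worker
N_WORKERS = 4

center = np.zeros(DIM)  # shared center variable, updated without locks

def sample(rng):
    # Hypothetical data sampler: a random least-squares batch whose
    # solution is the all-ones vector.
    A = rng.standard_normal((8, DIM))
    return A, A @ np.ones(DIM)

def grad(x, batch):
    # Gradient of 0.5 * ||A x - b||^2.
    A, b = batch
    return A.T @ (A @ x - b)

def worker(seed):
    global center
    rng = np.random.default_rng(seed)   # per-worker RNG (thread-safe)
    x = center.copy()                   # local parameters for this worker
    for _ in range(STEPS):
        g = grad(x, sample(rng))
        diff = x - center               # elastic force toward the center
        x -= ETA * (g + RHO * diff)
        # "Wild" (Hogwild!-style) step: no lock is taken, so concurrent
        # writes from other workers may interleave with this one.
        center += ETA * RHO * diff

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("center after training:", np.round(center[:4], 3))
```

The design trade-off is visible in the last line of the loop: a locked EASGD center update would serialize workers at every step, while the wild variant tolerates interleaved writes in exchange for higher throughput, which is what makes this style of solver competitive for models with a low computation-to-communication ratio.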