What does fault tolerant deep learning need from MPI?

Proceedings of the 24th European MPI Users' Group Meeting. Pub Date: 2017-09-11. DOI: 10.1145/3127024.3127037

Vinay C. Amatya, Abhinav Vishnu, C. Siegel, J. Daily


Citations: 16

Abstract

Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithms for large-scale data analysis. DL algorithms are computationally expensive: even distributed DL implementations that use MPI require days of training (model learning) time on commonly studied datasets. Long-running DL applications are therefore susceptible to faults, which requires the development of a fault-tolerant system infrastructure in addition to fault-tolerant DL algorithms. This raises an important question: what is needed from MPI for designing fault-tolerant DL implementations? In this paper, we address this question for permanent faults. We motivate the need for a fault-tolerant MPI specification through an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion of the suitability of different parallelism types (model, data, and hybrid); the need (or lack thereof) for checkpointing of any critical data structures; and, most importantly, a consideration of several fault tolerance proposals in MPI (User-Level Failure Mitigation (ULFM) and Reinit) and their applicability to fault-tolerant DL implementations. We leverage a distributed-memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by extending MaTEx-Caffe to use a ULFM-based implementation. Our evaluation using the ImageNet dataset and the AlexNet and GoogLeNet neural network topologies demonstrates the effectiveness of the proposed fault-tolerant DL implementation using ULFM based on Open MPI.
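To make the data-parallel case concrete, the following is a minimal sketch in plain Python, not the paper's MaTEx-Caffe code and not real MPI. It assumes that every worker holds a full replica of the model, so that after a permanent fault the survivors can shrink the group and keep averaging gradients without restoring a checkpoint. The names `Worker`, `allreduce_mean`, and `train`, the one-parameter model, and the choice to simply drop the failed worker's data shard are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    rank: int
    shard: list          # local (x, y) training pairs (hypothetical data)
    weight: float = 0.0  # full replica of a one-parameter model y = w * x
    alive: bool = True


def local_gradient(worker):
    # Gradient of 0.5 * (w*x - y)^2, averaged over the worker's local shard.
    return sum((worker.weight * x - y) * x for x, y in worker.shard) / len(worker.shard)


def allreduce_mean(workers):
    # Stand-in for an allreduce over the current (surviving) group.
    grads = [local_gradient(w) for w in workers]
    return sum(grads) / len(grads)


def train(steps, fail_at, fail_rank, num_workers=4, lr=0.05):
    # Each worker gets one sample of the target function y = 2 * x.
    workers = [Worker(rank=r, shard=[(r + 1.0, 2.0 * (r + 1.0))])
               for r in range(num_workers)]
    for step in range(steps):
        if step == fail_at:
            for w in workers:
                if w.rank == fail_rank:
                    w.alive = False  # simulated permanent fault
        # "Shrink" the group: drop failed workers and continue with survivors.
        # Assumption: the failed worker's shard is dropped, not redistributed.
        workers = [w for w in workers if w.alive]
        avg = allreduce_mean(workers)
        for w in workers:
            w.weight -= lr * avg  # every replica applies the same update
    return workers


if __name__ == "__main__":
    survivors = train(steps=50, fail_at=10, fail_rank=3)
    print([w.rank for w in survivors], survivors[0].weight)
```

In this toy setting the surviving replicas stay identical after the fault because each one applies the same averaged gradient. In a real ULFM-based code the shrink step would be performed on the MPI communicator rather than on a Python list.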

