ElasticDL: A Kubernetes-native Deep Learning Framework with Fault-tolerance and Elastic Scheduling

Jun Zhou, Ke Zhang, Feng Zhu, Qitao Shi, Wenjing Fang, Lin Wang, Yi Wang
{"title":"ElasticDL: A Kubernetes-native Deep Learning Framework with Fault-tolerance and Elastic Scheduling","authors":"Jun Zhou, Ke Zhang, Feng Zhu, Qitao Shi, Wenjing Fang, Lin Wang, Yi Wang","doi":"10.1145/3539597.3573037","DOIUrl":null,"url":null,"abstract":"The power of artificial intelligence (AI) models originates with sophisticated model architecture as well as the sheer size of the model. These large-scale AI models impose new and challenging system requirements regarding scalability, reliability, and flexibility. One of the most promising solutions in the industry is to train these large-scale models on distributed deep-learning frameworks. With the power of all distributed computations, it is desired to achieve a training process with excellent scalability, elastic scheduling (flexibility), and fault tolerance (reliability). In this paper, we demonstrate the scalability, flexibility, and reliability of our open-source Elastic Deep Learning (ElasticDL) framework. Our ElasticDL utilizes an open-source system, i.e., Kubernetes, for automating deployment, scaling, and management of containerized application features to provide fault tolerance and support elastic scheduling for DL tasks.","PeriodicalId":227804,"journal":{"name":"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3539597.3573037","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

The power of artificial intelligence (AI) models originates with sophisticated model architecture as well as the sheer size of the model. These large-scale AI models impose new and challenging system requirements regarding scalability, reliability, and flexibility. One of the most promising solutions in the industry is to train these large-scale models on distributed deep-learning frameworks. With the power of all distributed computations, it is desired to achieve a training process with excellent scalability, elastic scheduling (flexibility), and fault tolerance (reliability). In this paper, we demonstrate the scalability, flexibility, and reliability of our open-source Elastic Deep Learning (ElasticDL) framework. Our ElasticDL utilizes an open-source system, i.e., Kubernetes, for automating deployment, scaling, and management of containerized application features to provide fault tolerance and support elastic scheduling for DL tasks.
ElasticDL:一个kubernetes原生的深度学习框架,具有容错和弹性调度功能
人工智能(AI)模型的强大源于复杂的模型架构以及模型的庞大规模。这些大规模的人工智能模型在可扩展性、可靠性和灵活性方面提出了新的和具有挑战性的系统要求。业界最有前途的解决方案之一是在分布式深度学习框架上训练这些大规模模型。利用所有分布式计算的能力,期望实现具有良好可伸缩性、弹性调度(灵活性)和容错(可靠性)的训练过程。在本文中,我们展示了开源弹性深度学习(ElasticDL)框架的可扩展性、灵活性和可靠性。我们的ElasticDL利用一个开源系统,即Kubernetes,来自动化部署、扩展和管理容器化应用程序的特性,为DL任务提供容错和弹性调度支持。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信