FfDL: A Flexible Multi-tenant Deep Learning Platform

K. R. Jayaram, Vinod Muthusamy, Parijat Dube, Vatche Isahagian, Chen Wang, Benjamin Herta, S. Boag, Diana Arroyo, A. Tantawi, Archit Verma, Falk Pollok, Rania Y. Khalaf
Proceedings of the 20th International Middleware Conference. Published 2019-09-14.
DOI: 10.1145/3361525.3361538 (https://doi.org/10.1145/3361525.3361538)
Citations: 16

Abstract

Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large scale "on-premise" and "cloud-hosted" deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the overheads introduced by the platform for various DL models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed including faults that we did not anticipate, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.