FfDL: A Flexible Multi-tenant Deep Learning Platform

Proceedings of the 20th International Middleware Conference. Pub Date: 2019-09-14. DOI: 10.1145/3361525.3361538

K. R. Jayaram, Vinod Muthusamy, Parijat Dube, Vatche Isahagian, Chen Wang, Benjamin Herta, S. Boag, Diana Arroyo, A. Tantawi, Archit Verma, Falk Pollok, Rania Y. Khalaf


Citations: 16

Abstract

Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large scale "on-premise" and "cloud-hosted" deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the overheads introduced by the platform for various DL models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed including faults that we did not anticipate, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.

