Optimizing on-demand GPUs in the Cloud for Deep Learning Applications Training

2019 4th International Conference on Computing, Communications and Security (ICCCS). Pub Date: 2019-10-01. DOI: 10.1109/CCCS.2019.8888151

A. Jahani, M. Lattuada, M. Ciavotta, D. Ardagna, E. Amaldi, Li Zhang


Citations: 5


Optimizing on-demand GPUs in the Cloud for Deep Learning Applications Training

Deep learning (DL) methods have recently gained popularity and are now used in commonplace applications such as voice and face recognition. Despite the growing popularity of DL and of the associated hardware acceleration techniques, GPU-based systems remain very expensive. Moreover, while the cloud is a cost-effective and flexible solution, in large settings operating costs can be reduced further by carefully managing and fostering resource sharing. This work addresses the online joint problem of capacity planning for virtual machines (VMs) and scheduling of DL training jobs, and proposes a Mixed Integer Linear Programming (MILP) formulation. In particular, each DL job is assumed to have a deadline, multiple VM types are available from a cloud provider's catalog, and each VM may have multiple GPUs. Our solutions optimize operating costs by (i) right-sizing the VM capacities; (ii) partitioning the set of GPUs among multiple concurrent jobs running on the same VM; and (iii) determining a deadline-aware job schedule. Our approach is evaluated using an ad hoc simulator and a prototype environment and compared against first-principle approaches, yielding a cost reduction of 45-80%.

