{"title":"Scavenger: A Cloud Service For Optimizing Cost and Performance of ML Training","authors":"S. Tyagi, Prateek Sharma","doi":"10.1109/CCGrid57682.2023.00045","DOIUrl":null,"url":null,"abstract":"Cloud computing platforms can provide the compu-tational resources required for training large machine learning models such as deep neural networks. While the pay-as-you- go nature of cloud virtual machines (VMs) makes it easy to spin-up large clusters for training models, it can also lead to ballooning costs. The 100s of virtual machine sizes provided by cloud platforms also makes it extremely challenging to select the “right” cloud cluster configuration for training. Furthermore, the training time and cost of distributed model training is highly sensitive to the cluster configurations, and presents a large and complex tradeoff-space. In this paper, we develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud. Our key insight is that both the parallel and statistical efficiency must be considered when selecting the optimum job configuration parameters such as the number of workers and the batch size. By combining conventional parallel scaling concepts and new insights into SGD noise, we develop models for estimating the time and cost on different cluster configurations. Using the repetitive nature of training and our performance models, our Scavenger cloud service can search for optimum cloud configurations in a black-box, online manner. Our approach reduces training times by 2 x and costs by more than 50 %. Our performance models are accurate to within 2 %, and our search imposes only a 10% overhead compared to an ideal oracle- based approach.","PeriodicalId":363806,"journal":{"name":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid57682.2023.00045","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Cloud computing platforms can provide the computational resources required for training large machine learning models such as deep neural networks. While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy to spin up large clusters for training models, it can also lead to ballooning costs. The hundreds of virtual machine sizes offered by cloud platforms also make it extremely challenging to select the “right” cloud cluster configuration for training. Furthermore, the training time and cost of distributed model training are highly sensitive to the cluster configuration and present a large and complex tradeoff space. In this paper, we develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud. Our key insight is that both parallel and statistical efficiency must be considered when selecting optimal job configuration parameters such as the number of workers and the batch size. By combining conventional parallel scaling concepts with new insights into SGD noise, we develop models for estimating the training time and cost on different cluster configurations. Exploiting the repetitive nature of training together with these performance models, our Scavenger cloud service searches for optimal cloud configurations in a black-box, online manner. Our approach reduces training times by 2x and costs by more than 50%. Our performance models are accurate to within 2%, and our search imposes only a 10% overhead compared to an ideal oracle-based approach.
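The sketch below illustrates the kind of time/cost estimation the abstract describes: per-step time modeled from parallel efficiency (compute plus synchronization overhead that grows with worker count) and steps-to-convergence modeled from statistical efficiency (diminishing returns once the batch size approaches the SGD noise scale), combined and searched over candidate configurations. It is not Scavenger's actual implementation; all function forms, constants, and names are illustrative assumptions.

```python
"""Minimal sketch: estimate training time and dollar cost per cluster
configuration by combining parallel and statistical efficiency, then
brute-force the tradeoff space. Constants are made up for illustration."""

from dataclasses import dataclass
from itertools import product


@dataclass
class ClusterConfig:
    workers: int              # number of VMs in the cluster
    batch_size: int           # global batch size across all workers
    vm_price_per_hour: float  # on-demand price of one VM


def time_per_step(cfg: ClusterConfig,
                  compute_time: float = 0.08,
                  comm_overhead: float = 0.02) -> float:
    """Parallel efficiency: per-step time = local compute on the per-worker
    batch + gradient-synchronization cost that grows mildly with worker
    count (assumed form; a real system would profile this online)."""
    local_batch = cfg.batch_size / cfg.workers
    compute = compute_time * local_batch / 32        # normalized to batch 32
    comm = comm_overhead * (1 + 0.1 * (cfg.workers - 1))
    return compute + comm


def steps_to_converge(batch_size: int,
                      noise_scale: float = 512,
                      serial_steps: float = 50_000) -> float:
    """Statistical efficiency: larger batches need fewer SGD steps, with
    diminishing returns once the batch exceeds the gradient-noise scale
    (shape inspired by SGD-noise analyses; constants are illustrative)."""
    return serial_steps * (noise_scale / batch_size + 1) / (noise_scale / 32 + 1)


def estimate(cfg: ClusterConfig) -> tuple[float, float]:
    """Return (training_hours, dollar_cost) for one configuration."""
    hours = steps_to_converge(cfg.batch_size) * time_per_step(cfg) / 3600
    cost = hours * cfg.workers * cfg.vm_price_per_hour
    return hours, cost


if __name__ == "__main__":
    # Exhaustively score the (workers, batch size) tradeoff space; an online
    # black-box searcher would instead refine these estimates from
    # measurements gathered during the repetitive training iterations.
    candidates = (ClusterConfig(w, b, vm_price_per_hour=0.50)
                  for w, b in product([1, 2, 4, 8, 16],
                                      [128, 256, 512, 1024, 2048]))
    best = min(candidates, key=lambda c: estimate(c)[1])
    hours, cost = estimate(best)
    print(f"cheapest: {best.workers} workers, batch {best.batch_size}, "
          f"{hours:.2f} h, ${cost:.2f}")
```

In this toy model the cheapest configuration is the one minimizing estimated cost; swapping the objective to `estimate(c)[0]`, or a weighted combination, trades cost against training time over the same search space.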