{"title":"Scavenger: A Cloud Service For Optimizing Cost and Performance of ML Training","authors":"S. Tyagi, Prateek Sharma","doi":"10.1109/CCGrid57682.2023.00045","DOIUrl":null,"url":null,"abstract":"Cloud computing platforms can provide the compu-tational resources required for training large machine learning models such as deep neural networks. While the pay-as-you- go nature of cloud virtual machines (VMs) makes it easy to spin-up large clusters for training models, it can also lead to ballooning costs. The 100s of virtual machine sizes provided by cloud platforms also makes it extremely challenging to select the “right” cloud cluster configuration for training. Furthermore, the training time and cost of distributed model training is highly sensitive to the cluster configurations, and presents a large and complex tradeoff-space. In this paper, we develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud. Our key insight is that both the parallel and statistical efficiency must be considered when selecting the optimum job configuration parameters such as the number of workers and the batch size. By combining conventional parallel scaling concepts and new insights into SGD noise, we develop models for estimating the time and cost on different cluster configurations. Using the repetitive nature of training and our performance models, our Scavenger cloud service can search for optimum cloud configurations in a black-box, online manner. Our approach reduces training times by 2 x and costs by more than 50 %. Our performance models are accurate to within 2 %, and our search imposes only a 10% overhead compared to an ideal oracle- based approach.","PeriodicalId":363806,"journal":{"name":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid57682.2023.00045","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Cloud computing platforms can provide the computational resources required for training large machine learning models such as deep neural networks. While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy to spin up large clusters for training models, it can also lead to ballooning costs. The hundreds of virtual machine sizes offered by cloud platforms also make it extremely challenging to select the “right” cloud cluster configuration for training. Furthermore, the training time and cost of distributed model training are highly sensitive to the cluster configuration and present a large and complex tradeoff space. In this paper, we develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud. Our key insight is that both parallel and statistical efficiency must be considered when selecting optimal job configuration parameters such as the number of workers and the batch size. By combining conventional parallel scaling concepts with new insights into SGD noise, we develop models for estimating the training time and cost on different cluster configurations. Exploiting the repetitive nature of training together with these performance models, our Scavenger cloud service searches for optimal cloud configurations in a black-box, online manner. Our approach reduces training times by 2x and costs by more than 50%. Our performance models are accurate to within 2%, and our search imposes only a 10% overhead compared to an ideal oracle-based approach.
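The sketch below illustrates the kind of time/cost estimation the abstract describes: per-step time modeled from parallel efficiency (compute plus synchronization overhead that grows with worker count) and steps-to-convergence modeled from statistical efficiency (diminishing returns once the batch size approaches the SGD noise scale), combined and searched over candidate configurations. It is not Scavenger's actual implementation; all function forms, constants, and names are illustrative assumptions.

```python
"""Minimal sketch: estimate training time and dollar cost per cluster
configuration by combining parallel and statistical efficiency, then
brute-force the tradeoff space. Constants are made up for illustration."""

from dataclasses import dataclass
from itertools import product


@dataclass
class ClusterConfig:
    workers: int              # number of VMs in the cluster
    batch_size: int           # global batch size across all workers
    vm_price_per_hour: float  # on-demand price of one VM


def time_per_step(cfg: ClusterConfig,
                  compute_time: float = 0.08,
                  comm_overhead: float = 0.02) -> float:
    """Parallel efficiency: per-step time = local compute on the per-worker
    batch + gradient-synchronization cost that grows mildly with worker
    count (assumed form; a real system would profile this online)."""
    local_batch = cfg.batch_size / cfg.workers
    compute = compute_time * local_batch / 32        # normalized to batch 32
    comm = comm_overhead * (1 + 0.1 * (cfg.workers - 1))
    return compute + comm


def steps_to_converge(batch_size: int,
                      noise_scale: float = 512,
                      serial_steps: float = 50_000) -> float:
    """Statistical efficiency: larger batches need fewer SGD steps, with
    diminishing returns once the batch exceeds the gradient-noise scale
    (shape inspired by SGD-noise analyses; constants are illustrative)."""
    return serial_steps * (noise_scale / batch_size + 1) / (noise_scale / 32 + 1)


def estimate(cfg: ClusterConfig) -> tuple[float, float]:
    """Return (training_hours, dollar_cost) for one configuration."""
    hours = steps_to_converge(cfg.batch_size) * time_per_step(cfg) / 3600
    cost = hours * cfg.workers * cfg.vm_price_per_hour
    return hours, cost


if __name__ == "__main__":
    # Exhaustively score the (workers, batch size) tradeoff space; an online
    # black-box searcher would instead refine these estimates from
    # measurements gathered during the repetitive training iterations.
    candidates = (ClusterConfig(w, b, vm_price_per_hour=0.50)
                  for w, b in product([1, 2, 4, 8, 16],
                                      [128, 256, 512, 1024, 2048]))
    best = min(candidates, key=lambda c: estimate(c)[1])
    hours, cost = estimate(best)
    print(f"cheapest: {best.workers} workers, batch {best.batch_size}, "
          f"{hours:.2f} h, ${cost:.2f}")
```

In this toy model the cheapest configuration is the one minimizing estimated cost; swapping the objective to `estimate(c)[0]`, or a weighted combination, trades cost against training time over the same search space.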