{"title":"Scavenger: A Cloud Service for Optimizing Cost and Performance of DL Training","authors":"S. Tyagi","doi":"10.1109/CCGridW59191.2023.00081","DOIUrl":null,"url":null,"abstract":"Deep learning (DL) models learn non-linear functions and relationships by iteratively training on given data. To accelerate training further, data-parallel training [1] launches multiple instances of training process on separate partitions of data and periodically aggregates model updates. With the availability of VMs in the cloud, choosing the “right“ cluster configuration for data-parallel training presents non-trivial challenges. We tackle this problem by considering both the parallel and statistical efficiency of distributed training w.r.t. the cluster size configuration and batch-size in training. We build performance models to evaluate the pareto-relationship between cost and time of DL training across different cluster and batch-size configurations and develop Scavenger as a cloud service for searching optimum cloud configurations in an online, blackbox manner.","PeriodicalId":341115,"journal":{"name":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGridW59191.2023.00081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Deep learning (DL) models learn non-linear functions and relationships by iteratively training on given data. To accelerate training further, data-parallel training [1] launches multiple instances of the training process on separate partitions of the data and periodically aggregates model updates. With the availability of VMs in the cloud, choosing the "right" cluster configuration for data-parallel training presents non-trivial challenges. We tackle this problem by considering both the parallel and statistical efficiency of distributed training with respect to the cluster size and the training batch size. We build performance models to evaluate the Pareto relationship between the cost and time of DL training across different cluster and batch-size configurations, and develop Scavenger as a cloud service for searching for optimal cloud configurations in an online, black-box manner.
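To make the cost-time trade-off concrete, the following is a minimal sketch (in Python) of how such a performance model over cluster size and batch size might look. It is not Scavenger's actual model: the constants, the communication-cost term, and the diminishing-returns convergence curve are all hypothetical placeholders chosen for illustration. The Pareto-front enumeration at the end shows the kind of non-dominated configuration set the abstract refers to.

```python
# A toy cost/time performance model for data-parallel DL training.
# All constants below are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    num_vms: int      # cluster size (number of data-parallel workers)
    batch_size: int   # global batch size

PRICE_PER_VM_HOUR = 0.90        # assumed $/VM-hour
COMPUTE_TIME_PER_SAMPLE = 2e-3  # assumed seconds of compute per sample
COMM_TIME_PER_VM = 0.05         # assumed seconds of gradient aggregation per VM

def time_per_iteration(cfg: Config) -> float:
    """Parallel efficiency: per-iteration compute shrinks as VMs are added,
    while communication (gradient aggregation) grows with cluster size."""
    compute = COMPUTE_TIME_PER_SAMPLE * cfg.batch_size / cfg.num_vms
    comm = COMM_TIME_PER_VM * cfg.num_vms
    return compute + comm

def iterations_to_converge(cfg: Config,
                           base_iters: float = 1e5,
                           base_bs: int = 128) -> float:
    """Statistical efficiency: larger batches need fewer iterations, with
    diminishing returns (a toy square-root curve, not a fitted model)."""
    return base_iters * (base_bs / cfg.batch_size) ** 0.5

def training_time_and_cost(cfg: Config) -> tuple[float, float]:
    """Return (hours to converge, dollar cost) under the toy model."""
    hours = iterations_to_converge(cfg) * time_per_iteration(cfg) / 3600
    cost = hours * cfg.num_vms * PRICE_PER_VM_HOUR
    return hours, cost

def pareto_front(configs):
    """Keep configurations not dominated in both training time and cost."""
    points = [(training_time_and_cost(c), c) for c in configs]
    front = []
    for (t, cost), c in points:
        dominated = any(t2 <= t and c2 <= cost and (t2, c2) != (t, cost)
                        for (t2, c2), _ in points)
        if not dominated:
            front.append((c, t, cost))
    return front

candidates = [Config(n, b) for n in (2, 4, 8, 16)
                           for b in (128, 256, 512, 1024)]
for cfg, hours, cost in pareto_front(candidates):
    print(f"{cfg}: {hours:.2f} h, ${cost:.2f}")
```

Enumerating the front offline like this assumes the model's constants are known in advance; the abstract's point is that Scavenger instead searches for such configurations online, treating the training job as a black box.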