{"title":"Scavenger: A Cloud Service for Optimizing Cost and Performance of DL Training","authors":"S. Tyagi","doi":"10.1109/CCGridW59191.2023.00081","DOIUrl":null,"url":null,"abstract":"Deep learning (DL) models learn non-linear functions and relationships by iteratively training on given data. To accelerate training further, data-parallel training [1] launches multiple instances of training process on separate partitions of data and periodically aggregates model updates. With the availability of VMs in the cloud, choosing the “right“ cluster configuration for data-parallel training presents non-trivial challenges. We tackle this problem by considering both the parallel and statistical efficiency of distributed training w.r.t. the cluster size configuration and batch-size in training. We build performance models to evaluate the pareto-relationship between cost and time of DL training across different cluster and batch-size configurations and develop Scavenger as a cloud service for searching optimum cloud configurations in an online, blackbox manner.","PeriodicalId":341115,"journal":{"name":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGridW59191.2023.00081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Deep learning (DL) models learn non-linear functions and relationships by iteratively training on given data. To accelerate training further, data-parallel training [1] launches multiple instances of the training process on separate partitions of the data and periodically aggregates model updates. With the availability of VMs in the cloud, choosing the "right" cluster configuration for data-parallel training presents non-trivial challenges. We tackle this problem by considering both the parallel and statistical efficiency of distributed training with respect to the cluster size and the training batch size. We build performance models to evaluate the Pareto relationship between the cost and time of DL training across different cluster and batch-size configurations, and develop Scavenger as a cloud service for searching for optimal cloud configurations in an online, black-box manner.
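To make the cost-time trade-off concrete, the following is a minimal sketch (in Python) of how such a performance model over cluster size and batch size might look. It is not Scavenger's actual model: the constants, the communication-cost term, and the diminishing-returns convergence curve are all hypothetical placeholders chosen for illustration. The Pareto-front enumeration at the end shows the kind of non-dominated configuration set the abstract refers to.

```python
# A toy cost/time performance model for data-parallel DL training.
# All constants below are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    num_vms: int      # cluster size (number of data-parallel workers)
    batch_size: int   # global batch size

PRICE_PER_VM_HOUR = 0.90        # assumed $/VM-hour
COMPUTE_TIME_PER_SAMPLE = 2e-3  # assumed seconds of compute per sample
COMM_TIME_PER_VM = 0.05         # assumed seconds of gradient aggregation per VM

def time_per_iteration(cfg: Config) -> float:
    """Parallel efficiency: per-iteration compute shrinks as VMs are added,
    while communication (gradient aggregation) grows with cluster size."""
    compute = COMPUTE_TIME_PER_SAMPLE * cfg.batch_size / cfg.num_vms
    comm = COMM_TIME_PER_VM * cfg.num_vms
    return compute + comm

def iterations_to_converge(cfg: Config,
                           base_iters: float = 1e5,
                           base_bs: int = 128) -> float:
    """Statistical efficiency: larger batches need fewer iterations, with
    diminishing returns (a toy square-root curve, not a fitted model)."""
    return base_iters * (base_bs / cfg.batch_size) ** 0.5

def training_time_and_cost(cfg: Config) -> tuple[float, float]:
    """Return (hours to converge, dollar cost) under the toy model."""
    hours = iterations_to_converge(cfg) * time_per_iteration(cfg) / 3600
    cost = hours * cfg.num_vms * PRICE_PER_VM_HOUR
    return hours, cost

def pareto_front(configs):
    """Keep configurations not dominated in both training time and cost."""
    points = [(training_time_and_cost(c), c) for c in configs]
    front = []
    for (t, cost), c in points:
        dominated = any(t2 <= t and c2 <= cost and (t2, c2) != (t, cost)
                        for (t2, c2), _ in points)
        if not dominated:
            front.append((c, t, cost))
    return front

candidates = [Config(n, b) for n in (2, 4, 8, 16)
                           for b in (128, 256, 512, 1024)]
for cfg, hours, cost in pareto_front(candidates):
    print(f"{cfg}: {hours:.2f} h, ${cost:.2f}")
```

Enumerating the front offline like this assumes the model's constants are known in advance; the abstract's point is that Scavenger instead searches for such configurations online, treating the training job as a black box.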