A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems

Impact Factor 5.3 · CAS Zone 2 (Computer Science) · JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS
Federica Filippini;Jonatha Anselmi;Danilo Ardagna;Bruno Gaujal
{"title":"A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems","authors":"Federica Filippini;Jonatha Anselmi;Danilo Ardagna;Bruno Gaujal","doi":"10.1109/TCC.2023.3336540","DOIUrl":null,"url":null,"abstract":"In this work, we optimize the scheduling of Deep Learning (DL) training jobs from the perspective of a Cloud Service Provider running a data center, which efficiently selects resources for the execution of each job to minimize the average energy consumption while satisfying time constraints. To model the problem, we first develop a Mixed-Integer Non-Linear Programming formulation. Unfortunately, the computation of an optimal solution is prohibitively expensive, and to overcome this difficulty, we design a heuristic STochastic Scheduler (STS). Exploiting the probability distribution of early termination, STS determines how to adapt the resource assignment during the execution of the jobs to minimize the expected energy cost while meeting the job due dates. The results of an extensive experimental evaluation show that STS guarantees significantly better results than other methods in the literature, effectively avoiding due date violations and yielding a percentage total cost reduction between 32% and 80% on average. We also prove the applicability of our method in real-world scenarios, as obtaining optimal schedules for systems of up to 100 nodes and 400 concurrent jobs requires less than 5 seconds. Finally, we evaluated the effectiveness of GPU sharing, i.e., running multiple jobs in a single GPU. The obtained results demonstrate that depending on the workload and GPU memory, this further reduces the energy cost by 17–29% on average.","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":null,"pages":null},"PeriodicalIF":5.3000,"publicationDate":"2023-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cloud Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10328678/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

In this work, we optimize the scheduling of Deep Learning (DL) training jobs from the perspective of a Cloud Service Provider running a data center, which efficiently selects resources for the execution of each job to minimize the average energy consumption while satisfying time constraints. To model the problem, we first develop a Mixed-Integer Non-Linear Programming formulation. Unfortunately, the computation of an optimal solution is prohibitively expensive, and to overcome this difficulty, we design a heuristic STochastic Scheduler (STS). Exploiting the probability distribution of early termination, STS determines how to adapt the resource assignment during the execution of the jobs to minimize the expected energy cost while meeting the job due dates. The results of an extensive experimental evaluation show that STS guarantees significantly better results than other methods in the literature, effectively avoiding due date violations and yielding a total cost reduction between 32% and 80% on average. We also prove the applicability of our method in real-world scenarios, as obtaining optimal schedules for systems of up to 100 nodes and 400 concurrent jobs requires less than 5 seconds. Finally, we evaluate the effectiveness of GPU sharing, i.e., running multiple jobs on a single GPU. The obtained results demonstrate that, depending on the workload and GPU memory, this further reduces the energy cost by 17% to 29% on average.
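The abstract does not detail the STS heuristic, but the core trade-off it describes, i.e., choosing the cheapest GPU assignment whose expected completion time, accounting for possible early termination, still meets the due date, can be illustrated with a minimal sketch. Everything below (the Config catalogue, power figures, throughputs, the single early-stopping point, and the helper names expected_energy_and_time and pick_config) is an illustrative assumption, not the paper's formulation or values.

```python
from dataclasses import dataclass

@dataclass
class Config:
    gpus: int            # number of GPUs in this assignment
    power_w: float       # total power draw of the assignment (watts)
    iters_per_s: float   # training throughput with this assignment

def expected_energy_and_time(cfg, total_iters, p_early, early_frac=0.5):
    """Expected energy (joules) and runtime (seconds) for one job.

    Assumption: with probability p_early the job stops after early_frac
    of its iterations (e.g., an early-stopping criterion fires);
    otherwise it runs all total_iters iterations.
    """
    t_full = total_iters / cfg.iters_per_s
    t_early = early_frac * t_full
    exp_time = p_early * t_early + (1.0 - p_early) * t_full
    return cfg.power_w * exp_time, exp_time

def pick_config(configs, total_iters, p_early, due_date_s):
    """Lowest expected-energy configuration whose expected completion
    time meets the due date; None if no configuration qualifies."""
    best, best_energy = None, float("inf")
    for cfg in configs:
        energy, t = expected_energy_and_time(cfg, total_iters, p_early)
        if t <= due_date_s and energy < best_energy:
            best, best_energy = cfg, energy
    return best

if __name__ == "__main__":
    # Hypothetical catalogue of GPU assignments for a single job.
    catalogue = [
        Config(gpus=1, power_w=300.0, iters_per_s=10.0),
        Config(gpus=2, power_w=550.0, iters_per_s=18.0),
        Config(gpus=4, power_w=1050.0, iters_per_s=30.0),
    ]
    best = pick_config(catalogue, total_iters=100_000, p_early=0.3, due_date_s=9_000)
    print(best)  # -> Config(gpus=1, power_w=300.0, iters_per_s=10.0)
```

In the paper, resource selection is instead driven by the MINLP formulation and the STS heuristic over the distribution of early termination, jointly across many concurrent jobs and nodes; the sketch above only shows the single-job expected-cost comparison that motivates exploiting early termination.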
Source journal
IEEE Transactions on Cloud Computing (Computer Science – Software)
CiteScore: 9.40
Self-citation rate: 6.20%
Articles published: 167
Journal description: The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.