A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems

Impact Factor 5.3 · CAS Zone 2 (Computer Science) · JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS
Federica Filippini;Jonatha Anselmi;Danilo Ardagna;Bruno Gaujal
{"title":"A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems","authors":"Federica Filippini;Jonatha Anselmi;Danilo Ardagna;Bruno Gaujal","doi":"10.1109/TCC.2023.3336540","DOIUrl":null,"url":null,"abstract":"In this work, we optimize the scheduling of Deep Learning (DL) training jobs from the perspective of a Cloud Service Provider running a data center, which efficiently selects resources for the execution of each job to minimize the average energy consumption while satisfying time constraints. To model the problem, we first develop a Mixed-Integer Non-Linear Programming formulation. Unfortunately, the computation of an optimal solution is prohibitively expensive, and to overcome this difficulty, we design a heuristic STochastic Scheduler (STS). Exploiting the probability distribution of early termination, STS determines how to adapt the resource assignment during the execution of the jobs to minimize the expected energy cost while meeting the job due dates. The results of an extensive experimental evaluation show that STS guarantees significantly better results than other methods in the literature, effectively avoiding due date violations and yielding a percentage total cost reduction between 32% and 80% on average. We also prove the applicability of our method in real-world scenarios, as obtaining optimal schedules for systems of up to 100 nodes and 400 concurrent jobs requires less than 5 seconds. Finally, we evaluated the effectiveness of GPU sharing, i.e., running multiple jobs in a single GPU. The obtained results demonstrate that depending on the workload and GPU memory, this further reduces the energy cost by 17–29% on average.","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":null,"pages":null},"PeriodicalIF":5.3000,"publicationDate":"2023-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cloud Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10328678/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

In this work, we optimize the scheduling of Deep Learning (DL) training jobs from the perspective of a Cloud Service Provider running a data center, which efficiently selects resources for the execution of each job to minimize the average energy consumption while satisfying time constraints. To model the problem, we first develop a Mixed-Integer Non-Linear Programming formulation. Unfortunately, the computation of an optimal solution is prohibitively expensive, and to overcome this difficulty, we design a heuristic STochastic Scheduler (STS). Exploiting the probability distribution of early termination, STS determines how to adapt the resource assignment during the execution of the jobs to minimize the expected energy cost while meeting the job due dates. The results of an extensive experimental evaluation show that STS guarantees significantly better results than other methods in the literature, effectively avoiding due date violations and yielding a total cost reduction between 32% and 80% on average. We also prove the applicability of our method in real-world scenarios, as obtaining optimal schedules for systems of up to 100 nodes and 400 concurrent jobs requires less than 5 seconds. Finally, we evaluate the effectiveness of GPU sharing, i.e., running multiple jobs on a single GPU. The obtained results demonstrate that, depending on the workload and GPU memory, this further reduces the energy cost by 17% to 29% on average.
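The abstract does not detail the STS heuristic, but the core trade-off it describes, i.e., choosing the cheapest GPU assignment whose expected completion time, accounting for possible early termination, still meets the due date, can be illustrated with a minimal sketch. Everything below (the Config catalogue, power figures, throughputs, the single early-stopping point, and the helper names expected_energy_and_time and pick_config) is an illustrative assumption, not the paper's formulation or values.

```python
from dataclasses import dataclass

@dataclass
class Config:
    gpus: int            # number of GPUs in this assignment
    power_w: float       # total power draw of the assignment (watts)
    iters_per_s: float   # training throughput with this assignment

def expected_energy_and_time(cfg, total_iters, p_early, early_frac=0.5):
    """Expected energy (joules) and runtime (seconds) for one job.

    Assumption: with probability p_early the job stops after early_frac
    of its iterations (e.g., an early-stopping criterion fires);
    otherwise it runs all total_iters iterations.
    """
    t_full = total_iters / cfg.iters_per_s
    t_early = early_frac * t_full
    exp_time = p_early * t_early + (1.0 - p_early) * t_full
    return cfg.power_w * exp_time, exp_time

def pick_config(configs, total_iters, p_early, due_date_s):
    """Lowest expected-energy configuration whose expected completion
    time meets the due date; None if no configuration qualifies."""
    best, best_energy = None, float("inf")
    for cfg in configs:
        energy, t = expected_energy_and_time(cfg, total_iters, p_early)
        if t <= due_date_s and energy < best_energy:
            best, best_energy = cfg, energy
    return best

if __name__ == "__main__":
    # Hypothetical catalogue of GPU assignments for a single job.
    catalogue = [
        Config(gpus=1, power_w=300.0, iters_per_s=10.0),
        Config(gpus=2, power_w=550.0, iters_per_s=18.0),
        Config(gpus=4, power_w=1050.0, iters_per_s=30.0),
    ]
    best = pick_config(catalogue, total_iters=100_000, p_early=0.3, due_date_s=9_000)
    print(best)  # -> Config(gpus=1, power_w=300.0, iters_per_s=10.0)
```

In the paper, resource selection is instead driven by the MINLP formulation and the STS heuristic over the distribution of early termination, jointly across many concurrent jobs and nodes; the sketch above only shows the single-job expected-cost comparison that motivates exploiting early termination.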
Source journal
IEEE Transactions on Cloud Computing (Computer Science – Software)
CiteScore: 9.40
Self-citation rate: 6.20%
Articles published: 167
Journal description: The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.