Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning

Qi-Dong Ding, Pengfei Zheng, Shreyas Kudari, S. Venkataraman, Zhao-jie Zhang
{"title":"Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning","authors":"Qi-Dong Ding, Pengfei Zheng, Shreyas Kudari, S. Venkataraman, Zhao-jie Zhang","doi":"10.48550/arXiv.2306.14086","DOIUrl":null,"url":null,"abstract":"Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower the research productivity and QoS for services that are deployed in production. To mitigate the issues from interruption, we propose the design of a proactive provisioner and investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forest, xgboost, Deep Q-Network, and policy gradient. Using production job traces from three GPU clusters, we train each model using a subset of the trace and then evaluate their generality using the remaining validation subset. We introduce Mirage, a Slurm-compatible resource provisioner that integrates the candidate ML methods. 
Our experiments show that the Mirage can reduce interruption by 17--100% and safeguard 23%-76% of jobs with zero interruption across varying load levels on the three clusters.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2306.14086","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall-clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower research productivity and the QoS of services deployed in production. To mitigate these interruption issues, we propose the design of a proactive provisioner and investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forests, XGBoost, Deep Q-Networks, and policy gradient methods. Using production job traces from three GPU clusters, we train each model on a subset of each trace and then evaluate its generality on the remaining validation subset. We introduce Mirage, a Slurm-compatible resource provisioner that integrates the candidate ML methods. Our experiments show that Mirage can reduce interruption by 17–100% and safeguard 23–76% of jobs with zero interruption across varying load levels on the three clusters.
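To illustrate the kind of RL formulation the abstract alludes to, the toy sketch below trains a tabular, epsilon-greedy Q-learning policy that decides whether to proactively provision capacity given a cluster load level. Everything here is hypothetical (the states, actions, and reward shape are invented for illustration) and is not the paper's actual Mirage design, which uses Deep Q-Networks and policy gradients on real job traces.

```python
import random

random.seed(0)

# Hypothetical toy environment: states are cluster load levels,
# actions are 0 = rely on the batch queue, 1 = proactively provision.
STATES = ["low", "med", "high"]
ACTIONS = [0, 1]

def reward(state, action):
    # Invented reward: proactive provisioning avoids interruption under
    # high load but wastes capacity under low load.
    if state == "high":
        return 1.0 if action == 1 else -1.0
    if state == "low":
        return 1.0 if action == 0 else -0.5
    return 0.5 if action == 1 else 0.0

# Tabular Q-learning with an epsilon-greedy exploration policy.
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, epsilon = 0.1, 0.2

for _ in range(5000):
    s = random.choice(STATES)
    if random.random() < epsilon:
        a = random.choice(ACTIONS)          # explore
    else:
        a = max(ACTIONS, key=lambda x: Q[(s, x)])  # exploit
    # One-step (bandit-style) update; a full DQN would bootstrap from
    # the next state's value instead.
    Q[(s, a)] += alpha * (reward(s, a) - Q[(s, a)])

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(policy)  # → {'low': 0, 'med': 1, 'high': 1}
```

The learned policy provisions proactively under medium and high load and falls back to the batch queue under low load, mirroring the trade-off between interruption avoidance and resource waste that a production provisioner must balance.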