Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Pub Date: 2023-06-25. DOI: 10.48550/arXiv.2306.14086

Qi-Dong Ding, Pengfei Zheng, Shreyas Kudari, S. Venkataraman, Zhao-jie Zhang



Abstract

Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall-clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower research productivity and the QoS of services deployed in production. To mitigate the issues caused by interruption, we propose the design of a proactive provisioner and investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forest, XGBoost, Deep Q-Network, and policy gradient. Using production job traces from three GPU clusters, we train each model on a subset of the trace and then evaluate its generality on the remaining validation subset. We introduce Mirage, a Slurm-compatible resource provisioner that integrates the candidate ML methods. Our experiments show that Mirage can reduce interruption by 17%-100% and safeguard 23%-76% of jobs with zero interruption across varying load levels on the three clusters.
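To make the notion of interruption concrete, the following is a minimal sketch of how the gap between two chained batch jobs could be computed, and why submitting the successor early can shrink it. It is an illustration only: the function name `chain_outcome`, the `Outcome` fields, the hour units, and the idea of tracking overlap alongside interruption are assumptions made here, not details stated in the abstract, and the sketch does not represent Mirage's models or its Slurm integration.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    interruption: float  # hours with no job running between the two jobs
    overlap: float       # hours during which both jobs hold resources


def chain_outcome(current_end: float, submit_time: float, queue_wait: float) -> Outcome:
    """Gap or overlap between a running job and its successor (hypothetical model).

    current_end: time at which the running job reaches its wall-clock limit
    submit_time: time at which the successor job is submitted to the queue
    queue_wait:  time the successor spends queued before it starts
    """
    successor_start = submit_time + queue_wait
    gap = successor_start - current_end
    # A positive gap is an interruption; a negative gap means the jobs overlap.
    return Outcome(interruption=max(0.0, gap), overlap=max(0.0, -gap))


# Reactive: the successor is submitted only when the current job ends.
reactive = chain_outcome(current_end=48.0, submit_time=48.0, queue_wait=6.0)

# Proactive: the successor is submitted 5 hours before the current job ends.
proactive = chain_outcome(current_end=48.0, submit_time=43.0, queue_wait=6.0)

# Too early: the successor starts while the current job is still running.
too_early = chain_outcome(current_end=48.0, submit_time=40.0, queue_wait=6.0)
```

In this toy model the queue wait is known in advance. On a real cluster it is not, which is presumably why the choice of when to submit is treated as a prediction and decision problem for the learned models listed in the abstract.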


