Deep Reinforcement Agent for Failure-aware Job Scheduling in High-Performance Computing

K. Yang, Rongyu Cao, Yueyuan Zhou, Jiawei Zhang, En Shao, Guangming Tan

2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS), December 2021. DOI: 10.1109/ICPADS53394.2021.00061
Citations: 2
Abstract
Job scheduling is crucial in high-performance computing (HPC): it decides when jobs are admitted to the system and on which resources they are placed, while balancing multiple scheduling goals. As clusters grow in scale and resource variety and deep learning training (DLT) workloads proliferate, job failures have become a common issue in HPC, hurting both user satisfaction and cluster utilization. To mitigate the impact of hardware and software errors as much as possible, in this paper we tackle the problem of failure-aware job scheduling in HPC clusters. Inspired by the success of prior work on deep reinforcement learning-driven job scheduling, we propose a novel HPC scheduling agent named FARS (Failure-aware RL-based Scheduler) that accounts for the effects of job failures. On the one hand, a neural network maps raw cluster and job state information to job placement decisions. On the other hand, to capture the influence of job failures on user satisfaction and cluster utilization, FARS uses the makespan of the entire workload as its training objective. Additionally, effective exploration and experience replay techniques are applied to obtain a well-converged agent. To evaluate the capability of FARS, we design extensive trace-based simulation experiments with popular DLT workloads. The experimental results show that, compared with the best baseline model, FARS achieves a 5.69% improvement in average makespan across different device error rates. Together, these results make FARS an ideal candidate for a failure-aware job scheduler in HPC clusters.
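To make the scheduling loop the abstract describes more concrete (a neural network mapping cluster and job state to placement decisions, a makespan-driven reward, exploration, and experience replay), the following is a minimal Python/PyTorch sketch. The paper does not publish its architecture or hyperparameters, so every class name, feature dimension, the reward definition, and the policy-gradient update below are illustrative assumptions rather than the authors' implementation; in particular, pairing a replay buffer with a vanilla policy-gradient update is a simplification of whatever training scheme FARS actually uses.

    # Illustrative sketch only: all names, dimensions, and training details
    # are assumptions, not the published FARS implementation.
    import random
    from collections import deque

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PolicyNet(nn.Module):
        """Maps raw cluster-state + job-state features to placement logits."""
        def __init__(self, state_dim: int, num_nodes: int, hidden: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_nodes),  # one logit per candidate node
            )

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            return self.net(state)

    class FailureAwareAgent:
        """Hypothetical agent: epsilon-greedy exploration plus experience
        replay, trained so that reward = negative makespan contribution."""
        def __init__(self, state_dim: int, num_nodes: int,
                     lr: float = 1e-3, eps: float = 0.1, buf: int = 10_000):
            self.policy = PolicyNet(state_dim, num_nodes)
            self.opt = torch.optim.Adam(self.policy.parameters(), lr=lr)
            self.eps = eps
            self.num_nodes = num_nodes
            self.replay = deque(maxlen=buf)

        def select_placement(self, state, healthy_mask):
            """healthy_mask[i] is True if node i is up and has free resources."""
            if random.random() < self.eps:  # exploration
                return random.choice(
                    [i for i in range(self.num_nodes) if healthy_mask[i]])
            with torch.no_grad():
                logits = self.policy(torch.as_tensor(state, dtype=torch.float32))
                # Never place jobs on failed or full nodes.
                logits[~torch.as_tensor(healthy_mask)] = float("-inf")
            return int(torch.argmax(logits))

        def store(self, state, action, reward):
            self.replay.append((state, action, reward))

        def update(self, batch_size: int = 32):
            """REINFORCE-style update over replayed placement decisions;
            failures that force job reruns lengthen the makespan and so
            translate directly into lower reward."""
            if len(self.replay) < batch_size:
                return
            batch = random.sample(self.replay, batch_size)
            states = torch.tensor([s for s, _, _ in batch], dtype=torch.float32)
            actions = torch.tensor([a for _, a, _ in batch])
            rewards = torch.tensor([r for _, _, r in batch], dtype=torch.float32)
            log_probs = F.log_softmax(self.policy(states), dim=-1)
            chosen = log_probs[torch.arange(batch_size), actions]
            loss = -(chosen * rewards).mean()  # policy-gradient surrogate
            self.opt.zero_grad()
            loss.backward()
            self.opt.step()

In a trace-driven simulator of the kind the paper evaluates on, one would call select_placement each time a job reaches the head of the queue, record the resulting change in workload makespan (negated) as the reward, and call update periodically as the replay buffer fills.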