Deep Reinforcement Agent for Failure-aware Job scheduling in High-Performance Computing

K. Yang, Rongyu Cao, Yueyuan Zhou, Jiawei Zhang, En Shao, Guangming Tan
DOI: 10.1109/ICPADS53394.2021.00061
Venue: 2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS)
Published: 2021-12-01
Citations: 2

Abstract

Job scheduling is crucial in high-performance computing (HPC): it decides when jobs are admitted to the system and on which resources they are placed, while balancing multiple scheduling goals. With the growth of heterogeneous resources and the rise of deep learning training (DLT) workloads, job failure has become a common issue in HPC that degrades both user satisfaction and cluster utilization. To mitigate the impact of hardware and software errors as much as possible, this paper tackles the problem of failure-aware job scheduling in HPC clusters. Inspired by prior successes of deep reinforcement learning-driven job scheduling, we propose FARS (Failure-aware RL-based Scheduler), a novel HPC scheduling agent that accounts for the effects of job failures. On the one hand, a neural network maps raw cluster and job state information to job placement decisions. On the other hand, to capture the influence of job failures on user satisfaction and cluster utilization, FARS uses the makespan of the entire workload as its training objective. Effective exploration and experience replay techniques are further applied to obtain a well-converged agent. To evaluate the capability of FARS, we design extensive trace-based simulation experiments on popular DLT workloads. The experimental results show that FARS improves average makespan by 5.69% over the best baseline model across different device error rates. Together, these results make FARS an ideal candidate for a failure-aware job scheduler in HPC clusters.
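The training recipe the abstract outlines (an agent that places each arriving job on a node, failures that force reruns, a makespan-based reward, epsilon-greedy exploration, and experience replay) can be sketched in miniature. The sketch below is purely illustrative: it substitutes a toy Q-table for the paper's neural network, and all names (`ReplayBuffer`, `TabularScheduler`, `run_episode`, the failure model where a failed job reruns once) are assumptions, not the authors' implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store (state, action, reward, next_state)
    transitions and sample random mini-batches for updates."""
    def __init__(self, capacity=1000):
        self.buf = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buf.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buf, min(batch_size, len(self.buf)))

class TabularScheduler:
    """Toy stand-in for the paper's neural policy: a Q-table keyed by
    (node-availability snapshot, job duration), with one action per node."""
    def __init__(self, n_nodes, lr=0.1, gamma=0.9):
        self.q = {}
        self.n_nodes = n_nodes
        self.lr, self.gamma = lr, gamma

    def q_values(self, state):
        return self.q.setdefault(state, [0.0] * self.n_nodes)

    def act(self, state, eps=0.2):
        if random.random() < eps:               # epsilon-greedy exploration
            return random.randrange(self.n_nodes)
        qs = self.q_values(state)
        return qs.index(max(qs))

    def update(self, batch):
        # One-step Q-learning update over a replayed mini-batch.
        for s, a, r, s2 in batch:
            target = r + self.gamma * max(self.q_values(s2))
            qs = self.q_values(s)
            qs[a] += self.lr * (target - qs[a])

def run_episode(agent, durations, fail_rate, buffer, batch_size=16):
    """Place each job on a node; a failed job reruns (doubling its
    effective duration). Reward penalizes the growing makespan."""
    node_free = [0.0] * agent.n_nodes       # time each node becomes free
    for d in durations:
        state = (tuple(round(t) for t in node_free), d)
        a = agent.act(state)
        eff = d * 2 if random.random() < fail_rate else d
        node_free[a] += eff
        reward = -max(node_free)            # makespan-based objective
        next_state = (tuple(round(t) for t in node_free), d)
        buffer.push(state, a, reward, next_state)
        agent.update(buffer.sample(batch_size))
    return max(node_free)                   # episode makespan
```

Repeating `run_episode` over many workloads is what lets replayed transitions decorrelate updates; in the paper this role is played by the neural agent trained end to end on trace-driven simulations.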