Deep Reinforcement Agent for Failure-aware Job Scheduling in High-Performance Computing

K. Yang, Rongyu Cao, Yueyuan Zhou, Jiawei Zhang, En Shao, Guangming Tan

2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS), December 2021. DOI: 10.1109/ICPADS53394.2021.00061
Citations: 2
Abstract
Job scheduling is crucial in high-performance computing (HPC): it decides when jobs are admitted to the system and on which resources they are placed, while balancing multiple scheduling goals. As clusters grow in scale and resource variety and deep learning training (DLT) workloads proliferate, job failures have become a common issue in HPC, hurting both user satisfaction and cluster utilization. To mitigate the impact of hardware and software errors as much as possible, in this paper we tackle the problem of failure-aware job scheduling in HPC clusters. Inspired by the success of prior work on deep reinforcement learning-driven job scheduling, we propose a novel HPC scheduling agent named FARS (Failure-aware RL-based Scheduler) that accounts for the effects of job failures. On the one hand, a neural network maps raw cluster and job state information to job placement decisions. On the other hand, to capture the influence of job failures on user satisfaction and cluster utilization, FARS uses the makespan of the entire workload as its training objective. Additionally, effective exploration and experience replay techniques are applied to obtain a well-converged agent. To evaluate the capability of FARS, we design extensive trace-based simulation experiments with popular DLT workloads. The experimental results show that, compared with the best baseline model, FARS achieves a 5.69% improvement in average makespan across different device error rates. Together, these results make FARS an ideal candidate for a failure-aware job scheduler in HPC clusters.
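To make the scheduling loop the abstract describes more concrete (a neural network mapping cluster and job state to placement decisions, a makespan-driven reward, exploration, and experience replay), the following is a minimal Python/PyTorch sketch. The paper does not publish its architecture or hyperparameters, so every class name, feature dimension, the reward definition, and the policy-gradient update below are illustrative assumptions rather than the authors' implementation; in particular, pairing a replay buffer with a vanilla policy-gradient update is a simplification of whatever training scheme FARS actually uses.

    # Illustrative sketch only: all names, dimensions, and training details
    # are assumptions, not the published FARS implementation.
    import random
    from collections import deque

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PolicyNet(nn.Module):
        """Maps raw cluster-state + job-state features to placement logits."""
        def __init__(self, state_dim: int, num_nodes: int, hidden: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_nodes),  # one logit per candidate node
            )

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            return self.net(state)

    class FailureAwareAgent:
        """Hypothetical agent: epsilon-greedy exploration plus experience
        replay, trained so that reward = negative makespan contribution."""
        def __init__(self, state_dim: int, num_nodes: int,
                     lr: float = 1e-3, eps: float = 0.1, buf: int = 10_000):
            self.policy = PolicyNet(state_dim, num_nodes)
            self.opt = torch.optim.Adam(self.policy.parameters(), lr=lr)
            self.eps = eps
            self.num_nodes = num_nodes
            self.replay = deque(maxlen=buf)

        def select_placement(self, state, healthy_mask):
            """healthy_mask[i] is True if node i is up and has free resources."""
            if random.random() < self.eps:  # exploration
                return random.choice(
                    [i for i in range(self.num_nodes) if healthy_mask[i]])
            with torch.no_grad():
                logits = self.policy(torch.as_tensor(state, dtype=torch.float32))
                # Never place jobs on failed or full nodes.
                logits[~torch.as_tensor(healthy_mask)] = float("-inf")
            return int(torch.argmax(logits))

        def store(self, state, action, reward):
            self.replay.append((state, action, reward))

        def update(self, batch_size: int = 32):
            """REINFORCE-style update over replayed placement decisions;
            failures that force job reruns lengthen the makespan and so
            translate directly into lower reward."""
            if len(self.replay) < batch_size:
                return
            batch = random.sample(self.replay, batch_size)
            states = torch.tensor([s for s, _, _ in batch], dtype=torch.float32)
            actions = torch.tensor([a for _, a, _ in batch])
            rewards = torch.tensor([r for _, _, r in batch], dtype=torch.float32)
            log_probs = F.log_softmax(self.policy(states), dim=-1)
            chosen = log_probs[torch.arange(batch_size), actions]
            loss = -(chosen * rewards).mean()  # policy-gradient surrogate
            self.opt.zero_grad()
            loss.backward()
            self.opt.step()

In a trace-driven simulator of the kind the paper evaluates on, one would call select_placement each time a job reaches the head of the queue, record the resulting change in workload makespan (negated) as the reward, and call update periodically as the replay buffer fills.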