Xiaowei Jiang, Jiaheng Zou, Jingyan Shi, R. Du, Qingbao Hu, Zhenyu Sun, Hongnan Tan
Studies on Job Queue Health and Problem Recovery
DOI: 10.22323/1.327.0018 (https://doi.org/10.22323/1.327.0018)
Published in: Proceedings of International Symposium on Grids and Clouds 2018 in conjunction with Frontiers in Computational Drug Discovery, PoS(ISGC 2018 & FCDD)
Publication date: 2018-12-12
Abstract
In a batch system, the job queue manages a set of jobs. Job health is a primary concern for both users and administrators. A job's state (queuing, running, completed, error, held, and so on) reflects its health. Jobs normally move from one state to another, but a job that stays in one state for too long may indicate a problem such as a worker node failure or a network blockage. In a large-scale computing cluster such problems are unavoidable, so some jobs will become stuck in one state and fail to complete within the expected time, which delays the overall computing task. This paper studies the causes of unhealthy job states, problem handling, and job queue stability. The aim is to improve job health and thereby raise the job success rate and speed up users' work. The causes of unhealthy states can be found in job attributes, queue information, and logs, which can be analyzed in detail to arrive at better solutions. The solutions fall into two categories, depending on who performs the recovery. In the first category, recovery is performed on the administrator side: most problems are solved automatically through integration with the monitoring system, and once a problem is solved the affected job is rescheduled promptly, without involving the user. In the second category, users are notified automatically so that they can deal with the unhealthy jobs themselves; based on the results of the analysis, helpful suggestions may be offered to speed up recovery. Based on these methods, a job queue health system has been designed and implemented at IHEP. We define a set of criteria for identifying unhealthy jobs. Various factors related to unhealthy jobs are collected and analyzed together. When unhealthy jobs can be recovered on the administrator side, automatic recovery functions are invoked. When they must be recovered on the user side, alerts are sent to users via email, WeChat, and other channels. Operational experience with the job queue health system indicates that it improves job queue health in most situations.
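The detection and routing logic described in the abstract can be illustrated with a minimal Python sketch. It is not the IHEP implementation: the abstract does not state the actual criteria, so the per-state time limits, the cause labels, and the names `Job`, `is_unhealthy`, `recovery_side`, and `triage` are all hypothetical. The sketch flags a job as unhealthy when it has stayed in one state longer than an assumed limit, then assigns it to administrator-side or user-side recovery based on an assumed cause label.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical per-state time limits; the real criteria are not given in the abstract.
MAX_TIME_IN_STATE = {
    "queuing": timedelta(hours=24),
    "running": timedelta(hours=72),
    "held": timedelta(hours=1),
}

# Hypothetical cause labels that administrator-side recovery could handle.
ADMIN_RECOVERABLE = {"worker_node_failure", "network_blocking"}


@dataclass
class Job:
    job_id: str
    owner: str
    state: str
    state_since: datetime          # when the job entered its current state
    cause: Optional[str] = None    # assumed to come from a separate analysis step


def is_unhealthy(job: Job, now: datetime) -> bool:
    """A job is unhealthy if it has exceeded the limit for its current state."""
    limit = MAX_TIME_IN_STATE.get(job.state)
    return limit is not None and now - job.state_since > limit


def recovery_side(job: Job) -> str:
    """Route to the administrator side when the cause is one it can fix, else to the user."""
    return "admin" if job.cause in ADMIN_RECOVERABLE else "user"


def triage(jobs, now):
    """Group the IDs of unhealthy jobs by the side responsible for recovery."""
    plan = {"admin": [], "user": []}
    for job in jobs:
        if is_unhealthy(job, now):
            plan[recovery_side(job)].append(job.job_id)
    return plan


if __name__ == "__main__":
    now = datetime(2018, 12, 12, 12, 0)
    jobs = [
        Job("1", "alice", "running", now - timedelta(hours=80), "worker_node_failure"),
        Job("2", "bob", "held", now - timedelta(hours=2), "bad_input_path"),
        Job("3", "carol", "queuing", now - timedelta(hours=1)),
    ]
    print(triage(jobs, now))  # {'admin': ['1'], 'user': ['2']}
```

In a real deployment the "admin" list would feed the automatic recovery and rescheduling step, and the "user" list would feed the notification step (email, WeChat); both are left out here because the abstract does not describe their interfaces.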