Studies on Job Queue Health and Problem Recovery

Proceedings of International Symposium on Grids and Clouds 2018 in conjunction with Frontiers in Computational Drug Discovery — PoS(ISGC 2018 & FCDD). Pub Date: 2018-12-12. DOI: 10.22323/1.327.0018

Xiaowei Jiang, Jiaheng Zou, Jingyan Shi, R. Du, Qingbao Hu, Zhenyu Sun, Hongnan Tan



Abstract

In a batch system, the job queue manages a set of jobs. Job health is the issue that matters most to users and administrators. A job's state (queuing, running, completed, error, held, and so on) reflects its health. Normally, jobs move from one state to another. If a job stays in one state for too long, however, there may be a problem such as a worker node failure or network blocking. In a large-scale computing cluster such problems cannot be avoided, so a number of jobs will be blocked in one state and will not complete within the expected time, which delays the progress of the computing task. This paper therefore studies the causes of unhealthy job states, problem handling, and job queue stability. The aim is to improve job health and thereby raise the job success rate and speed up users' task progress. The causes of unhealthy states can be found in job attributes, queue information, and logs, which can be analyzed in detail to arrive at better solutions. Solutions are grouped into two categories according to who performs the recovery. In the first category, the administrator recovers the job. Most of these problems are solved automatically through integration with the monitoring system; once a problem is solved, the corresponding job is rescheduled promptly, without involving the user. In the second category, users are notified automatically so that they can deal with unhealthy jobs themselves. Based on the results of the analysis, helpful suggestions may be recommended to users for quick recovery. Building on these methods, a job queue health system has been designed and implemented at IHEP. We define a series of standards for identifying unhealthy jobs. Various factors related to unhealthy jobs are collected and analyzed together. When unhealthy jobs can be recovered on the administrator side, automatic recovery functions are run. When they must be recovered on the user side, alarms are sent to users via email, WeChat, and other channels. The running status of the job queue health system indicates that it is able to improve job queue health in most situations.
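The detection and routing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the IHEP implementation: the abstract does not give the actual standards, so the per-state time limits, the cause labels, and the names `Job`, `is_unhealthy`, and `triage` are all hypothetical. The sketch flags a job as unhealthy when it has stayed in one state longer than a limit, then places it in the administrator-side or user-side category according to a cause that some earlier analysis step is assumed to have attached.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class JobState(Enum):
    QUEUING = "queuing"
    RUNNING = "running"
    HELD = "held"
    ERROR = "error"
    COMPLETED = "completed"


# Hypothetical limits on time spent in one state; the real standards
# used at IHEP are not stated in the abstract.
MAX_TIME_IN_STATE = {
    JobState.QUEUING: timedelta(hours=24),
    JobState.RUNNING: timedelta(hours=72),
    JobState.HELD: timedelta(hours=1),
    JobState.ERROR: timedelta(minutes=10),
}

# Hypothetical cause labels treated as recoverable on the administrator side.
ADMIN_SIDE_CAUSES = {"worker_node_failure", "network_blocking"}


@dataclass
class Job:
    job_id: str
    owner: str
    state: JobState
    state_since: datetime      # when the job entered its current state
    cause: str = "unknown"     # assumed output of an earlier analysis step


def is_unhealthy(job: Job, now: datetime) -> bool:
    """Return True if the job has exceeded the limit for its current state."""
    limit = MAX_TIME_IN_STATE.get(job.state)
    # States without a limit (here: COMPLETED) are never flagged.
    return limit is not None and now - job.state_since > limit


def triage(jobs: list[Job], now: datetime) -> tuple[list[Job], list[Job]]:
    """Split unhealthy jobs into (administrator-side, user-side) lists."""
    admin_side, user_side = [], []
    for job in jobs:
        if not is_unhealthy(job, now):
            continue
        if job.cause in ADMIN_SIDE_CAUSES:
            admin_side.append(job)   # candidates for automatic recovery
        else:
            user_side.append(job)    # owner would receive an alarm
    return admin_side, user_side
```

Calling `triage(jobs, datetime.now())` returns the two lists. In this sketch a job with an unrecognized cause goes to its owner, which is a simplifying assumption; a real system might instead escalate such jobs to an administrator.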
