Anomaly Detection Based on Job Monitoring Metrics in Distributed System

2018 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC). Pub Date: 2018-12-01. DOI: 10.1109/SPAC46244.2018.8965641

Meixiang Ding, Zhi Xiong, Jian Yu


Citations: 1

Abstract

In distributed systems, application delays caused by stragglers have become a common problem, and interference from competition for resources can produce more stragglers. Previous work mostly focuses on straggler detection using statistical analysis of data extracted from logs. These methods cannot provide fine-grained insights to help users optimize their programs. In this paper, we propose an anomaly detection approach that applies a machine learning classification method to job monitoring resource metrics. Due to interference, the metrics may vary randomly as the job progresses. To compare the metrics across different situations, we extract job, stage, and task information from the logs. From the perspective of system resource utilization, we detect three kinds of anomalies: stragglers (tasks), abnormal jobs, and interfered nodes. We show that, in most situations, more stragglers occur under interference, that the task time that defines a straggler is longer than the task time of other tasks in a stage of similar duration, and that the nodes hosting abnormal jobs are the interfered nodes. We use the task times within the same stage to label the data, then train an adaptive boosting classifier solely on the resource features. In this way, the model can detect stragglers, abnormal jobs, and interfered nodes in real time. Experiments show that the accuracy of anomaly detection reaches 92%. Case studies show that our framework is effective in detecting abnormal jobs and interfered nodes.
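The labeling-then-classification idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the 1.5x-median straggler threshold, the feature names `cpu_util` and `io_wait`, and the toy data are all assumptions made for illustration, and a tiny pure-Python AdaBoost over decision stumps stands in for whatever adaptive boosting model the paper actually trains.

```python
import math
from collections import defaultdict
from statistics import median

# Assumption: a task is a straggler if it runs longer than 1.5x the median
# task time of its own stage. The abstract does not state the threshold.
STRAGGLER_FACTOR = 1.5

def label_stragglers(tasks, factor=STRAGGLER_FACTOR):
    """Label each task +1 (straggler) or -1 relative to its own stage."""
    by_stage = defaultdict(list)
    for t in tasks:
        by_stage[t["stage"]].append(t["duration"])
    medians = {s: median(d) for s, d in by_stage.items()}
    return [1 if t["duration"] > factor * medians[t["stage"]] else -1
            for t in tasks]

def stump_predict(row, j, thr, sign):
    """Decision stump: predict `sign` if feature j exceeds thr, else -sign."""
    return sign if row[j] > thr else -sign

def best_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if stump_predict(row, j, thr, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def train_adaboost(X, y, rounds=5):
    """Discrete AdaBoost: reweight samples so later stumps focus on errors."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, thr, sign = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) / division by 0
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, sign))
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, thr, sign))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, row):
    """Weighted vote of all stumps; +1 means 'straggler'."""
    score = sum(a * stump_predict(row, j, thr, s) for a, j, thr, s in model)
    return 1 if score > 0 else -1

# Toy data (hypothetical): task durations come from logs, the two resource
# metrics (cpu_util, io_wait) come from monitoring.
tasks = [
    {"stage": 0, "duration": 10,  "cpu_util": 0.80, "io_wait": 0.05},
    {"stage": 0, "duration": 11,  "cpu_util": 0.70, "io_wait": 0.10},
    {"stage": 0, "duration": 12,  "cpu_util": 0.90, "io_wait": 0.08},
    {"stage": 0, "duration": 30,  "cpu_util": 0.30, "io_wait": 0.60},
    {"stage": 1, "duration": 100, "cpu_util": 0.80, "io_wait": 0.12},
    {"stage": 1, "duration": 105, "cpu_util": 0.75, "io_wait": 0.07},
    {"stage": 1, "duration": 110, "cpu_util": 0.85, "io_wait": 0.10},
    {"stage": 1, "duration": 98,  "cpu_util": 0.90, "io_wait": 0.05},
    {"stage": 1, "duration": 200, "cpu_util": 0.20, "io_wait": 0.70},
]

labels = label_stragglers(tasks)                       # labels use task time only
X = [[t["cpu_util"], t["io_wait"]] for t in tasks]     # features use metrics only
model = train_adaboost(X, labels)
```

Note that the 100-second task in stage 1 is not labeled a straggler even though it runs longer than the 30-second straggler in stage 0, because each task is compared only with tasks in its own stage. Once trained, `predict(model, [cpu_util, io_wait])` needs only the monitored metrics, which is what would allow detection while a job is still running.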



