Anomaly Detection Based on Job Monitoring Metrics in Distributed System
Authors: Meixiang Ding, Zhi Xiong, Jian Yu
Published in: 2018 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), 2018-12-01
DOI: 10.1109/SPAC46244.2018.8965641 (https://doi.org/10.1109/SPAC46244.2018.8965641)
Citations: 1
Abstract
In distributed systems, application delays caused by stragglers have become a common problem, and interference from competition for resources can produce more stragglers. Previous work mostly focuses on straggler detection using statistical analysis of data extracted from logs. These methods cannot provide fine-grained insights to help users optimize their programs. In this paper, we propose an anomaly detection approach that applies a machine learning classification method to job monitoring resource metrics. Due to interference, the metrics may vary randomly as the job progresses. To compare the metrics across different situations, we extract job, stage, and task information from the logs. From the perspective of system resource utilization, we detect three kinds of anomalies: stragglers (tasks), abnormal jobs, and interfered nodes. We show that, in most situations, more stragglers occur under interference, that the task time that defines a straggler is longer than that of tasks with a similar stage time, and that the nodes hosting abnormal jobs are the interfered nodes. We use the task time within the same stage to label the data and train an adaptive boosting classifier solely on the resource features. In this way, the model can detect stragglers, abnormal jobs, and interfered nodes in real time. Experiments show that the accuracy of anomaly detection reaches 92%. Case studies show that our framework is effective in detecting abnormal jobs and interfered nodes.
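The labeling and training steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the straggler threshold (a task whose duration exceeds 1.5 times the median duration of its stage) is a common convention chosen here for illustration, the feature names `cpu` and `io_wait` and the sample values are hypothetical, and the adaptive boosting classifier is a small pure-Python version built on one-feature threshold rules (decision stumps). Task time is used only to produce labels; the classifier sees only the resource features.

```python
import math
from collections import defaultdict
from statistics import median

def label_stragglers(tasks, factor=1.5):
    """Label each task +1 (straggler) or -1 (normal) relative to its own stage.
    `factor` is an assumed threshold, not a value taken from the paper."""
    by_stage = defaultdict(list)
    for t in tasks:
        by_stage[t["stage"]].append(t["duration"])
    medians = {s: median(d) for s, d in by_stage.items()}
    return [1 if t["duration"] > factor * medians[t["stage"]] else -1 for t in tasks]

def stump_predict(stump, row):
    """Apply a one-feature threshold rule (feature index, threshold, sign)."""
    j, thr, sign = stump
    return sign if row[j] > thr else -sign

def fit_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best_err, best = None, None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                stump = (j, thr, sign)
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if stump_predict(stump, row) != yi)
                if best_err is None or err < best_err:
                    best_err, best = err, stump
    return best_err, best

def adaboost_fit(X, y, rounds=5):
    """Discrete AdaBoost: reweight samples so later stumps focus on mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, stump = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Hypothetical monitoring records: one row per task.
tasks = [
    {"stage": 0, "duration": 10.0, "cpu": 0.80, "io_wait": 0.05},
    {"stage": 0, "duration": 11.0, "cpu": 0.78, "io_wait": 0.06},
    {"stage": 0, "duration": 25.0, "cpu": 0.30, "io_wait": 0.50},
    {"stage": 0, "duration": 9.5, "cpu": 0.82, "io_wait": 0.04},
    {"stage": 1, "duration": 40.0, "cpu": 0.75, "io_wait": 0.08},
    {"stage": 1, "duration": 42.0, "cpu": 0.77, "io_wait": 0.07},
    {"stage": 1, "duration": 90.0, "cpu": 0.25, "io_wait": 0.55},
    {"stage": 1, "duration": 41.0, "cpu": 0.79, "io_wait": 0.06},
]

labels = label_stragglers(tasks)                   # labels come from task time
X = [[t["cpu"], t["io_wait"]] for t in tasks]      # features exclude task time
model = adaboost_fit(X, labels)
```

Because labels are computed per stage, a 25-second task in a stage whose median is about 10 seconds is flagged, while a 42-second task in a stage whose median is about 41 seconds is not. Once trained, `adaboost_predict(model, [cpu, io_wait])` classifies a running task from its resource metrics alone, which is what makes detection before task completion plausible in this sketch.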