Anomaly Detection and Classification using Distributed Tracing and Deep Learning
Authors: S. Nedelkoski, Jorge Cardoso, O. Kao
Venue: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
Published: 2019-05-01
DOI: 10.1109/CCGRID.2019.00038 (https://doi.org/10.1109/CCGRID.2019.00038)
Citations: 40
Abstract
Artificial Intelligence for IT Operations (AIOps) combines big data and machine learning to replace a broad range of IT operations tasks, including availability, performance, and monitoring of services. By exploiting log, tracing, metric, and network data, AIOps enables the detection of faults and issues in services. This work focuses on detecting anomalies in distributed tracing records, which contain detailed information on the availability and response time of services. In large-scale distributed systems, where a service is deployed on heterogeneous hardware and has multiple scenarios of normal operation, detecting such anomalies becomes challenging. We address the problem by proposing: unsupervised response time anomaly detection based on deep learning data modeling techniques; an unsupervised dynamic error threshold approach; a tolerance module for false positive reduction; and a descriptive classification of the anomalies. The evaluation shows that the approach achieves high accuracy and solid performance in both an experimental testbed and a large-scale production cloud.
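The abstract names a dynamic error threshold and a tolerance module but does not give their exact rules. The following is a minimal Python sketch of how such a pair could fit together, under assumptions that are not taken from the paper: the threshold is set to the mean plus `k` standard deviations of the model's errors on normal traces, and the tolerance step reports an anomaly only after `min_run` consecutive exceedances. The function names, the parameters `k` and `min_run`, and the sample error values are all hypothetical.

```python
from statistics import mean, stdev


def dynamic_threshold(normal_errors, k=3.0):
    """Assumed rule: mean + k * sample std of errors observed on normal data."""
    return mean(normal_errors) + k * stdev(normal_errors)


def flag_anomalies(errors, threshold, min_run=2):
    """Assumed tolerance rule: flag a point only when it is at least the
    `min_run`-th consecutive point whose error exceeds the threshold."""
    flags = []
    run = 0  # length of the current streak of exceedances
    for e in errors:
        run = run + 1 if e > threshold else 0
        flags.append(run >= min_run)
    return flags


# Hypothetical prediction errors of a response time model on normal traces.
normal_errors = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.11]
threshold = dynamic_threshold(normal_errors)

# Hypothetical errors on new traces: one isolated spike, then a sustained burst.
new_errors = [0.10, 0.40, 0.11, 0.35, 0.42, 0.38, 0.09]
flags = flag_anomalies(new_errors, threshold, min_run=2)
print(flags)  # [False, False, False, False, True, True, False]
```

In this sketch the isolated spike at index 1 is suppressed, while the sustained burst is reported from its second point onward. This illustrates the kind of false positive reduction a tolerance step aims for, at the cost of a short detection delay.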