Anomaly Detection and Classification using Distributed Tracing and Deep Learning

2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) Pub Date : 2019-05-01 DOI:10.1109/CCGRID.2019.00038

S. Nedelkoski, Jorge Cardoso, O. Kao

{"title":"Anomaly Detection and Classification using Distributed Tracing and Deep Learning","authors":"S. Nedelkoski, Jorge Cardoso, O. Kao","doi":"10.1109/CCGRID.2019.00038","DOIUrl":null,"url":null,"abstract":"Artificial Intelligence for IT Operations (AIOps) combines big data and machine learning to replace a broad range of IT Operations tasks including availability, performance, and monitoring of services. By exploiting log, tracing, metric, and network data, AIOps enable detection of faults and issues of services. The focus of this work is on detecting anomalies based on distributed tracing records that contain detailed information for the availability and the response time of the services. In large-scale distributed systems, where a service is deployed on heterogeneous hardware and has multiple scenarios of normal operation, it becomes challenging to detect such anomalous cases. We address the problem by proposing unsupervised, response time anomaly detection based on deep learning data modeling techniques; unsupervised dynamic error threshold approach; tolerance module for false positive reduction; and descriptive classification of the anomalies. The evaluation shows that the approach achieves high accuracy and solid performance in both, experimental testbed and large-scale production cloud.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"40","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2019.00038","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 40

Abstract

Artificial Intelligence for IT Operations (AIOps) combines big data and machine learning to replace a broad range of IT Operations tasks including availability, performance, and monitoring of services. By exploiting log, tracing, metric, and network data, AIOps enable detection of faults and issues of services. The focus of this work is on detecting anomalies based on distributed tracing records that contain detailed information for the availability and the response time of the services. In large-scale distributed systems, where a service is deployed on heterogeneous hardware and has multiple scenarios of normal operation, it becomes challenging to detect such anomalous cases. We address the problem by proposing unsupervised, response time anomaly detection based on deep learning data modeling techniques; unsupervised dynamic error threshold approach; tolerance module for false positive reduction; and descriptive classification of the anomalies. The evaluation shows that the approach achieves high accuracy and solid performance in both, experimental testbed and large-scale production cloud.

查看原文本刊更多论文

基于分布式跟踪和深度学习的异常检测和分类

IT运营人工智能(AIOps)将大数据和机器学习相结合，以取代广泛的IT运营任务，包括可用性、性能和服务监控。通过利用日志、跟踪、度量和网络数据，AIOps可以检测服务的故障和问题。这项工作的重点是基于分布式跟踪记录检测异常，这些记录包含服务的可用性和响应时间的详细信息。在大规模分布式系统中，服务部署在异构硬件上，并且具有多种正常操作场景，因此检测此类异常情况变得具有挑战性。我们通过提出基于深度学习数据建模技术的无监督响应时间异常检测来解决这个问题;无监督动态误差阈值法;减少假阳性的容忍模块;以及异常的描述性分类。结果表明，该方法在实验测试平台和大规模生产云中均具有较高的精度和稳定的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)

自引率

0.00%

发文量