Root cause detection in a service-oriented architecture

Measurement and Modeling of Computer Systems. Pub Date: 2013-06-17. DOI: 10.1145/2465529.2465753

Myunghwan Kim, Roshan Sumbaly, Sam Shah

Citations: 113

Abstract

Large-scale websites are predominantly built as a service-oriented architecture. Here, services are specialized for a certain task, run on multiple machines, and communicate with each other to serve a user's request. An anomalous change in a metric of one service can propagate to other services during this communication, resulting in overall degradation of the request. As any such degradation is revenue-impacting, maintaining correct functionality is of paramount concern: it is important to find the root cause of any anomaly as quickly as possible. This is challenging because there are numerous metrics or sensors for a given service, and a modern website is usually composed of hundreds of services running on thousands of machines in multiple data centers.

This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors, to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, show a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.
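
The abstract names the two inputs (per-sensor time-series metrics and the call graph) but not the ranking procedure. The Python sketch below is one plausible way to combine them and is not the paper's MonitorRank algorithm. It makes two assumptions of its own: Pearson correlation with the anomalous frontend metric serves as the similarity score, and a personalized-PageRank-style random walk over the call graph produces the ranking. The service names, the metric values, and the `damping` and `iterations` parameters are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_root_causes(call_graph, metrics, frontend, damping=0.85, iterations=50):
    """Rank services as candidate root causes of an anomaly seen at `frontend`.

    call_graph: dict mapping a caller to the list of services it calls.
    metrics:    dict mapping every service to its metric time series.
    Assumption (not from the paper): similarity is |Pearson correlation|.
    """
    nodes = list(metrics)
    candidates = [v for v in nodes if v != frontend]
    sim = {v: abs(pearson(metrics[frontend], metrics[v])) for v in nodes}

    # Teleport distribution: proportional to similarity, never to the frontend.
    total = sum(sim[v] for v in candidates)
    teleport = {v: 0.0 for v in nodes}
    for v in candidates:
        teleport[v] = sim[v] / total if total > 0 else 1.0 / len(candidates)

    # Power iteration: walk caller -> callee, biased toward similar callees.
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1 - damping) * teleport[v] for v in nodes}
        for u in nodes:
            callees = [c for c in call_graph.get(u, []) if c in sim]
            weight = sum(sim[c] for c in callees)
            if weight > 0:
                for c in callees:
                    nxt[c] += damping * score[u] * sim[c] / weight
            else:
                # No useful callee: redistribute via the teleport distribution.
                for v in nodes:
                    nxt[v] += damping * score[u] * teleport[v]
        score = nxt

    return sorted(candidates, key=lambda v: score[v], reverse=True)


# Hypothetical toy data: a latency spike at time steps 3 and 4.
call_graph = {"frontend": ["profile", "ads"], "profile": ["db"]}
metrics = {
    "frontend": [1, 1, 1, 5, 5, 1],
    "profile": [1, 1, 1, 4, 5, 1],
    "ads": [2, 1, 2, 1, 2, 1],
    "db": [1, 1, 1, 5, 5, 1],
}
print(rank_root_causes(call_graph, metrics, "frontend"))
```

In this toy data the `db` series matches the frontend spike exactly and sits at the end of the call chain, so the walk accumulates the most score there, while the uncorrelated `ads` service ends up last. Following call edges from caller to callee encodes the idea that an anomaly observed upstream is more likely caused by something downstream.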

Source journal

Measurement and Modeling of Computer Systems

Self-citation rate

0.00%

Articles published