Communication-Efficient Distributed Variance Monitoring and Outlier Detection for Multivariate Time Series

2014 IEEE 28th International Parallel and Distributed Processing Symposium | Pub Date: 2014-05-19 | DOI: 10.1109/IPDPS.2014.16

Moshe Gabel, A. Schuster, D. Keren

Citations: 31

Abstract

Modern scale-out services are comprised of thousands of individual machines, which must be continuously monitored for unexpected failures. One recent approach to monitoring is latent fault detection, an adaptive statistical framework for scale-out, load-balanced systems. By periodically measuring hundreds of performance metrics and looking for outlier machines, it attempts to detect subtle problems such as misconfigurations, bugs, and malfunctioning hardware, before they manifest as machine failures. Previous work on a large, real-world Web service has shown that many failures are indeed preceded by such latent faults. Latent fault detection is an offline framework with large bandwidth and processing requirements. Each machine must send all its measurements to a centralized location, which is prohibitive in some settings and requires data-parallel processing infrastructure. In this work we adapt the latent fault detector to provide an online, communication- and computation-reduced version. We utilize stream processing techniques to trade accuracy for communication and computation. We first describe a novel communication-efficient online distributed variance monitoring algorithm that provides a continuous estimate of the global variance within guaranteed approximation bounds. Using the variance monitor, we provide an online distributed outlier detection framework for non-stationary multivariate time series common in scale-out systems. The adapted framework reduces data size and central processing cost by processing the data in situ, making it usable in wider settings. Like the original framework, our adaptation admits different comparison functions, supports non-stationary data, and provides statistical guarantees on the rate of false positives. Simulations on logs from a production system show that we are able to reduce bandwidth by an order of magnitude, with below 1% error compared to the original algorithm.
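
The abstract does not give the details of the variance monitoring algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each machine keeps a small summary (count, mean, sum of squared deviations) for one metric, that a coordinator merges these summaries with the standard pairwise combination formula to recover the global variance without receiving raw measurements, and that a machine re-sends its summary only when its local mean or variance has drifted by more than a tolerance. The names `Summary`, `merge`, `global_variance`, `Node`, and `tol` are hypothetical, and this naive drift rule does not provide the guaranteed approximation bounds the paper claims for its own algorithm.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Summary:
    """Per-machine statistics for one metric (hypothetical helper type)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def add(self, x: float) -> None:
        # Welford's online update for a single new measurement.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Population variance; defined as 0.0 for an empty summary.
        return self.m2 / self.n if self.n else 0.0


def merge(a: Summary, b: Summary) -> Summary:
    # Pairwise combination of two summaries (Chan et al. formula).
    n = a.n + b.n
    if n == 0:
        return Summary()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


def global_variance(summaries: Iterable[Summary]) -> float:
    # Coordinator side: fold all machine summaries into one.
    total = Summary()
    for s in summaries:
        total = merge(total, s)
    return total.variance()


class Node:
    """A machine that reports only when its local statistics drift.

    Assumption: a fixed absolute tolerance `tol` on mean and variance.
    This is a naive rule with no formal error guarantee.
    """

    def __init__(self, tol: float) -> None:
        self.tol = tol
        self.local = Summary()
        self.last_sent: Optional[Summary] = None

    def observe(self, x: float) -> Optional[Summary]:
        # Returns a summary to send, or None if no message is needed.
        self.local.add(x)
        if self.last_sent is None or self._drifted():
            self.last_sent = Summary(self.local.n, self.local.mean, self.local.m2)
            return self.last_sent
        return None

    def _drifted(self) -> bool:
        old, new = self.last_sent, self.local
        return (abs(new.mean - old.mean) > self.tol
                or abs(new.variance() - old.variance()) > self.tol)
```

In this sketch the merge step is exact: if every machine sent its summary at every step, the coordinator would compute the true global variance. The bandwidth saving comes entirely from `Node.observe` suppressing messages while local statistics are stable, at the cost of the coordinator working with slightly stale summaries.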
