Tale of Tails: Anomaly Avoidance in Data Centers

2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS) | Pub Date: 2016-09-01 | DOI: 10.1109/SRDS.2016.021

Ji Xue, R. Birke, L. Chen, E. Smirni


Citations: 5

Abstract

Today's cloud data centers commonly guard performance by monitoring resource usage, e.g., CPU and RAM, and issuing anomaly tickets whenever usage exceeds predefined target values. Keeping a data center free of such usage anomalies can be extremely challenging when a large number of virtual machines (VMs) with bursty workloads share a limited amount of physical resources. Using resource usage data from production data centers that comprise more than 6K physical machines hosting more than 80K VMs, we identify statistical properties of anomaly instances (AIs) on physical servers, highlighting their burst durations and potential root causes. To strike a tradeoff between strong performance guarantees and resource provisioning, we propose TailGuard, a tail-driven anomaly avoidance policy for boxes (physical servers). TailGuard tolerates a small fraction of AIs, e.g., 5% of usage samples may exceed the target value, while still avoiding severe performance degradation, which is typically caused by a burst of consecutive AIs. Specifically, TailGuard first introduces a novel usage tail prediction that explores similarity patterns across a large number of boxes within a very recent history, and then redistributes server load in an online fashion through proactive VM cloning and reactive load balancing. Evaluation results show that TailGuard not only achieves accuracy comparable to prediction methods that rely on a long history of usage data but also reduces the number of CPU AIs by 60%, with a tenfold reduction in their duration, from more than 25 time windows to only 2.
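The quantities the abstract relies on (anomaly instances, their burst durations, and a tolerated tail fraction such as 5%) can be illustrated with a short sketch. This is not TailGuard's prediction or load-redistribution logic, which the abstract does not specify. The function names, the per-window usage list, and the 60% target value are all assumptions made for illustration.

```python
from typing import List, Sequence


def anomaly_flags(usage: Sequence[float], target: float) -> List[bool]:
    """Mark each time window whose usage exceeds the target value (an AI)."""
    return [u > target for u in usage]


def burst_lengths(flags: Sequence[bool]) -> List[int]:
    """Return the length of every run of consecutive anomalous windows."""
    runs: List[int] = []
    current = 0
    for is_anomaly in flags:
        if is_anomaly:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:  # close a burst that lasts until the final window
        runs.append(current)
    return runs


def within_tail_budget(
    usage: Sequence[float], target: float, allowed_fraction: float = 0.05
) -> bool:
    """Check whether the share of windows above target stays within budget."""
    flags = anomaly_flags(usage, target)
    return sum(flags) <= allowed_fraction * len(flags)


# Hypothetical CPU usage (%) for one box over ten time windows.
cpu_usage = [50, 70, 65, 40, 30, 62, 20, 10, 15, 25]
flags = anomaly_flags(cpu_usage, target=60)   # assumed target value
print(burst_lengths(flags))                   # burst durations in windows
print(within_tail_budget(cpu_usage, target=60))
```

Separating the tail budget from the burst lengths mirrors the abstract's distinction between tolerating a small fraction of AIs and avoiding long runs of consecutive ones.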

