Analysis and modeling of time-correlated failures in large-scale distributed systems

2010 11th IEEE/ACM International Conference on Grid Computing, pp. 65-72. Pub Date: 2010-10-01. DOI: 10.1109/GRID.2010.5697961

N. Yigitbasi, M. Gallet, Derrick Kondo, A. Iosup, D. Epema


Citations: 66

Abstract

Analyzing and modeling the failures bound to occur in today's large-scale production systems is invaluable for gaining the understanding needed to make these systems fault-tolerant yet efficient. Many previous studies have modeled failures without taking their time-varying behavior into account, under the assumption that failures are independent and identically distributed. However, the presence of time correlations between failures (such as peak periods with an increased failure rate) refutes this assumption and can have a significant impact on the effectiveness of fault-tolerance mechanisms. For example, a proactive fault-tolerance mechanism is more effective if the failures are periodic or predictable; similarly, the performance of checkpointing, redundancy, and scheduling solutions depends on the frequency of failures. In this study we analyze and model the time-varying behavior of failures in large-scale distributed systems. Our study is based on nineteen failure traces obtained from (mostly) production large-scale distributed systems, including grids, P2P systems, DNS servers, web servers, and desktop grids. We first investigate the time correlation of failures, and find that many of the studied traces exhibit strong daily patterns and high autocorrelation. Then, we derive a model that focuses on the peak failure periods occurring in real large-scale distributed systems. Our model characterizes four quantities: the duration of peaks, the peak inter-arrival time, the inter-arrival time of failures during peaks, and the duration of failures during peaks. For each quantity we determine the best-fitting probability distribution from a set of several candidate distributions, and present the parameters of the best fit. Last, we validate our model against the nineteen real failure traces, and find that the failures it characterizes are responsible on average for over 50%, and up to 95%, of the downtime of these systems.
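The time-correlation analysis and the peak quantities named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the one-hour binning, the peak threshold of mean plus `k` standard deviations, the function names, and the synthetic trace. The fitting of candidate probability distributions is not shown.

```python
from statistics import mean, pstdev


def hourly_counts(failure_times, horizon_hours):
    """Count failures per one-hour bin; times are hours since trace start."""
    counts = [0] * horizon_hours
    for t in failure_times:
        idx = int(t)
        if 0 <= idx < horizon_hours:
            counts[idx] += 1
    return counts


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mu = mean(series)
    var = sum((x - mu) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[i] - mu) * (series[i + lag] - mu) for i in range(n - lag))
    return cov / var


def find_peaks(counts, k=1.0):
    """Return [start, end) bin ranges whose counts exceed mean + k * std.

    The threshold rule is an assumption of this sketch.
    """
    threshold = mean(counts) + k * pstdev(counts)
    peaks, start = [], None
    for i, c in enumerate(counts):
        if c > threshold and start is None:
            start = i
        elif c <= threshold and start is not None:
            peaks.append((start, i))
            start = None
    if start is not None:
        peaks.append((start, len(counts)))
    return peaks


def peak_statistics(peaks):
    """Peak durations and peak inter-arrival times, both in hours."""
    durations = [end - start for start, end in peaks]
    inter_arrivals = [b[0] - a[0] for a, b in zip(peaks, peaks[1:])]
    return durations, inter_arrivals


# Synthetic one-week trace (hypothetical data): one background failure per day
# plus a two-hour burst every afternoon.
trace = []
for day in range(7):
    base = day * 24
    trace.append(base + 3.5)
    trace.extend(base + 14 + j / 6 for j in range(6))
    trace.extend(base + 15 + j / 5 for j in range(5))

counts = hourly_counts(trace, horizon_hours=7 * 24)
peaks = find_peaks(counts, k=1.0)
durations, inter_arrivals = peak_statistics(peaks)
print("autocorrelation at lag 24 h:", round(autocorrelation(counts, 24), 3))
print("peak durations (h):", durations)
print("peak inter-arrival times (h):", inter_arrivals)
```

On a real trace, the lists returned by `peak_statistics` would be the samples to which candidate distributions are fitted; a high autocorrelation at a lag of 24 hours corresponds to the kind of daily pattern the abstract reports.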



