{"title":"Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management","authors":"S. Fu, Chengzhong Xu","doi":"10.1109/SRDS.2007.18","DOIUrl":null,"url":null,"abstract":"Networked computing systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Moreover, failure events exhibit strong correlations in time and space domain. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take into account the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Grid, show the offline and online predictions by our predicting system can forecast 72.7% to 85.3% of the failure occurrences and capture failure correlations in cluster coalition environment.","PeriodicalId":224921,"journal":{"name":"2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"72","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SRDS.2007.18","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 72
Abstract
Networked computing systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Moreover, failure events exhibit strong correlations in time and space domain. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take into account the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Grid, show the offline and online predictions by our predicting system can forecast 72.7% to 85.3% of the failure occurrences and capture failure correlations in cluster coalition environment.
网络计算系统在规模和其组件和交互的复杂性方面继续增长。在这些环境中,组件故障成为规范而不是异常。此外,失效事件在时间和空间上表现出很强的相关性。在本文中,我们建立了一个具有可调时间尺度参数的球形协方差模型来量化时间相关性,并建立了一个随机模型来表征空间相关性。在此基础上对模型进行了扩展,考虑了应用程序分配信息,以发现故障实例之间更多的相关性。我们根据它们的相关性对故障事件进行聚类,并预测它们未来的发生情况。在生产联盟系统Wayne State Grid上的实验结果表明,该预测系统的离线和在线预测可以预测集群联盟环境下72.7% ~ 85.3%的故障发生,并捕获故障相关性。