An Empirical Failure-Analysis of a Large-Scale Cloud Computing Environment

P. Garraghan, P. Townend, Jie Xu
Published in: 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering
Publication date: 2014-01-09
DOI: 10.1109/HASE.2014.24 (https://doi.org/10.1109/HASE.2014.24)
Citations: 66

Abstract

Cloud computing research is in great need of statistical parameters derived from the analysis of real-world systems. One aspect of this is the failure characteristics of Cloud environments composed of workloads and servers; currently, few metrics are available that quantify the failure and repair times of workloads and servers at large scale. Workload metrics in particular are critical for characterizing and accurately modeling workload behavior, enabling more realistic simulation of workloads and of system failure scenarios. This paper presents an analysis of failure data from a large-scale production Cloud environment (consisting of over 12,500 servers), and includes a study of failure and repair times and characteristics for both Cloud workloads and servers. Our results show that failure characteristics for workloads and servers are highly variable and that production Cloud workloads can be accurately modeled by a Gamma distribution. Repair times range from 30 seconds to 4 days for workloads and from 25 minutes to 8 days for servers.
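As a rough illustration of what modeling a failure metric with a Gamma distribution can involve, the sketch below estimates the shape and scale parameters from a set of durations using the method of moments. The abstract does not say which quantity was fitted or which fitting procedure the authors used, so the function name `fit_gamma_moments`, the choice of estimator, and the synthetic sample data are all assumptions made for illustration; none of the numbers come from the paper.

```python
import random
import statistics


def fit_gamma_moments(samples):
    """Estimate Gamma shape and scale by the method of moments.

    For a Gamma distribution, mean = shape * scale and
    variance = shape * scale**2, so shape = mean**2 / variance
    and scale = variance / mean.
    """
    mean = statistics.fmean(samples)
    var = statistics.variance(samples)  # unbiased sample variance
    shape = mean * mean / var
    scale = var / mean
    return shape, scale


# Synthetic durations only (hypothetical, NOT data from the paper):
# draw samples from a known Gamma distribution, then recover its parameters.
rng = random.Random(0)
true_shape, true_scale = 2.0, 3.0
samples = [rng.gammavariate(true_shape, true_scale) for _ in range(20000)]

shape, scale = fit_gamma_moments(samples)
print(f"estimated shape={shape:.2f}, scale={scale:.2f}")
```

The method of moments is used here only because it needs nothing beyond the standard library; a maximum-likelihood fit would normally be preferred when a numerical library is available.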