{"title":"What Can We Learn from Four Years of Data Center Hardware Failures?","authors":"Guosai Wang, Lifei Zhang, W. Xu","doi":"10.1109/DSN.2017.26","DOIUrl":null,"url":null,"abstract":"Hardware failures have a big impact on the dependability of large-scale data centers. We present studies on over 290,000 hardware failure reports collected over the past four years from dozens of data centers with hundreds of thousands of servers. We examine the dataset statistically to discover failure characteristics along the temporal, spatial, product line and component dimensions. We specifically focus on the correlations among different failures, including batch and repeating failures, as well as the human operators' response to the failures. We reconfirm or extend findings from previous studies. We also find many new failure and recovery patterns that are the undesirable by-product of the state-of-the-art data center hardware and software design.","PeriodicalId":426928,"journal":{"name":"2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"93","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSN.2017.26","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 93
Abstract
Hardware failures have a big impact on the dependability of large-scale data centers. We present studies on over 290,000 hardware failure reports collected over the past four years from dozens of data centers with hundreds of thousands of servers. We examine the dataset statistically to discover failure characteristics along the temporal, spatial, product line and component dimensions. We specifically focus on the correlations among different failures, including batch and repeating failures, as well as the human operators' response to the failures. We reconfirm or extend findings from previous studies. We also find many new failure and recovery patterns that are the undesirable by-product of the state-of-the-art data center hardware and software design.