{"title":"我们能从四年的数据中心硬件故障中学到什么?","authors":"Guosai Wang, Lifei Zhang, W. Xu","doi":"10.1109/DSN.2017.26","DOIUrl":null,"url":null,"abstract":"Hardware failures have a big impact on the dependability of large-scale data centers. We present studies on over 290,000 hardware failure reports collected over the past four years from dozens of data centers with hundreds of thousands of servers. We examine the dataset statistically to discover failure characteristics along the temporal, spatial, product line and component dimensions. We specifically focus on the correlations among different failures, including batch and repeating failures, as well as the human operators' response to the failures. We reconfirm or extend findings from previous studies. We also find many new failure and recovery patterns that are the undesirable by-product of the state-of-the-art data center hardware and software design.","PeriodicalId":426928,"journal":{"name":"2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"93","resultStr":"{\"title\":\"What Can We Learn from Four Years of Data Center Hardware Failures?\",\"authors\":\"Guosai Wang, Lifei Zhang, W. Xu\",\"doi\":\"10.1109/DSN.2017.26\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hardware failures have a big impact on the dependability of large-scale data centers. We present studies on over 290,000 hardware failure reports collected over the past four years from dozens of data centers with hundreds of thousands of servers. We examine the dataset statistically to discover failure characteristics along the temporal, spatial, product line and component dimensions. We specifically focus on the correlations among different failures, including batch and repeating failures, as well as the human operators' response to the failures. We reconfirm or extend findings from previous studies. We also find many new failure and recovery patterns that are the undesirable by-product of the state-of-the-art data center hardware and software design.\",\"PeriodicalId\":426928,\"journal\":{\"name\":\"2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)\",\"volume\":\"50 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-06-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"93\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DSN.2017.26\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSN.2017.26","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
What Can We Learn from Four Years of Data Center Hardware Failures?
Hardware failures have a big impact on the dependability of large-scale data centers. We present studies on over 290,000 hardware failure reports collected over the past four years from dozens of data centers with hundreds of thousands of servers. We examine the dataset statistically to discover failure characteristics along the temporal, spatial, product line and component dimensions. We specifically focus on the correlations among different failures, including batch and repeating failures, as well as the human operators' response to the failures. We reconfirm or extend findings from previous studies. We also find many new failure and recovery patterns that are the undesirable by-product of the state-of-the-art data center hardware and software design.