Predicting Uncorrectable Memory Errors for Proactive Replacement: An Empirical Study on Large-Scale Field Data
Authors: Xiaoming Du, Cong Li, Shen Zhou, Mao Ye, Jing Li
Venue: 2020 16th European Dependable Computing Conference (EDCC)
Published: 2020-09-01
DOI: 10.1109/EDCC51268.2020.00016 (https://doi.org/10.1109/EDCC51268.2020.00016)
Uncorrectable memory errors are the leading cause of server failures in datacenters. Predicting uncorrectable errors (UEs) from historical correctable error (CE) information enables proactive replacement of memory hardware before catastrophic events happen. In this paper, we perform an empirical study of UE prediction on large-scale field data from more than 30,000 contemporary servers in Tencent datacenters over an 8-month period. We demonstrate that the traditional approach based on CE rate works poorly, with low precision. We then leverage detailed micro-level CE information to design several new predictors. The comparative study shows that the new predictor based on column fault identification boosts the baseline precision by a factor of more than 300% and at the same time also improves the baseline recall substantially.
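To make the comparison in the abstract concrete, the following is a minimal Python sketch of how a CE-rate baseline and a column-based predictor could be scored with precision and recall. Everything in it is an assumption made for illustration: the record fields (`dimm`, `bank`, `row`, `column`), the thresholds, and especially the column-fault rule (CEs at one bank and column spread over several distinct rows), which is only one plausible reading and not the definition used in the paper. The toy data is invented and says nothing about the paper's measured results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectableError:
    """One logged CE. All field names here are hypothetical."""
    dimm: str
    bank: int
    row: int
    column: int


def ce_rate_predictor(errors, threshold):
    """Baseline: flag DIMMs whose total CE count reaches the threshold."""
    counts = defaultdict(int)
    for e in errors:
        counts[e.dimm] += 1
    return {dimm for dimm, n in counts.items() if n >= threshold}


def column_fault_predictor(errors, min_rows):
    """Flag DIMMs where one (bank, column) has CEs in >= min_rows distinct rows.

    Assumed heuristic for illustration, not the paper's definition.
    """
    rows_seen = defaultdict(set)
    for e in errors:
        rows_seen[(e.dimm, e.bank, e.column)].add(e.row)
    return {key[0] for key, rows in rows_seen.items() if len(rows) >= min_rows}


def precision_recall(predicted, actual):
    """Precision = TP / flagged DIMMs; recall = TP / DIMMs that had a UE."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Invented toy data: three DIMMs with four CEs each, only one of which
# later suffers a UE.
errors = (
    # dimm_a: same bank and column, four different rows
    [CorrectableError("dimm_a", 0, row, 7) for row in range(4)]
    # dimm_b: same row, four different columns
    + [CorrectableError("dimm_b", 1, 3, col) for col in range(4)]
    # dimm_c: the same cell reported four times
    + [CorrectableError("dimm_c", 2, 5, 9)] * 4
)
dimms_with_ue = {"dimm_a"}

rate_flags = ce_rate_predictor(errors, threshold=4)
column_flags = column_fault_predictor(errors, min_rows=3)
rate_pr = precision_recall(rate_flags, dimms_with_ue)
column_pr = precision_recall(column_flags, dimms_with_ue)

print("CE-rate baseline:", sorted(rate_flags), rate_pr)
print("Column heuristic:", sorted(column_flags), column_pr)
```

In this toy setup the rate baseline flags all three DIMMs because they have identical CE counts, while the column heuristic flags only the DIMM whose errors line up along one column. This mirrors the kind of precision gap the abstract describes, under the stated assumptions only.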