Predicting Uncorrectable Memory Errors for Proactive Replacement: An Empirical Study on Large-Scale Field Data

Xiaoming Du, Cong Li, Shen Zhou, Mao Ye, Jing Li
Published in: 2020 16th European Dependable Computing Conference (EDCC), September 2020
DOI: 10.1109/EDCC51268.2020.00016
URL: https://doi.org/10.1109/EDCC51268.2020.00016
Citation count: 9

Abstract

Uncorrectable memory errors are a leading cause of server failures in datacenters. Predicting uncorrectable errors (UEs) from historical correctable error (CE) information enables proactive replacement of memory hardware before catastrophic events happen. In this paper, we perform an empirical study of UE prediction on large-scale field data from more than 30,000 contemporary servers in Tencent datacenters over an 8-month period. We demonstrate that the traditional approach based on CE rate performs poorly, with low precision. We then leverage detailed micro-level CE information to design several new predictors. The comparative study shows that the new predictor based on column fault identification improves on the baseline precision by more than 300% while also improving the baseline recall substantially.
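
To make the evaluation terms concrete, the following is a minimal sketch of how a CE-rate threshold predictor might be scored for precision and recall. It is not the authors' implementation: the `DimmRecord` fields, the threshold value, and the toy data are all assumptions made for illustration, and the abstract does not specify how the baseline or the column-fault predictor is actually computed.

```python
from dataclasses import dataclass


@dataclass
class DimmRecord:
    """Hypothetical per-DIMM summary; field names are illustrative only."""
    dimm_id: str
    ce_count: int   # correctable errors seen in the observation window
    had_ue: bool    # whether an uncorrectable error followed


def predict_by_ce_rate(records, threshold):
    """Flag a DIMM for replacement when its CE count reaches the threshold."""
    return {r.dimm_id for r in records if r.ce_count >= threshold}


def precision_recall(records, flagged):
    """Precision = TP / flagged DIMMs; recall = TP / DIMMs that had a UE."""
    actual = {r.dimm_id for r in records if r.had_ue}
    true_pos = len(flagged & actual)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall


# Made-up toy data, not taken from the study.
records = [
    DimmRecord("d0", 500, False),
    DimmRecord("d1", 120, True),
    DimmRecord("d2", 3, True),
    DimmRecord("d3", 250, False),
    DimmRecord("d4", 0, False),
]
flagged = predict_by_ce_rate(records, threshold=100)
precision, recall = precision_recall(records, flagged)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data, two of the three flagged DIMMs never develop a UE, and one DIMM with very few CEs does, which illustrates how a rate-only predictor can end up with low precision and still miss failures.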