HiMFP: Hierarchical Intelligent Memory Failure Prediction for Cloud Service Reliability

Qiao Yu, Wengui Zhang, Paolo Notaro, Soroush Haeri, Jorge Cardoso, O. Kao
Published in: 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Publication date: 2023-06-01
DOI: 10.1109/DSN58367.2023.00031 (https://doi.org/10.1109/DSN58367.2023.00031)
Citations: 1

Abstract

In large-scale datacenters, memory failure is one of the leading causes of server crashes, and the uncorrectable error (UCE) is the major fault type indicating defects in memory modules. Existing approaches tend to predict UCEs using correctable errors (CEs). However, bit-level CE information has not been fully explored in previous work, and CEs with error bit patterns are strongly correlated with UCE occurrences. This paper presents a novel Hierarchical Intelligent Memory Failure Prediction (HiMFP) framework that predicts UCEs on multiple levels of the memory system and can be associated with memory recovery techniques. In particular, we leverage CE addresses on multiple levels of memory, especially the bit level, and construct machine learning models based on spatial and temporal CE information. An evaluation on real-world datasets indicates that HiMFP significantly improves prediction performance over the baseline algorithm. Overall, virtual machine (VM) interruptions caused by UCEs can be reduced by around 45% using HiMFP.
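The abstract does not list the concrete features HiMFP uses, so the following is only a minimal sketch of what "spatial and temporal CE information on multiple levels" could look like in practice. The `CEEvent` schema, the three levels in `LEVELS`, the feature names, and the fixed time window are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class CEEvent:
    """One correctable-error record (hypothetical schema, not from the paper)."""
    timestamp: float            # seconds since an arbitrary epoch
    dimm: str
    bank: int
    row: int
    column: int
    error_bits: FrozenSet[int]  # bit positions flagged in the faulty word


# Assumed hierarchy levels, coarse to fine; each maps an event to a group key.
LEVELS = {
    "dimm": lambda e: (e.dimm,),
    "bank": lambda e: (e.dimm, e.bank),
    "row": lambda e: (e.dimm, e.bank, e.row),
}


def extract_features(events, now, window):
    """Aggregate CE statistics per hierarchy level.

    Temporal part: only events with now - window <= timestamp <= now count.
    Spatial part: distinct rows, columns, and error bits seen in each group.
    Returns {level: {group_key: feature_dict}}.
    """
    recent = [e for e in events if now - window <= e.timestamp <= now]
    features = {}
    for level, key_fn in LEVELS.items():
        groups = defaultdict(list)
        for e in recent:
            groups[key_fn(e)].append(e)
        features[level] = {
            key: {
                "ce_count": len(evts),
                "distinct_rows": len({(e.bank, e.row) for e in evts}),
                "distinct_columns": len({(e.bank, e.column) for e in evts}),
                "distinct_error_bits": len(
                    set().union(*(e.error_bits for e in evts))
                ),
                "max_bits_per_ce": max(len(e.error_bits) for e in evts),
            }
            for key, evts in groups.items()
        }
    return features
```

Each per-group feature dictionary could then be fed, together with a label such as "a UCE followed within some horizon", to any standard classifier. Which model, labels, and prediction horizon the authors actually use is not stated in the abstract.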