Qiao Yu, Wengui Zhang, Paolo Notaro, Soroush Haeri, Jorge Cardoso, O. Kao
2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), published 2023-06-01. DOI: 10.1109/DSN58367.2023.00031 (https://doi.org/10.1109/DSN58367.2023.00031)
HiMFP: Hierarchical Intelligent Memory Failure Prediction for Cloud Service Reliability
In large-scale datacenters, memory failure is one of the leading causes of server crashes, and uncorrectable errors (UCEs) are the major fault type indicating defects in memory modules. Existing approaches tend to predict UCEs from correctable errors (CEs). However, previous work has not fully explored bit-level CE information, even though CEs with error bit patterns are strongly correlated with UCE occurrences. In this paper, we present a novel Hierarchical Intelligent Memory Failure Prediction (HiMFP) framework, which predicts UCEs at multiple levels of the memory system and can be associated with memory recovery techniques. In particular, we leverage CE addresses at multiple levels of memory, especially the bit level, and construct machine learning models based on spatial and temporal CE information. An evaluation on real-world datasets indicates that HiMFP significantly improves prediction performance over the baseline algorithm. Overall, virtual machine (VM) interruptions caused by UCEs can be reduced by around 45% using HiMFP.
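To make the idea of spatial and temporal CE information at multiple memory levels more concrete, the following is a minimal Python sketch of how such features might be derived from a CE log. It is not the HiMFP implementation: the `CERecord` schema, the `ce_features` function, the choice of levels, and the feature names are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CERecord:
    """One correctable-error log entry (hypothetical schema, not from the paper)."""
    timestamp: float  # seconds since an arbitrary epoch
    dimm: str         # identifier of the memory module
    bank: int
    row: int
    column: int
    bit: int          # position of the erroneous bit within the data word


def ce_features(records, now, window):
    """Summarize the CEs seen in [now - window, now] at several memory levels."""
    recent = [r for r in records if now - window <= r.timestamp <= now]
    features = {"ce_count": len(recent)}

    # Spatial features: for each level, count how many distinct components
    # reported CEs and how many CEs the most affected component reported.
    levels = {
        "bank": lambda r: (r.dimm, r.bank),
        "row": lambda r: (r.dimm, r.bank, r.row),
        "column": lambda r: (r.dimm, r.bank, r.column),
        "bit": lambda r: (r.dimm, r.bit),
    }
    for name, key in levels.items():
        counts = Counter(key(r) for r in recent)
        features[f"distinct_{name}s"] = len(counts)
        features[f"max_ce_per_{name}"] = max(counts.values(), default=0)

    # Temporal feature: time elapsed since the most recent CE in the window.
    features["seconds_since_last_ce"] = (
        now - max(r.timestamp for r in recent) if recent else None
    )
    return features
```

A feature dictionary like this, computed per DIMM over a sliding window, could then serve as one input row for a classifier that predicts whether a UCE follows. Which model and which prediction horizon to use are left open here, as the abstract does not state them.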