Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems. Pub Date: 2015-03-14. DOI: 10.1145/2694344.2694348

Vilas Sridharan, Nathan Debardeleben, S. Blanchard, Kurt B. Ferreira, Jon Stearley, J. Shalf, S. Gurumurthi


Citations: 256

Abstract

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.
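The methodological point about counting errors versus faults can be illustrated with a minimal sketch. It assumes a hypothetical corrected-error log in which each record carries a physical location, and it uses the simplifying assumption that all errors reported at the same location stem from a single underlying fault. The record layout, the sample values, and the function names are illustrative only; they are not the paper's data or its analysis method.

```python
from collections import Counter

# Hypothetical error log: one tuple (node, dimm, bank, row, column) per logged error.
# These records are invented for illustration; they are not data from the paper.
error_log = [
    ("node01", "dimm3", 2, 100, 7),
    ("node01", "dimm3", 2, 100, 7),
    ("node01", "dimm3", 2, 100, 7),
    ("node01", "dimm3", 2, 100, 7),
    ("node02", "dimm0", 1, 55, 9),
    ("node03", "dimm1", 0, 12, 4),
]

def count_errors_per_node(log):
    """Count every logged error event, grouped by node."""
    return Counter(record[0] for record in log)

def count_faults_per_node(log):
    """Count distinct faulty locations per node.

    Simplifying assumption: errors at the same physical location
    are repeated manifestations of one fault.
    """
    distinct_locations = set(log)
    return Counter(record[0] for record in distinct_locations)

errors = count_errors_per_node(error_log)
faults = count_faults_per_node(error_log)

for node in sorted(errors):
    print(f"{node}: {errors[node]} errors, {faults[node]} faults")
```

In this toy log, node01 accounts for most of the errors but has the same number of faults as the other two nodes, because a single faulty location is read repeatedly. A ranking by error count would single out node01 as far less reliable, while a ranking by fault count would not.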

