E-ECC: Low Power Erasure and Error Correction Schemes for Increasing Reliability of Commodity DRAM Systems

Proceedings of the 2015 International Symposium on Memory Systems. Pub Date: 2015-10-05. DOI: 10.1145/2818950.2818961

H. Chen, A. Arunkumar, Carole-Jean Wu, T. Mudge, C. Chakrabarti


Citations: 10

Abstract

Most server-grade memory systems provide Chipkill-Correct error protection at the expense of power and/or performance overhead. In this paper we present low-overhead schemes that improve the reliability of commodity DRAM systems while delivering better power and IPC performance than Chipkill-Correct solutions. Specifically, we propose two erasure and error correction (E-ECC) schemes for x8 memory systems that have 12.5% storage overhead and do not require any change to the existing memory architecture. Both schemes have superior error performance due to the use of a strong ECC code, namely RS(36,32) over GF(2^8). Scheme 1 activates 18 chips per access and provides stronger reliability than Chipkill-Correct solutions. If the location of the faulty chip is known, Scheme 1 can correct an additional random error in a second chip. Scheme 2 trades off reliability for higher energy efficiency by activating only 9 chips per access. It cannot correct random errors due to a chip failure but can detect them with 99.9986% probability, and once a chip is marked faulty due to persistent errors, it can correct all errors due to that chip. Synthesis results at the 28nm node show that the RS(36,32) code has a very low decoding latency that can be well hidden in commodity memory systems and therefore has minimal effect on DRAM access latency. Evaluations based on SPEC CPU 2006 sequential and multi-programmed workloads show that, compared to Chipkill-Correct, the proposed Schemes 1 and 2 improve IPC by an average of 3.2% (maximum of 13.8%) and 4.8% (maximum of 31.8%) and reduce power consumption by an average of 16.2% (maximum of 25%) and 26.8% (maximum of 36%), respectively.
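The correction claims above are consistent with the standard Reed-Solomon decodability bound: a code with n - k parity symbols can recover a codeword when 2 × (symbol errors at unknown locations) + (erasures at known locations) ≤ n - k. The following is a minimal sketch of that arithmetic only, not the authors' decoder. The function names are hypothetical, and the even spread of the 36 codeword symbols across the activated chips (2 symbols per chip for 18 chips, 4 per chip for 9 chips) is an assumption made for illustration; the abstract does not state the symbol-to-chip layout.

```python
# Illustrative arithmetic for RS(36,32) over GF(2^8); not the paper's implementation.
N_SYMBOLS = 36                    # codeword length in 8-bit symbols
K_SYMBOLS = 32                    # data symbols per codeword
PARITY = N_SYMBOLS - K_SYMBOLS    # 4 parity symbols


def storage_overhead(n=N_SYMBOLS, k=K_SYMBOLS):
    """Parity symbols as a fraction of data symbols."""
    return (n - k) / k


def is_correctable(errors, erasures, n=N_SYMBOLS, k=K_SYMBOLS):
    """Standard RS bound: 2 * errors + erasures <= n - k."""
    return 2 * errors + erasures <= n - k


def symbols_per_chip(chips, n=N_SYMBOLS):
    """ASSUMPTION: codeword symbols are spread evenly over the activated chips."""
    if n % chips != 0:
        raise ValueError("symbols do not divide evenly across chips")
    return n // chips


if __name__ == "__main__":
    print(f"storage overhead: {storage_overhead():.1%}")

    # Scheme 1 (18 chips): a failed chip corrupts its share of the symbols.
    s1 = symbols_per_chip(18)
    print("Scheme 1, chip failure at unknown location:",
          is_correctable(errors=s1, erasures=0))
    print("Scheme 1, known faulty chip plus one random symbol error:",
          is_correctable(errors=1, erasures=s1))

    # Scheme 2 (9 chips): each chip holds more symbols of the codeword.
    s2 = symbols_per_chip(9)
    print("Scheme 2, chip failure at unknown location:",
          is_correctable(errors=s2, erasures=0))
    print("Scheme 2, chip already marked faulty (erasures):",
          is_correctable(errors=0, erasures=s2))
```

Under these assumptions, the bound reproduces the behaviour described in the abstract: Scheme 1 tolerates a chip failure at an unknown location and, once the chip is known, one additional symbol error elsewhere, while Scheme 2 can only correct a chip failure after the chip has been marked faulty and its symbols are treated as erasures.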

