H. Chen, A. Arunkumar, Carole-Jean Wu, T. Mudge, C. Chakrabarti
Title: E-ECC: Low Power Erasure and Error Correction Schemes for Increasing Reliability of Commodity DRAM Systems
Published in: Proceedings of the 2015 International Symposium on Memory Systems, 2015-10-05
DOI: 10.1145/2818950.2818961 (https://doi.org/10.1145/2818950.2818961)
Citations: 10
Abstract
Most server-grade memory systems provide Chipkill-Correct error protection at the cost of power and/or performance overhead. In this paper we present low-overhead schemes that improve the reliability of commodity DRAM systems while delivering better power and IPC performance than Chipkill-Correct solutions. Specifically, we propose two erasure and error correction (E-ECC) schemes for x8 memory systems that have 12.5% storage overhead and require no change to the existing memory architecture. Both schemes achieve superior error performance by using a strong ECC code, namely RS(36,32) over GF(2^8). Scheme 1 activates 18 chips per access and provides stronger reliability than Chipkill-Correct solutions. If the location of the faulty chip is known, Scheme 1 can also correct an additional random error in a second chip. Scheme 2 trades reliability for higher energy efficiency by activating only 9 chips per access. It cannot correct random errors caused by a chip failure, but it detects them with 99.9986% probability, and once a chip has been marked faulty because of persistent errors, it can correct all errors caused by that chip. Synthesis results at the 28 nm node show that the RS(36,32) code has a very low decoding latency that can be well hidden in commodity memory systems and therefore has minimal effect on DRAM access latency. Evaluations based on SPEC CPU 2006 sequential and multi-programmed workloads show that, compared to Chipkill-Correct, Schemes 1 and 2 improve IPC by an average of 3.2% (maximum 13.8%) and 4.8% (maximum 31.8%), and reduce power consumption by an average of 16.2% (maximum 25%) and 26.8% (maximum 36%), respectively.
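The capabilities stated above are consistent with the standard Reed-Solomon bound: a code with n - k parity symbols can correct t errors at unknown locations together with e erasures (errors at known locations) whenever 2t + e <= n - k. The following is a minimal sketch of that arithmetic, not the decoder described in the paper. The function names are hypothetical, and the symbols-per-chip mapping (one 36-symbol codeword spread evenly over the active chips) is an assumption inferred from the chip counts, not something the abstract states.

```python
# Illustrative sketch only. The even symbols-per-chip mapping is an ASSUMPTION
# inferred from 36 symbols / number of active chips; it is not from the paper.

N, K = 36, 32        # RS(36,32): 36 code symbols, 32 data symbols (8 bits each)
PARITY = N - K       # 4 parity symbols


def storage_overhead(n: int = N, k: int = K) -> float:
    """Parity symbols as a fraction of data symbols."""
    return (n - k) / k


def correctable(errors: int, erasures: int, parity: int = PARITY) -> bool:
    """Standard RS bound: 2 * errors + erasures must not exceed the parity count."""
    return 2 * errors + erasures <= parity


def symbols_per_chip(active_chips: int, n: int = N) -> int:
    """Assumed even spread of one codeword over the active chips."""
    return n // active_chips


if __name__ == "__main__":
    print("storage overhead:", storage_overhead())

    s1 = symbols_per_chip(18)   # Scheme 1: 18 chips per access
    print("Scheme 1, unknown faulty chip:", correctable(errors=s1, erasures=0))
    print("Scheme 1, known faulty chip + 1 random error:",
          correctable(errors=1, erasures=s1))

    s2 = symbols_per_chip(9)    # Scheme 2: 9 chips per access
    print("Scheme 2, unknown faulty chip:", correctable(errors=s2, erasures=0))
    print("Scheme 2, chip marked faulty:", correctable(errors=0, erasures=s2))
```

Under these assumptions, a failed chip corrupts 2 symbols in Scheme 1, which is within the bound whether or not its location is known. In Scheme 2 a failed chip corrupts 4 symbols, which is correctable only once the chip is marked faulty and its symbols are treated as erasures.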