Checkpointing Exascale Memory Systems with Existing Memory Technologies

Nilmini Abeyratne, H. Chen, Byoungchan Oh, R. Dreslinski, C. Chakrabarti, T. Mudge
Published in: Proceedings of the Second International Symposium on Memory Systems
Publication date: 2016-10-03
DOI: 10.1145/2989081.2989121 (https://doi.org/10.1145/2989081.2989121)
Citations: 6

Abstract

Building exascale supercomputers requires resilience to failing components such as processors, memory, storage, and network devices. Checkpoint/restart is a key ingredient in attaining resilience, but providing fast and reliable checkpointing is becoming more challenging as both the amount of data to checkpoint and the number of components that can fail increase in exascale systems. To improve checkpointing speed, emerging non-volatile memories (phase-change, magnetic, and resistive RAM) have been proposed. However, using unproven memories to create checkpoints only increases the design risk for exascale memory systems. In this paper, we show that exascale systems with hundreds of petabytes of memory can be constructed with commodity DRAM and SSD flash memory, and that newer non-volatile memories are unnecessary, at least for the next generation. The challenge when using commodity parts is providing fast and reliable checkpointing to protect against system failures. The straightforward solution of checkpointing to local flash-based SSDs will not work because they are limited in both endurance and performance. We present a checkpointing solution that combines DRAM and SSD devices. A Checkpoint Location Controller (CLC) monitors the endurance of the SSD and the performance loss of the application, and decides dynamically whether to checkpoint to the DRAM or the SSD. The CLC improves SSD endurance and reduces application slowdown, but the checkpoints in DRAM are exposed to device failures. To design a reliable exascale memory, we protect the data with a low-latency ECC that can correct all errors due to bit/pin/column/word faults and also detect errors due to chip failures, and we protect the checkpoint with a Chipkill-Correct level ECC that allows reliable checkpointing to the DRAM. Using our system, the SSD lifetime increases by about 2x, from 3 years to 6.3 years. Furthermore, the CLC reduces the average checkpointing overhead by nearly 10x (from a 420% slowdown to 47%) compared to always checkpointing to the SSD.
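
The abstract says only that the CLC watches SSD endurance and application performance loss and picks DRAM or SSD per checkpoint; it does not give the decision rule. The following is a minimal sketch of one possible rule, not the authors' algorithm. The class name, its fields, the pro-rated write budget, and the slowdown threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckpointLocationController:
    """Hypothetical CLC-style policy; every field and threshold is illustrative."""

    ssd_write_budget_bytes: float  # assumed total bytes the SSD can absorb
    target_lifetime_s: float       # assumed lifetime the budget must last
    max_slowdown: float            # tolerated overhead, e.g. 0.5 means 50%
    ssd_bytes_written: float = 0.0
    compute_time_s: float = 0.0
    checkpoint_time_s: float = 0.0

    def slowdown(self) -> float:
        """Checkpoint time as a fraction of useful compute time so far."""
        if self.compute_time_s == 0:
            return 0.0
        return self.checkpoint_time_s / self.compute_time_s

    def wear_within_budget(self, elapsed_s: float, checkpoint_bytes: float) -> bool:
        """True if this write keeps the SSD on pace for its target lifetime."""
        allowed = self.ssd_write_budget_bytes * (elapsed_s / self.target_lifetime_s)
        return self.ssd_bytes_written + checkpoint_bytes <= allowed

    def choose(self, elapsed_s: float, checkpoint_bytes: float) -> str:
        """Pick the SSD only when both wear and slowdown are acceptable."""
        wear_ok = self.wear_within_budget(elapsed_s, checkpoint_bytes)
        speed_ok = self.slowdown() <= self.max_slowdown
        return "SSD" if wear_ok and speed_ok else "DRAM"

    def record(self, location: str, checkpoint_bytes: float,
               compute_s: float, checkpoint_s: float) -> None:
        """Update the running counters after a checkpoint completes."""
        if location == "SSD":
            self.ssd_bytes_written += checkpoint_bytes
        self.compute_time_s += compute_s
        self.checkpoint_time_s += checkpoint_s


clc = CheckpointLocationController(
    ssd_write_budget_bytes=1000.0, target_lifetime_s=100.0, max_slowdown=0.5
)
first = clc.choose(elapsed_s=10.0, checkpoint_bytes=50.0)
clc.record(first, 50.0, compute_s=10.0, checkpoint_s=2.0)
print(first, round(clc.slowdown(), 2))
```

In this sketch a checkpoint falls back to DRAM whenever the SSD is being written faster than its pro-rated budget allows or the accumulated checkpoint time exceeds the tolerated fraction of compute time. A DRAM-resident checkpoint would still need the Chipkill-Correct level protection the abstract describes, which this example does not model.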