Proceedings of the Second International Symposium on Memory Systems: Latest Papers

Checkpointing Exascale Memory Systems with Existing Memory Technologies

Proceedings of the Second International Symposium on Memory Systems Pub Date : 2016-10-03 DOI: 10.1145/2989081.2989121

Nilmini Abeyratne, H. Chen, Byoungchan Oh, R. Dreslinski, C. Chakrabarti, T. Mudge

Abstract: Building exascale supercomputers requires resilience to failing components such as processor, memory, storage, and network devices. Checkpoint/restart is a key ingredient in attaining resilience, but providing fast and reliable checkpointing is becoming more challenging as the amount of data to checkpoint and the number of components that can fail increase in exascale systems. To improve the speed of checkpointing, emerging non-volatile memories (phase-change, magnetic, and resistive RAM) have been proposed. However, using unproven memories to create checkpoints will only increase the design risk for exascale memory systems. In this paper, we show that exascale systems with hundreds of petabytes of memory can be constructed with commodity DRAM and SSD flash memory, and that newer non-volatile memories are unnecessary, at least for the next generation. The challenge when using commodity parts is providing fast and reliable checkpointing to protect against system failures. A straightforward solution of checkpointing to local flash-based SSD devices will not work because they are endurance- and performance-limited. We present a checkpointing solution that employs a combination of DRAM and SSD devices. A Checkpoint Location Controller (CLC) monitors the endurance of the SSD and the performance loss of the application and decides dynamically whether to checkpoint to the DRAM or the SSD. The CLC improves both SSD endurance and application slowdown, but the checkpoints in DRAM are exposed to device failures. To design a reliable exascale memory, we protect the data with a low-latency ECC that can correct all errors due to bit/pin/column/word faults and also detect errors due to chip failures, and we protect the checkpoint with a Chipkill-Correct level ECC that allows reliable checkpointing to the DRAM. Using our system, the SSD lifetime increases by about 2x, from 3 years to 6.3 years. Furthermore, the CLC reduces the average checkpointing overhead by nearly 10x (from a 420% slowdown to 47%) compared to always checkpointing to the SSD.
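The abstract above describes the CLC only at a high level: it watches SSD endurance and application slowdown and picks a checkpoint target. A minimal sketch of that kind of decision rule might look like the following. The class name, the pro-rated write budget, and the slowdown threshold are all hypothetical illustrations, not the policy used in the paper.

```python
from dataclasses import dataclass


@dataclass
class CheckpointLocationController:
    """Hypothetical controller: choose "ssd" or "dram" for each checkpoint."""

    ssd_write_budget_bytes: float  # assumed total bytes the SSD may absorb
    target_lifetime_s: float       # assumed lifetime over which to spread them
    max_slowdown: float            # assumed cap on checkpoint time / compute time
    bytes_written: float = 0.0
    compute_time_s: float = 0.0
    checkpoint_time_s: float = 0.0

    def record_compute(self, seconds: float) -> None:
        """Account for useful application time."""
        self.compute_time_s += seconds

    def choose(self, now_s: float, size_bytes: float, est_ssd_time_s: float) -> str:
        """Pick the SSD only if both the endurance and slowdown budgets allow it."""
        # Endurance check: stay under the write budget pro-rated to elapsed time.
        allowed_so_far = self.ssd_write_budget_bytes * (now_s / self.target_lifetime_s)
        within_endurance = self.bytes_written + size_bytes <= allowed_so_far
        # Performance check: projected checkpoint overhead relative to compute time.
        projected = (self.checkpoint_time_s + est_ssd_time_s) / max(self.compute_time_s, 1e-9)
        within_slowdown = projected <= self.max_slowdown
        return "ssd" if within_endurance and within_slowdown else "dram"

    def record_checkpoint(self, location: str, size_bytes: float, seconds: float) -> None:
        """Update the counters after a checkpoint completes."""
        self.checkpoint_time_s += seconds
        if location == "ssd":
            self.bytes_written += size_bytes
```

In this sketch a checkpoint falls back to DRAM whenever either budget would be exceeded, which is why the abstract's ECC protection for DRAM-resident checkpoints matters.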

Citations: 6

The Case for Associative DRAM Caches

Proceedings of the Second International Symposium on Memory Systems Pub Date : 2016-10-03 DOI: 10.1145/2989081.2989120

Paul Tschirhart, Jim Stevens, Zeshan A. Chishti, B. Jacob

Citations: 3

A Validation of DRAM RAPL Power Measurements

Proceedings of the Second International Symposium on Memory Systems Pub Date : 2016-10-03 DOI: 10.1145/2989081.2989088

Spencer Desrochers, Chad Paradis, Vincent M. Weaver

Abstract: Recent Intel processors support the Running Average Power Limit (RAPL) interface, which among other things provides estimated energy measurements for the CPUs, integrated GPU, and DRAM. These measurements are easily accessible by the user and can be gathered by a wide variety of tools, including the Linux perf_event interface. This allows unprecedented easy access to energy information when designing and optimizing energy-aware code. While very useful, on most systems these RAPL measurements are estimated values, generated on the fly by an on-chip energy model. The values are not well documented, and the results (especially the DRAM results) have undergone only limited validation. We validate the DRAM RAPL results on both desktop and server Haswell machines, with multiple types of DDR3 and DDR4 memory. We instrument the hardware to gather actual power measurements and compare them to the RAPL values returned via Linux perf_event. We describe the many challenges encountered when instrumenting systems for detailed power measurement. We find that the RAPL results match overall energy and power trends, usually differing by a constant power offset. The results match best when the DRAM is being heavily utilized, but do not match as well when the system is idle or when an integrated GPU is using the memory. We also verify that Haswell server machines produce more accurate results, as they include actual power measurements gathered through the integrated voltage regulator.
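For readers who want to see what a DRAM RAPL reading looks like in practice, here is a minimal sketch that samples the energy counter and converts it to average power. It assumes the Linux powercap sysfs interface (`/sys/class/powercap/intel-rapl:*` with `name`, `energy_uj`, and `max_energy_range_uj` files) rather than the perf_event interface used in the paper, and it assumes the machine exposes a zone named `dram`. The function names are illustrative.

```python
import time
from pathlib import Path

POWERCAP_ROOT = Path("/sys/class/powercap")


def find_dram_zones(root: Path = POWERCAP_ROOT) -> list[Path]:
    """Return powercap zones whose name file reads 'dram' (may be empty)."""
    zones = []
    for zone in sorted(root.glob("intel-rapl:*")):
        name_file = zone / "name"
        if name_file.is_file() and name_file.read_text().strip() == "dram":
            zones.append(zone)
    return zones


def read_uj(zone: Path, filename: str) -> int:
    """Read an integer microjoule value from a zone file."""
    return int((zone / filename).read_text())


def energy_delta_uj(before: int, after: int, max_range: int) -> int:
    """Energy consumed between two readings, allowing for one counter wrap."""
    if after >= before:
        return after - before
    return after + max_range - before


def average_power_watts(delta_uj: int, seconds: float) -> float:
    """Convert a microjoule delta over an interval into watts."""
    return delta_uj / 1e6 / seconds


def sample_dram_power(interval_s: float = 1.0) -> dict[str, float]:
    """Average DRAM power per zone over one interval; needs read access to energy_uj."""
    zones = find_dram_zones()
    before = {zone: read_uj(zone, "energy_uj") for zone in zones}
    time.sleep(interval_s)
    result = {}
    for zone in zones:
        after = read_uj(zone, "energy_uj")
        max_range = read_uj(zone, "max_energy_range_uj")
        delta = energy_delta_uj(before[zone], after, max_range)
        result[zone.name] = average_power_watts(delta, interval_s)
    return result
```

Calling `sample_dram_power(1.0)` on a supported machine returns one estimate per DRAM zone. On many kernels `energy_uj` is readable only by root, and, as the abstract notes, the values are model estimates on most systems, so they are best read as trends rather than calibrated measurements.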

Citations: 87

Performance Impact of a Slower Main Memory: A case study of STT-MRAM in HPC

Proceedings of the Second International Symposium on Memory Systems Pub Date : 2016-10-03 DOI: 10.1145/2989081.2989082

Kazi Asifuzzaman, Milan Pavlović, M. Radulovic, D. Zaragoza, Oh-Jeong Kwon, K. Ryoo, Petar Radojkovic

Citations: 15

CLARA: Circular Linked-List Auto and Self Refresh Architecture

Proceedings of the Second International Symposium on Memory Systems Pub Date : 2016-10-03 DOI: 10.1145/2989081.2989084

Aditya Agrawal, Mike O'Connor, Evgeny Bolotin, Niladrish Chatterjee, J. Emer, S. Keckler

Abstract: With increasing DRAM densities, the performance and energy overheads of refresh operations are increasingly significant. When the system is active, refresh commands render DRAM banks unavailable for increasing periods of time. These refresh operations can interfere with regular memory operations and hurt performance. In addition, when the system is idle, DRAM self-refresh is the dominant source of energy consumption, and it directly impacts battery life and standby time. Prior refresh reduction techniques seek to reduce active-mode auto-refresh energy, reduce self-refresh energy, improve performance, or some combination thereof. In this paper, we present CLARA, a circular linked-list based refresh architecture which meets all three goals with very low overheads and without sacrificing DRAM capacity. This approach exploits the variation in retention time at a chip granularity, as opposed to the DIMM-wide, rank granularity of prior work. CLARA reduces auto- and self-refresh by 86.2%, independent of workload. Auto-refresh reduction improves average CPU performance by 3.1% and 6.5% in the normal and extended temperature range, respectively. GPU performance improves by 2.1% on average in the extended temperature range. DRAM idle power during self-refresh is reduced by 44%. The area overhead of CLARA in the DRAM is about 0.085% and negligible in the memory controller.

Citations: 8

DRAMScale: Mechanisms to Increase DRAM Capacity

Proceedings of the Second International Symposium on Memory Systems Pub Date : 2016-10-03 DOI: 10.1145/2989081.2989109

Krishna T. Malladi, Uksong Kang, M. Awasthi, Hongzhong Zheng

Citations: 3

Analyzing Consistency Issues in HMC Atomics

Proceedings of the Second International Symposium on Memory Systems Pub Date : 2016-10-03 DOI: 10.1145/2989081.2989104

Pranith Kumar, Lifeng Nai, Hyesoon Kim

Citations: 2

Using Memristor Technology for Multi-value Registers in Signed-digit Arithmetic Circuits

Proceedings of the Second International Symposium on Memory Systems Pub Date : 2016-10-03 DOI: 10.1145/2989081.2989124

D. Fey, M. Reichenbach, Christopher Söll, Mehrdad Biglari, Jürgen Röber, R. Weigel

Abstract: Signed-digit (SD) arithmetic exploits positive and negative digits, requiring more than two states. It has long been known that an addition using trits, i.e. where each digit stores not only a 0 or a 1 but also either 2 or -1, requires only a constant number of steps independent of the operands' word length. However, current processors could not profit from that due to the lack of fast, dense, and CMOS-compatible memory cells that can reliably store multiple states. Memristors offer these features, making it necessary to re-evaluate different SD number representations and to evaluate the consequences of implementing a multi-value register file with memristors in terms of latency, area, and energy consumption. On the one hand, using memristors as multi-value registers reduces latency and area compared to flip-flop based memories. On the other hand, this requires additional sophisticated control circuitry to implement ADCs/DACs and current-limiting circuits and to generate control signals to read, write, and erase memristors. The paper determines the break-even points at which ternary circuits attached to memristor-based registers show better energy-delay products and less area consumption, and how much power consumption these improvements cost. Layout synthesis shows that ternary adders with trit-storing memristors can reduce the latency by about 19% for a word length of 16 digits and by about 52% for 512 digits compared to a binary carry-look-ahead (CLA) adder, with nearly the same power consumption.
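The constant-step addition mentioned in the abstract can be illustrated with the textbook carry-free algorithm for binary signed-digit numbers. The sketch below assumes radix 2 with digit set {-1, 0, 1} and little-endian digit lists; it is a software illustration of the general technique, not the circuit evaluated in the paper, and the function names are made up for this example.

```python
def sd_value(digits):
    """Integer value of a little-endian radix-2 signed-digit number."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def sd_add(x, y):
    """Carry-free addition of two signed-digit numbers with digits in {-1, 0, 1}.

    Each result digit depends only on positions i, i-1 and i-2 of the inputs,
    so the depth is constant regardless of word length.
    """
    n = max(len(x), len(y))
    x = list(x) + [0] * (n - len(x))
    y = list(y) + [0] * (n - len(y))
    p = [a + b for a, b in zip(x, y)]  # position sums in [-2, 2]

    transfer = [0] * (n + 1)  # transfer[i + 1] is produced by position i
    interim = [0] * (n + 1)   # interim sum digits
    for i in range(n):
        # Looking at the lower position tells us the sign of the incoming transfer.
        lower_nonneg = i == 0 or p[i - 1] >= 0
        if p[i] == 2:
            t, w = 1, 0
        elif p[i] == 1:
            t, w = (1, -1) if lower_nonneg else (0, 1)
        elif p[i] == 0:
            t, w = 0, 0
        elif p[i] == -1:
            t, w = (0, -1) if lower_nonneg else (-1, 1)
        else:  # p[i] == -2
            t, w = -1, 0
        transfer[i + 1] = t
        interim[i] = w

    # Final step: interim digit plus incoming transfer never leaves {-1, 0, 1}.
    return [interim[i] + transfer[i] for i in range(n + 1)]


print(sd_add([1, 1, 1], [1, 0, 0]))  # 7 + 1 -> [0, 0, 0, 1], i.e. 8
```

The loop is written sequentially for readability, but no iteration reads a value produced by an earlier iteration other than through `p`, so all positions could be evaluated in parallel. That independence is what makes the step count constant.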

Citations: 17

Challenges of Programming a System with Heterogeneous Memories and Heterogeneous Processors: A Programmer's View

Proceedings of the Second International Symposium on Memory Systems Pub Date : 2016-10-03 DOI: 10.1145/2989081.2989097

Shuai Che, Arkaprava Basu, J. Gallmeier

Citations: 2

HAPPY: Hybrid Address-based Page Policy in DRAMs

Proceedings of the Second International Symposium on Memory Systems Pub Date : 2015-09-12 DOI: 10.1145/2989081.2989101

M. Ghasempour, A. Jaleel, J. Garside, M. Luján

Abstract: Memory controllers have used static page closure policies to decide whether a row should be left open (open-page policy) or closed immediately (close-page policy) after the row has been accessed. The appropriate choice for a particular access can reduce the average memory latency. However, since application access patterns change at run time, static page policies cannot guarantee optimum execution time. Hybrid page policies have been investigated as a means of covering these dynamic scenarios and are now implemented in state-of-the-art processors. Hybrid page policies switch between open-page and close-page policies while the application is running, by monitoring the access pattern of row hits/conflicts and predicting future behavior. Unfortunately, as the size of DRAM memory increases, fine-grain tracking and analysis of memory access patterns becomes impractical. We propose a compact memory address-based encoding technique which can improve or maintain the performance of DRAM page closure predictors while reducing the hardware overhead in comparison with state-of-the-art techniques. As a case study, we integrate our technique, HAPPY, with a state-of-the-art Intel-adaptive monitor (e.g. part of the Intel Xeon X5650) and a traditional hybrid page policy. We evaluate them across 70 memory-intensive workload mixes consisting of single-thread and multi-thread applications. The experimental results show that applying the HAPPY encoding to the Intel-adaptive page closure policy can reduce the hardware overhead by 5x for the evaluated 64 GB memory (up to 40x for a 512 GB memory) while maintaining the prediction accuracy.
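To make the idea of a hybrid page policy concrete, the sketch below models a single DRAM bank with a small saturating counter that learns whether rows tend to be reused. This is a generic illustrative predictor under assumed parameters (a 2-bit counter, a midpoint threshold). It is neither the HAPPY encoding nor the Intel-adaptive policy named in the abstract.

```python
class BankPagePredictor:
    """Hypothetical per-bank predictor that switches between open and close page."""

    def __init__(self, bits: int = 2):
        self.max_count = (1 << bits) - 1
        self.threshold = (self.max_count + 1) // 2
        self.counter = self.threshold  # start weakly in favor of open-page
        self.open_row = None           # row currently held in the row buffer
        self.last_row = None           # most recently accessed row

    def access(self, row: int) -> str:
        """Serve one access; return 'hit', 'conflict', or 'closed'."""
        if self.open_row is None:
            outcome = "closed"    # bank precharged: activate only
        elif self.open_row == row:
            outcome = "hit"       # row buffer hit
        else:
            outcome = "conflict"  # must precharge, then activate

        # Train on row reuse, even when the row had been closed early.
        if self.last_row is not None:
            if row == self.last_row:
                self.counter = min(self.counter + 1, self.max_count)
            else:
                self.counter = max(self.counter - 1, 0)

        # Keep the row open only while reuse looks likely.
        self.open_row = row if self.counter >= self.threshold else None
        self.last_row = row
        return outcome


predictor = BankPagePredictor()
print([predictor.access(r) for r in [5, 5, 7, 9, 3, 3, 3, 3]])
# ['closed', 'hit', 'conflict', 'conflict', 'closed', 'closed', 'closed', 'hit']
```

A predictor like this needs state per tracked region, which is the hardware cost the abstract says grows with memory size and that an address-based encoding aims to shrink.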

Citations: 13