{"title":"Understanding the Impact of Air and Microfluidics Cooling on Performance of 3D Stacked Memory Systems","authors":"S. M. Hassan, S. Yalamanchili","doi":"10.1145/2989081.2989098","DOIUrl":"https://doi.org/10.1145/2989081.2989098","url":null,"abstract":"Three-dimensional stacking has increased the memory bandwidth available to cores allowing sustainable performance improvement through technology generations. However, lower heat removal capability and higher DRAM density in such systems increases their temperature and requires larger number of rows to be refreshed at significantly higher rates. Higher operating temperature prohibits performance scaling by not only decreasing memory bandwidth availability but also reducing core frequency specially in the case where memory is stacked directly on top of the processor die (3D). Liquid cooling using microfluidics technology is a promising solution that keeps the temperature low increasing the operating range of 3D systems, thus allowing sustained performance improvement. This work attempts to understand the impact of temperature on performance and the advantages of using microfluidics technology for continued performance scaling. We show that conventional air cooling solutions limit 3D stacks to work only for memory-intensive applications running at low frequency, whereas microfluidics cooling technology allow them to push their envelope to not only compute intensive domains but also memory-intensive scenarios that can run at significantly higher operating frequencies.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127074478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing the Performance of Hybrid Memory Cube Using ApexMAP Application Probes","authors":"K. Ibrahim, Farzad Fatollahi-Fard, D. Donofrio, J. Shalf","doi":"10.1145/2989081.2989090","DOIUrl":"https://doi.org/10.1145/2989081.2989090","url":null,"abstract":"Full characterization of the performance of a new memory technology is typically a subtle process because of the difficulty in subjecting the memory to different access patterns before creating a full system. Simple performance characterization, such as raw bandwidth, does not give enough information about the suitability of the memory for different architectural design choices, such as suitability for processing in memory, performance reliance on relaxed ordering semantic, or how to implement atomics, etc. This paper discusses the use of the ApexMAP synthetic benchmarks to assess the Hybrid Memory Cube (HMC) technology. ApexMAP, through a simple model for spatial and temporal locality, allows creating many application probes that could be used to subject the memory to different access patterns. We use a Verilog implementation of ApexMAP to show the impact of contending requests, flow control, and access granularity on the HMC performance. We show a wide variation (up to 20×) in the observed performance based on the application locality parameters and the HMC architectural configurations.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127364685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Languages Must Expose Memory Heterogeneity","authors":"Xiaochen Guo, Aviral Shrivastava, Michael F. Spear, Gang Tan","doi":"10.1145/2989081.2989122","DOIUrl":"https://doi.org/10.1145/2989081.2989122","url":null,"abstract":"The last decade has seen an explosion in new and innovative memory technologies. While certain technologies, like transactional memory, have seen adoption at the language level, others, such as sandboxed memory, scratchpad memory, and persistent memory, have not received any systematic programming language support. This is true even though the underlying compiler-level mechanisms for these mechanisms are similar. In this paper, we argue that programming languages must be enhanced to expose heterogeneous memory technologies to programmers, so that they can enjoy the benefits of those technologies and be able to reason about programs that use the advanced features of novel memory technologies. We sketch a language design that allows programmers to specify memory requirements and behaviors, for both data and code. We further describe how a compiler can support such a language and suggest hardware improvements that can improve efficiencies of heterogeneous memories.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"509 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134201348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How Many MLCs Should Impersonate SLCs to Optimize SSD Performance?","authors":"Wei Wang, Wen Pan, T. Xie, Deng Zhou","doi":"10.1145/2989081.2989095","DOIUrl":"https://doi.org/10.1145/2989081.2989095","url":null,"abstract":"Since an MLC (multi-level cell) can be used in an SLC (single-level cell) mode, an MLC-based flash SSD typically uses a fixed small portion (called log partition) in the SLC mode to accommodate hot data so that its overall performance can be improved. In this paper, we show that a fixed capacity of a log partition without considering workload characteristics can lead to an unexpected overall performance degradation. Contrary to intuition, we notice that blindly enlarging the capacity of a log partition would also result in worse performance due to the increased garbage collection cost in a data partition, which serves cold data. How many MLCs should impersonate SLCs under a particular workload to achieve an optimized performance is still an open question. To answer this question, we first measure write costs on each partition and their impact on the overall performance of an SSD. Next, a hardware-validated write cost model is built. Based on the model, we demonstrate that for each workload there always exists an optimal partitioning scheme. Further, to verify the effectiveness of our workload-aware dynamic partitioning strategy, we implement an FTL (flash translation layer) called BROMS (Best Ratio Of MLC to SLC), which adaptively adjusts the capacities of two partitions according to the workload characteristics. Experimental results from a hardware platform show that BROMS outperforms a fixed partitioning scheme by up to 86%.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115402096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliability and Performance Trade-off Study of Heterogeneous Memories","authors":"Manish Gupta, D. Roberts, Mitesh R. Meswani, Vilas Sridharan, D. Tullsen, Rajesh K. Gupta","doi":"10.1145/2989081.2989113","DOIUrl":"https://doi.org/10.1145/2989081.2989113","url":null,"abstract":"Heterogeneous memories, organized as die-stacked in-package and off-package memory, have been a focus of attention by the computer architects to improve memory bandwidth and capacity. Researchers have explored methods and organizations to optimize performance by increasing the access rate to faster die-stacked memory. Unfortunately, reliability of such arrangements has not been studied carefully thus making them less attractive for data centers and mission-critical systems. Field studies show memory reliability depends on device physics as well as on error correction codes (ECC). Due to the capacity, latency, and energy costs of ECC, the performance-critical in-package memories may favor weaker ECC solutions than off-chip. Moreover, these systems are optimized to run at peak performance by increasing access rate to high-performance in-package memory. In this paper, authors use the real-world DRAM failure data to conduct a trade-off study on reliability and performance of Heterogeneous Memory Architectures (HMA). This paper illustrates the problem that an HMA system which only optimizes for performance may suffer from impaired reliability over time. This work also proposes an age-aware access rate control algorithm to ensure reliable operation of long-running systems.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115685927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reverse Engineering of DRAMs: Row Hammer with Crosshair","authors":"Matthias Jung, C. Rheinländer, C. Weis, N. Wehn","doi":"10.1145/2989081.2989114","DOIUrl":"https://doi.org/10.1145/2989081.2989114","url":null,"abstract":"In this paper we present a technique that reconstructs the physical location of memory cells in a Dynamic Random Access Memory (DRAM) without opening the device package and microscoping the device. Our method consists of an retention error analysis while a temperature gradient is applied to the DRAM device. This enables the extraction of the exact neighborhood relation of each single DRAM cell, which can be used to accomplish Row Hammer attacks in a very targeted way. However, this information can also be used to enhance current DRAM retention error models.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114094648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast full system memory checkpointing with SSD-aware memory controller","authors":"Jim Stevens, Paul Tschirhart, B. Jacob","doi":"10.1145/2989081.2989126","DOIUrl":"https://doi.org/10.1145/2989081.2989126","url":null,"abstract":"In this paper, we present a novel memory system checkpointing method that very efficiently stores the complete memory state at a given instant in time to a SSD. Our design relies on a modified memory controller that can issue commands directly to the SSD without relying on system software support and SSD controller firmware that is aware of the checkpoint operation. The checkpoint process occurs in the background while foreground operation is allowed to continue. This efficiency enables our checkpointing mechanism to provide value in various applications including supercomputing, cloud computing, and security.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122942621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nswap2L: Transparently Managing Heterogeneous Cluster Storage Resources for Fast Swapping","authors":"T. Newhall, E. R. Lehman-Borer, Benjamin Marks","doi":"10.1145/2989081.2989107","DOIUrl":"https://doi.org/10.1145/2989081.2989107","url":null,"abstract":"To support data intensive cluster computing, it is increasingly important that node virtual memory (VM) systems make effective use of available fast storage devices for swap or temporary file space. Nswap2L is a novel system that transparently manages a heterogeneous set of storage options commonly found in clusters, including node RAM, disk, flash SSD, PCM, or network storage devices. Nswap2L implements a two-level device driver interface. At the top level, it appears to node operating systems (OSs) as a single, fast, random access device that can be added as a swap partition on cluster nodes. It transparently manages the underlying heterogeneous storage devices, including its own implementation of Network RAM, to which swapped out data are stored. It implements data placement, migration, and prefetching policies that choose which underlying physical devices store swapped-out page data. Its policies incorporate information about device capacity, system load, and the strengths of different physical storage media. By moving device-specific knowledge into Nswap2L, VM policies in the OS can be based solely on typical application access patterns and not on characteristics of underlying physical storage media. Nswap2L's policy decisions are abstracted from the OS, freeing the OS from having to implement specialized policies for different combinations of cluster storage---Nswap2L requires no changes to the OS's VM system. Results of our benchmark tests show that data-intensive applications perform up to 6 times faster on Nswap2L-enabled clusters, and show that our two-level device driver design adds minimal I/O latency to the underlying devices that Nswap2L manages. In addition, we found that even though Nswap2L's Network RAM is faster than any other backing store, its prefetching policy that distributes data over multiple devices results in increased I/O parallelism and can lead to better performance than swapping only to a single underlying device.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130181698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Use of DRAM with Unrepaired Weak Cells in Computing Systems","authors":"Hao Wang, Yin Li, Xuebin Zhang, Xiaoqing Zhao, Hongbin Sun, Tong Zhang","doi":"10.1145/2989081.2989108","DOIUrl":"https://doi.org/10.1145/2989081.2989108","url":null,"abstract":"In current practice, DRAM manufacturers apply redundancy-repair to decommission all the weak cells that cannot satisfy the target data retention time under the worse-case operational conditions (e.g., the highest operating temperature). However, as the DRAM scaling enters sub-20nm regime, it becomes increasingly challenging to repair all the weak cells at reasonable cost. This work studies how one could use DRAM chips with unrepaired weak cells in computing systems. In particular, this work is based upon the simple idea that OS reserves all the error-prone pages, which contain at least one unrepaired weak cell, from being used. Under a relatively high error-prone page rate (e.g., 8%), this basic idea is subject to two issues: (1) Simply reserving all the error-prone pages could make it almost impossible for OS to allocate a continuous fragmentation-free physical memory space for some critical operations such as OS booting and DMA buffering. (2) Since most error-prone pages may only contain few unrepaired weak cells, reserving all the error-prone pages from practical usage could cause noticeable memory resource waste. Aiming to address these issues, this paper presents a controller-based selective page re-mapping strategy to ensure a continuous critical memory region for OS, and develops a software-based memory error tolerance scheme to re-cycle all the error-prone pages for the zRAM function in Linux. Since the first scheme only eliminates the fragmentation in the critical memory region (e.g., 128MB in Linux), the remaining non-critical memory region is still subject to severe fragmentation. Hence, we carried out experiments using SPEC CPU2006 to quantitatively demonstrate that highly fragmented non-critical memory region may not cause significant computing system performance degradation. We further study the latency and hardware cost of implementing the controller-based page re-mapping, and the effectiveness of re-cycling error-prone pages for zRAM in Linux. The experimental results show that our proposed software-based error tolerance scheme degrades the speed performance of zRAM by only up to 7%.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129516712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exposing the Locality of Heterogeneous Memory Architectures to HPC Applications","authors":"Brice Goglin","doi":"10.1145/2989081.2989115","DOIUrl":"https://doi.org/10.1145/2989081.2989115","url":null,"abstract":"High-performance computing requires a deep knowledge of the hardware platform to fully exploit its computing power. The performance of data transfer between cores and memory is becoming critical. Therefore locality is a major area of optimization on the road to exascale. Indeed, tasks and data have to be carefully distributed on the computing and memory resources. We discuss the current way to expose processor and memory locality information in the Linux kernel and in user-space libraries such as the hwloc software project. The current de facto standard structural modeling of the platform as the tree is not perfect, but it offers a good compromise between precision and convenience for HPC runtimes. We present an in-depth study of the software view of the upcoming Intel Knights Landing processor. Its memory locality cannot be properly exposed to user-space applications without a significant rework of the current software stack. We propose an extension of the current hierarchical platform model in hwloc. It correctly exposes new heterogeneous architectures with high-bandwidth or non-volatile memories to applications, while still being convenient for affinity-aware HPC runtimes.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130625915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}