{"title":"DRAMPersist: Making DRAM Systems Persistent","authors":"Krishna T. Malladi, M. Awasthi, Hongzhong Zheng","doi":"10.1145/2989081.2989110","DOIUrl":"https://doi.org/10.1145/2989081.2989110","url":null,"abstract":"Modern applications exercise main memory systems in different ways. A lot of scale-out, in-memory applications exploit a number of desirable properties provided by DRAM such as high capacity, low latency and high bandwidth. Although DRAM technology continues to scale aggressively, new resistive memory technologies are on the horizon, promising scalability, density and non-volatility. However, they still suffer from longer, asymmetric read-write latencies and have lower endurance as compared to DRAM. Considering these factors, scale-out, distributed applications will benefit greatly from main memory architectures that provide the non-volatility of new memory technologies, but still have DRAM-like latencies. To that end, we introduce DRAMPersist -- a novel mechanism to make main memory persistent and complement existing high speed storage, specifically geared for scale-out systems.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125822028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ConGen: An Application Specific DRAM Memory Controller Generator","authors":"Matthias Jung, Deepak M. Mathew, C. Weis, N. Wehn, Irene Heinrich, Marco V. Natale, S. O. Krumke","doi":"10.1145/2989081.2989131","DOIUrl":"https://doi.org/10.1145/2989081.2989131","url":null,"abstract":"The increasing gap between the bandwidth requirements of modern Systems on Chip (SoC) and the I/O data rate delivered by Dynamic Random Access Memory (DRAM), known as the Memory Wall, limits the performance of today's data-intensive applications. General purpose memory controllers use online scheduling techniques in order to increase the memory bandwidth. Due to a limited buffer depth they only have a local view on the executed application. However, numerous applications possess regular or fixed memory access patterns, which are not yet exploited to overcome the memory wall. In this paper, we present a holistic methodology to generate an Application Specific Memory Controller (ASMC), which has a global view on the application and utilizes application knowledge to decrease the energy and increase the bandwidth. To generate an ASMC we analyze the DRAM access pattern of the application offline and generate a custom address mapping by solving a combinatorial sequence partitioning problem.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115191025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Tag-Bit Memory Operations in Hybrid Memory Cubes","authors":"John D. Leidel, Yong Chen","doi":"10.1145/2989081.2989105","DOIUrl":"https://doi.org/10.1145/2989081.2989105","url":null,"abstract":"The recent advances in multi-dimensional or stacked memory devices have led to a significant resurgence in research and effort associated with exploring more expressive memory operations in order to improve application throughput. The goal of these efforts is to provide memory operations in the logic layer of a stacked device that provide pseudo processing near memory capabilities to reduce the bandwidth required to perform common operations across concurrent applications. One such area of concern in applications is the ability to provide high performance, low latency mutexes and associated barrier synchronization techniques. Previous attempts at performing cache-based mutex optimization and tiered barrier synchronization provide some degree of application speedup, but still induce sub-optimal scenarios such as cache line contention and large degrees of message traffic. However, several previous architectures have presented techniques that extend the core physical address storage with additional, more expressive bit storage in order to provide fine-grained concurrency mechanisms in hardware. This work presents a novel methodology and associated implementation for providing in-situ extended memory operations in an HMC Gen2 device. The methodology provides a single lock, or tag bit for every 64-bit word in memory using the in-situ storage. Further, we present an address inversion technique that enables the tag-bit operations to execute their respective read-arbitrate-commit operations concurrently with a statistically low collision between the tag-bit storage and the data storage. We conclude this work with results from utilizing the commands to perform a traditional multi-threaded mutex algorithm as well as a multi-threaded static tree barrier that exhibit sub-linear scaling.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114732597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TAPAS: Temperature-aware Adaptive Placement for 3D Stacked Hybrid Caches","authors":"Majed Valad Beigi, G. Memik","doi":"10.1145/2989081.2989085","DOIUrl":"https://doi.org/10.1145/2989081.2989085","url":null,"abstract":"3D integration enables large last level caches (LLCs) to be stacked onto a die. In addition, emerging Non Volatile Memories (NVMs) such as Spin-Torque Transfer RAM (STT-RAM) have been explored as a replacement for traditional SRAM-based LLCs due to their higher density and lower leakage power. In this paper, we aim to use the benefits of the integration of STT-RAM in a 3D multi-core environment. The main challenge we try to address is the high operating temperatures. The higher power density of 3D ICs might incur temperature-related problems in reliability, power consumption, and performance. Specifically, recent works have shown that elevated operating temperatures can adversely impact STT-RAM performance. To alleviate the temperature-induced problems, we propose TAPAS, a low-cost temperature-aware adaptive block placement and migration policy, for a hybrid LLC that includes STT-RAM and SRAM structures. This technique places cache blocks according to their temperature characteristics. Specifically, the cache blocks that heat up a hot bank are recognized and migrated to a cooler bank to 1) enable those blocks to get accessed in a cooler bank with lower read/write latency and 2) reduce the number of accesses to the hotter bank. We design and evaluate a novel flow control mechanism to assign priorities to those cache blocks to reach their destination. Evaluation results reveal that TAPAS achieves, on average, 11.6% performance improvement, 6.5% power, and 5.6°C peak temperature reduction compared to a state-of-the art hybrid cache design.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"291 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114383615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low Latency, High Bisection-Bandwidth Networks for Exascale Memory Systems","authors":"Shang Li, Po-Chun Huang, D. Banks, Max DePalma, A. Elshaarany, K. Hemmert, Arun Rodrigues, E. Ruppel, Yitian Wang, Jim Ang, B. Jacob","doi":"10.1145/2989081.2989130","DOIUrl":"https://doi.org/10.1145/2989081.2989130","url":null,"abstract":"Data movement is the limiting factor in modern supercomputing systems, as system performance drops by several orders of magnitude whenever applications need to move data. Therefore, focusing on low latency (e.g., low diameter) networks that also have high bisection bandwidth is critical. We present a cost/performance analysis of a wide range of high-radix interconnect topologies, in terms of bisection widths, average hop counts, and the port costs required to achieve those metrics. We study variants of traditional topologies as well as one novel topology. We identify several designs that have reasonable port costs and can scale to hundreds of thousands, perhaps millions, of nodes with maximum latencies as low as two network hops and high bisection bandwidths.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116372523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Replacement Policies for Heterogeneous Memories","authors":"Jacob Brock, Chencheng Ye, C. Ding","doi":"10.1145/2989081.2989123","DOIUrl":"https://doi.org/10.1145/2989081.2989123","url":null,"abstract":"As non-volatile memory is introduced alongside traditional memory, new mechanisms for managing memory are becoming necessary. In this paper, we propose the two variable-space heterogeneous VMIN (H-VMIN) and heterogeneous WS (H-WS) policies for flat DRAM-PCM heterogeneous architectures, which derive from the earlier VMIN and WS policies. After a page reference, H-VMIN keeps the page in DRAM/PCM/disk based on the time until its next access. It is optimal, but requires future information. H-WS keeps the page in DRAM for a certain time and then in PCM for a longer time if it has still not been reused, and finally evicts it to disk.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115711486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Metric to Measure Cache Utilization for HPC Workloads","authors":"Aditya M. Deshpande, J. Draper","doi":"10.1145/2989081.2989125","DOIUrl":"https://doi.org/10.1145/2989081.2989125","url":null,"abstract":"High performance computing (HPC) systems continue to add cores and memory to keep pace with increases in data processing needs, resulting in increased data movement across the memory hierarchy. With these systems becoming more and more energy constrained, data movement costs in terms of energy and performance cannot be neglected. Conventional techniques for modeling and analyzing data movement across the memory hierarchy have proven to be inadequate in helping computer architects and system designers to optimize data movement. In this work, we present modeling approaches to help capture and better understand cache utilization in the various levels of the memory hierarchy. We define a new metric, average cache references per evictions (ACRE), as a measure of cache utilization. We observed that the ACRE values for L1 cache varies from 18 to 210 for Mantevo miniapps and from 11 to 55 for GraphBIG benchmarks. ACRE values for L2/L3 caches were observed to be around 1 for all benchmarks. Such cache utilization metrics provide more meaningful insights about the data movement occurring across the memory hierarchy, enabling computer architects and system designers to better manage and minimize data movement and in turn reduce energy and even improve performance.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126581112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dense Footprint Cache: Capacity-Efficient Die-Stacked DRAM Last Level Cache","authors":"Seunghee Shin, Sihong Kim, Yan Solihin","doi":"10.1145/2989081.2989096","DOIUrl":"https://doi.org/10.1145/2989081.2989096","url":null,"abstract":"Die-stacked DRAM technology enables a large Last Level Cache (LLC) that provides high bandwidth data access to the processor. However, it requires a large tag array that may take a significant portion of the on-chip SRAM budget. To reduce this SRAM overhead, systems like Intel Haswell relies on a large block (Mblock) size. One drawback of a large Mblock size is that many bytes of an Mblock are not needed by the processor but are fetched into the cache. A recent technique (Footprint cache) to solve this problem works by dividing the Mblock into smaller blocks where only blocks predicted to be needed by the processor are brought into the LLC. While it helps to alleviate the excessive bandwidth consumption from fetching unneeded blocks, the capacity waste remains: only blocks that are predicted useful are fetched and allocated, and the remaining area of the Mblock is left empty, creating holes. Unfortunately, holes create significant capacity overheads which could have been used for useful data, hence wasted refresh power on useless data. In this paper, we propose a new design, Dense Footprint Cache (DFC). Similar to Footprint cache, DFC uses a large Mblock and relies on useful block prediction in order to reduce memory bandwidth consumption. However, when blocks of an Mblock are fetched, the blocks are placed contiguously in the cache, thereby eliminating holes, increasing capacity and power efficiency, and increasing performance. Mblocks in DFC have variable sizes and a cache set has a variable associativity, hence it presents new challenges in designing its management policies (placement, replacement, and update). Through simulation of Big Data applications, we show that DFC reduces LLC miss ratios by about 43%, speeds up applications by 9.5%, while consuming 4.3% less energy on average.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123241821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving DRAM Bandwidth Utilization with MLP-Aware OS Paging","authors":"Rishiraj A. Bheda, T. Conte, J. Vetter","doi":"10.1145/2989081.2989094","DOIUrl":"https://doi.org/10.1145/2989081.2989094","url":null,"abstract":"Optimal use of available memory bank-level parallelism and channel bandwidth heavily impacts the performance of an application. Research studies have focused on improving bandwidth utilization by employing scheduling policies and request re-ordering techniques at the memory controller. However, potential to extract memory performance by intelligent page allocation that maximizes opportunity for bank-level parallelism and row buffer hits is often overlooked. The actual physical page location in memory has a huge impact on bank conflicts and potential for prioritizing low-latency requests such as row buffer hits. We demonstrate that with more intelligent virtual to physical paging mechanisms it is possible to reduce bank conflicts at the memory and achieve higher bandwidth utilization. Such intelligent paging mechanisms can then form a basis for other request re-ordering techniques to further improve memory performance. In this study we only focus on virtual-to-physical paging techniques and demonstrate 38.4% improvement on DRAM bandwidth utilization with a profile-based scheme. We study a wide variety of workloads from varied benchmark suites. We present results for profile based as well as preliminary results for dynamically adaptive paging techniques. Our results demonstrate improved bandwidth utilization with DRAM aware page layouts. Dynamic paging schemes further demonstrate the potential of run-time adaptive techniques in improving bandwidth utilization of increasingly parallel multi-channel main memory systems.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129783721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Write Locality and Optimization for Persistent Memory","authors":"Dong Chen, Chencheng Ye, C. Ding","doi":"10.1145/2989081.2989119","DOIUrl":"https://doi.org/10.1145/2989081.2989119","url":null,"abstract":"Persistent memory is a disruptive technology that drastically reduces memory cost and static power but introduces the problems of slow writes and limited write endurance. An effective solution is caching. However, existing cache has been designed for fast reads. It does not minimize the number of writebacks from cache to memory. In this paper, we propose a metric to quantify the write locality and a theory to analyze and optimize write locality. It includes a linear-time algorithm to predict the write-back frequency for all cache sizes. In shared cache, it predicts the number of writebacks for co-run programs based on sole-run profiling. The paper evaluates the accuracy of the prediction against cache simulation. It then uses the theory to optimize write locality in a set of co-run programs in shared cache by cache partitioning. The theory predicts that such write-locality optimization can reduce the number of writebacks by 12% to 35%, compared to uncontrolled cache sharing.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125387315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}