{"title":"DRAMPersist: Making DRAM Systems Persistent","authors":"Krishna T. Malladi, M. Awasthi, Hongzhong Zheng","doi":"10.1145/2989081.2989110","DOIUrl":"https://doi.org/10.1145/2989081.2989110","url":null,"abstract":"Modern applications exercise main memory systems in different ways. A lot of scale-out, in-memory applications exploit a number of desirable properties provided by DRAM such as high capacity, low latency and high bandwidth. Although DRAM technology continues to scale aggressively, new resistive memory technologies are on the horizon, promising scalability, density and non-volatility. However, they still suffer from longer, asymmetric read-write latencies and have lower endurance as compared to DRAM. Considering these factors, scale-out, distributed applications will benefit greatly from main memory architectures that provide the non-volatility of new memory technologies, but still have DRAM-like latencies. To that end, we introduce DRAMPersist -- a novel mechanism to make main memory persistent and complement existing high speed storage, specifically geared for scale-out systems.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125822028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ConGen: An Application Specific DRAM Memory Controller Generator","authors":"Matthias Jung, Deepak M. Mathew, C. Weis, N. Wehn, Irene Heinrich, Marco V. Natale, S. O. Krumke","doi":"10.1145/2989081.2989131","DOIUrl":"https://doi.org/10.1145/2989081.2989131","url":null,"abstract":"The increasing gap between the bandwidth requirements of modern Systems on Chip (SoC) and the I/O data rate delivered by Dynamic Random Access Memory (DRAM), known as the Memory Wall, limits the performance of today's data-intensive applications. General purpose memory controllers use online scheduling techniques in order to increase the memory bandwidth. Due to a limited buffer depth they only have a local view on the executed application. However, numerous applications possess regular or fixed memory access patterns, which are not yet exploited to overcome the memory wall. In this paper, we present a holistic methodology to generate an Application Specific Memory Controller (ASMC), which has a global view on the application and utilizes application knowledge to decrease the energy and increase the bandwidth. To generate an ASMC we analyze the DRAM access pattern of the application offline and generate a custom address mapping by solving a combinatorial sequence partitioning problem.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115191025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Tag-Bit Memory Operations in Hybrid Memory Cubes","authors":"John D. Leidel, Yong Chen","doi":"10.1145/2989081.2989105","DOIUrl":"https://doi.org/10.1145/2989081.2989105","url":null,"abstract":"The recent advances in multi-dimensional or stacked memory devices have led to a significant resurgence in research and effort associated with exploring more expressive memory operations in order to improve application throughput. The goal of these efforts is to provide memory operations in the logic layer of a stacked device that provide pseudo processing near memory capabilities to reduce the bandwidth required to perform common operations across concurrent applications. One such area of concern in applications is the ability to provide high performance, low latency mutexes and associated barrier synchronization techniques. Previous attempts at performing cache-based mutex optimization and tiered barrier synchronization provide some degree of application speedup, but still induce sub-optimal scenarios such as cache line contention and large degrees of message traffic. However, several previous architectures have presented techniques that extend the core physical address storage with additional, more expressive bit storage in order to provide fine-grained concurrency mechanisms in hardware. This work presents a novel methodology and associated implementation for providing in-situ extended memory operations in an HMC Gen2 device. The methodology provides a single lock, or tag bit for every 64-bit word in memory using the in-situ storage. Further, we present an address inversion technique that enables the tag-bit operations to execute their respective read-arbitrate-commit operations concurrently with a statistically low collision between the tag-bit storage and the data storage. We conclude this work with results from utilizing the commands to perform a traditional multi-threaded mutex algorithm as well as a multi-threaded static tree barrier that exhibit sub-linear scaling.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114732597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TAPAS: Temperature-aware Adaptive Placement for 3D Stacked Hybrid Caches","authors":"Majed Valad Beigi, G. Memik","doi":"10.1145/2989081.2989085","DOIUrl":"https://doi.org/10.1145/2989081.2989085","url":null,"abstract":"3D integration enables large last level caches (LLCs) to be stacked onto a die. In addition, emerging Non Volatile Memories (NVMs) such as Spin-Torque Transfer RAM (STT-RAM) have been explored as a replacement for traditional SRAM-based LLCs due to their higher density and lower leakage power. In this paper, we aim to use the benefits of the integration of STT-RAM in a 3D multi-core environment. The main challenge we try to address is the high operating temperatures. The higher power density of 3D ICs might incur temperature-related problems in reliability, power consumption, and performance. Specifically, recent works have shown that elevated operating temperatures can adversely impact STT-RAM performance. To alleviate the temperature-induced problems, we propose TAPAS, a low-cost temperature-aware adaptive block placement and migration policy, for a hybrid LLC that includes STT-RAM and SRAM structures. This technique places cache blocks according to their temperature characteristics. Specifically, the cache blocks that heat up a hot bank are recognized and migrated to a cooler bank to 1) enable those blocks to get accessed in a cooler bank with lower read/write latency and 2) reduce the number of accesses to the hotter bank. We design and evaluate a novel flow control mechanism to assign priorities to those cache blocks to reach their destination. Evaluation results reveal that TAPAS achieves, on average, 11.6% performance improvement, 6.5% power, and 5.6°C peak temperature reduction compared to a state-of-the art hybrid cache design.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"291 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114383615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low Latency, High Bisection-Bandwidth Networks for Exascale Memory Systems","authors":"Shang Li, Po-Chun Huang, D. Banks, Max DePalma, A. Elshaarany, K. Hemmert, Arun Rodrigues, E. Ruppel, Yitian Wang, Jim Ang, B. Jacob","doi":"10.1145/2989081.2989130","DOIUrl":"https://doi.org/10.1145/2989081.2989130","url":null,"abstract":"Data movement is the limiting factor in modern supercomputing systems, as system performance drops by several orders of magnitude whenever applications need to move data. Therefore, focusing on low latency (e.g., low diameter) networks that also have high bisection bandwidth is critical. We present a cost/performance analysis of a wide range of high-radix interconnect topologies, in terms of bisection widths, average hop counts, and the port costs required to achieve those metrics. We study variants of traditional topologies as well as one novel topology. We identify several designs that have reasonable port costs and can scale to hundreds of thousands, perhaps millions, of nodes with maximum latencies as low as two network hops and high bisection bandwidths.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116372523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Replacement Policies for Heterogeneous Memories","authors":"Jacob Brock, Chencheng Ye, C. Ding","doi":"10.1145/2989081.2989123","DOIUrl":"https://doi.org/10.1145/2989081.2989123","url":null,"abstract":"As non-volatile memory is introduced alongside traditional memory, new mechanisms for managing memory are becoming necessary. In this paper, we propose the two variable-space heterogeneous VMIN (H-VMIN) and heterogeneous WS (H-WS) policies for flat DRAM-PCM heterogeneous architectures, which derive from the earlier VMIN and WS policies. After a page reference, H-VMIN keeps the page in DRAM/PCM/disk based on the time until its next access. It is optimal, but requires future information. H-WS keeps the page in DRAM for a certain time and then in PCM for a longer time if it has still not been reused, and finally evicts it to disk.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115711486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Metric to Measure Cache Utilization for HPC Workloads","authors":"Aditya M. Deshpande, J. Draper","doi":"10.1145/2989081.2989125","DOIUrl":"https://doi.org/10.1145/2989081.2989125","url":null,"abstract":"High performance computing (HPC) systems continue to add cores and memory to keep pace with increases in data processing needs, resulting in increased data movement across the memory hierarchy. With these systems becoming more and more energy constrained, data movement costs in terms of energy and performance cannot be neglected. Conventional techniques for modeling and analyzing data movement across the memory hierarchy have proven to be inadequate in helping computer architects and system designers to optimize data movement. In this work, we present modeling approaches to help capture and better understand cache utilization in the various levels of the memory hierarchy. We define a new metric, average cache references per evictions (ACRE), as a measure of cache utilization. We observed that the ACRE values for L1 cache varies from 18 to 210 for Mantevo miniapps and from 11 to 55 for GraphBIG benchmarks. ACRE values for L2/L3 caches were observed to be around 1 for all benchmarks. Such cache utilization metrics provide more meaningful insights about the data movement occurring across the memory hierarchy, enabling computer architects and system designers to better manage and minimize data movement and in turn reduce energy and even improve performance.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126581112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dense Footprint Cache: Capacity-Efficient Die-Stacked DRAM Last Level Cache","authors":"Seunghee Shin, Sihong Kim, Yan Solihin","doi":"10.1145/2989081.2989096","DOIUrl":"https://doi.org/10.1145/2989081.2989096","url":null,"abstract":"Die-stacked DRAM technology enables a large Last Level Cache (LLC) that provides high bandwidth data access to the processor. However, it requires a large tag array that may take a significant portion of the on-chip SRAM budget. To reduce this SRAM overhead, systems like Intel Haswell relies on a large block (Mblock) size. One drawback of a large Mblock size is that many bytes of an Mblock are not needed by the processor but are fetched into the cache. A recent technique (Footprint cache) to solve this problem works by dividing the Mblock into smaller blocks where only blocks predicted to be needed by the processor are brought into the LLC. While it helps to alleviate the excessive bandwidth consumption from fetching unneeded blocks, the capacity waste remains: only blocks that are predicted useful are fetched and allocated, and the remaining area of the Mblock is left empty, creating holes. Unfortunately, holes create significant capacity overheads which could have been used for useful data, hence wasted refresh power on useless data. In this paper, we propose a new design, Dense Footprint Cache (DFC). Similar to Footprint cache, DFC uses a large Mblock and relies on useful block prediction in order to reduce memory bandwidth consumption. However, when blocks of an Mblock are fetched, the blocks are placed contiguously in the cache, thereby eliminating holes, increasing capacity and power efficiency, and increasing performance. Mblocks in DFC have variable sizes and a cache set has a variable associativity, hence it presents new challenges in designing its management policies (placement, replacement, and update). Through simulation of Big Data applications, we show that DFC reduces LLC miss ratios by about 43%, speeds up applications by 9.5%, while consuming 4.3% less energy on average.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123241821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving DRAM Bandwidth Utilization with MLP-Aware OS Paging","authors":"Rishiraj A. Bheda, T. Conte, J. Vetter","doi":"10.1145/2989081.2989094","DOIUrl":"https://doi.org/10.1145/2989081.2989094","url":null,"abstract":"Optimal use of available memory bank-level parallelism and channel bandwidth heavily impacts the performance of an application. Research studies have focused on improving bandwidth utilization by employing scheduling policies and request re-ordering techniques at the memory controller. However, potential to extract memory performance by intelligent page allocation that maximizes opportunity for bank-level parallelism and row buffer hits is often overlooked. The actual physical page location in memory has a huge impact on bank conflicts and potential for prioritizing low-latency requests such as row buffer hits. We demonstrate that with more intelligent virtual to physical paging mechanisms it is possible to reduce bank conflicts at the memory and achieve higher bandwidth utilization. Such intelligent paging mechanisms can then form a basis for other request re-ordering techniques to further improve memory performance. In this study we only focus on virtual-to-physical paging techniques and demonstrate 38.4% improvement on DRAM bandwidth utilization with a profile-based scheme. We study a wide variety of workloads from varied benchmark suites. We present results for profile based as well as preliminary results for dynamically adaptive paging techniques. Our results demonstrate improved bandwidth utilization with DRAM aware page layouts. Dynamic paging schemes further demonstrate the potential of run-time adaptive techniques in improving bandwidth utilization of increasingly parallel multi-channel main memory systems.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129783721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Write Locality and Optimization for Persistent Memory","authors":"Dong Chen, Chencheng Ye, C. Ding","doi":"10.1145/2989081.2989119","DOIUrl":"https://doi.org/10.1145/2989081.2989119","url":null,"abstract":"Persistent memory is a disruptive technology that drastically reduces memory cost and static power but introduces the problems of slow writes and limited write endurance. An effective solution is caching. However, existing cache has been designed for fast reads. It does not minimize the number of writebacks from cache to memory. In this paper, we propose a metric to quantify the write locality and a theory to analyze and optimize write locality. It includes a linear-time algorithm to predict the write-back frequency for all cache sizes. In shared cache, it predicts the number of writebacks for co-run programs based on sole-run profiling. The paper evaluates the accuracy of the prediction against cache simulation. It then uses the theory to optimize write locality in a set of co-run programs in shared cache by cache partitioning. The theory predicts that such write-locality optimization can reduce the number of writebacks by 12% to 35%, compared to uncontrolled cache sharing.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125387315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}