{"title":"Near Data Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore","authors":"S. M. Hassan, S. Yalamanchili, S. Mukhopadhyay","doi":"10.1145/2818950.2818952","DOIUrl":"https://doi.org/10.1145/2818950.2818952","url":null,"abstract":"A promising recent development that can provide continued scaling of performance is the ability to stack multiple DRAM layers on a multi-core processor die. This paper analyzes the interaction between the interconnection network and the memory hierarchy in such systems, and its impact on system performance. We explore the design considerations of a 3D system with DRAM-on-processor stacking and note that full advantages of 3D can only be achieved by configuring the memory with high number of channels. This significantly increases memory level parallelism which results in decreasing the traffic per DRAM bank, reducing their queuing delays, but increasing it on the interconnection network, making remote accesses expensive. To reduce the latency and traffic on the network, we propose restructuring the memory hierarchy to a memory-side cache organization and also explore the effects of various address translations and OS page allocation strategies. Our results indicate that a carefully designed 3D memory system can already improve performance by 25-35% without looking towards new sophisticated techniques.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126881457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Opportunities to Upgrade Main Memory","authors":"D. Resnick","doi":"10.1145/2818950.2818960","DOIUrl":"https://doi.org/10.1145/2818950.2818960","url":null,"abstract":"Hybrid Memory Cube (HMC), in production by Micron Technology, is a new DRAM component that has multiple advantages over current parts including higher bandwidth, lower energy, abstract and more pin efficient interface and other benefits. The memory technology can be used as a base for even further improvements, including upgrading memory scalability to multiple terabytes and terabyte per second bandwidths per processor and resilience such that even large supercomputers with 100s of petabytes of memory will have reliable memory systems. Future systems, from desktops up, will have memory systems of multiple levels, including DRAM and non-volatile (NAND?) components that are both first-level memory capabilities, along with DRAM or SRAM scratch memory such that total data motion is greatly reduced. The result can be improved system performance and reduced system power.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134085878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Writing without Disturb on Phase Change Memories by Integrating Coding and Layout Design","authors":"A. Eslami, Alfredo J. Velasco, Alireza Vahid, Georgios Mappouras, A. Calderbank, Daniel J. Sorin","doi":"10.1145/2818950.2818962","DOIUrl":"https://doi.org/10.1145/2818950.2818962","url":null,"abstract":"We integrate coding techniques and layout design to eliminate write-disturb in phase change memories (PCMs), while enhancing lifetime and host-visible capacity. We first propose a checkerboard configuration for cell layout to eliminate write-disturb while doubling the memory lifetime. We then introduce two methods to jointly design Write-Once-Memory (WOM) codes and layout. The first WOM-layout design improves the lifetime by more than double without compromising the host-visible capacity. The second design applies WOM codes to even more dense layouts to achieve both lifetime and capacity gains. The constructions demonstrate that substantial improvements to lifetime and host-visible capacity are possible by co-designing coding and cell layout in PCM.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129399825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bringing Modern Hierarchical Memory Systems Into Focus: A study of architecture and workload factors on system performance","authors":"Paul Tschirhart, Jim Stevens, Zeshan A. Chishti, Shih-Lien Lu, B. Jacob","doi":"10.1145/2818950.2818975","DOIUrl":"https://doi.org/10.1145/2818950.2818975","url":null,"abstract":"The increasing size of workloads has led to the development of new technologies and architectures that are intended to help address the capacity limitations of DRAM main memories. The proposed solutions fall into two categories: those that re-engineer Flash-based SSDs to further improve storage system performance and those that incorporate non-volatile technology into a Hybrid main memory system. These developments have blurred the line between the storage and memory systems. In this paper, we examine the differences between these two approaches to gain insight into the types of applications and memory technologies that benefit the most from these different architectural approaches. In particular this work utilizes full system simulation to examine the impact of workload randomness on system performance, the impact of backing store latency on system performance, and how the different implementations utilize system resources differently. We find that the software overhead incurred by storage based implementations can account for almost 50% of the overall access latency. As a result, backing store technologies that have an access latency up to 25 microseconds tend to perform better when implemented as part of the main memory system. We also see that high degrees of random access can exacerbate the software overhead problem and lead to large performance advantages for the Hybrid main memory approach. Meanwhile, the page replacement algorithm utilized by the OS in the storage approach results in considerably better performance on highly sequential workloads at the cost of greater pressure on the cache.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133346276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Opportunities and Challenges of Performing Vector Operations inside the DRAM","authors":"M. Alves, P. C. Santos, M. Diener, L. Carro","doi":"10.1145/2818950.2818953","DOIUrl":"https://doi.org/10.1145/2818950.2818953","url":null,"abstract":"In order to overcome the low memory bandwidth and the high energy costs associated with the data transfer between the processor and the main memory, proposals on near-data computing started to gain acceptance in systems ranging from embedded architectures to high performance computing. The main previous approaches propose application specific hardware or require a large amount of logic. Moreover, most proposals require algorithm changes and do not make use of the full parallelism available on the DRAM devices. These issues limits the adoption and the performance of near-data computing. In this paper, we propose to implement vector instructions directly inside the DRAM devices, which we call the Memory Vector Extensions (MVX). This balanced approach reduces data movement between the DRAM to the processor while requiring a low amount of hardware to achieve good performance. Comparing to current vector operations present on processors, our proposal enable performance gains of up to 97x and reduces the energy consumption by up to 70x of the full system.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"324 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115844387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Achieving Yield, Density and Performance Effective DRAM at Extreme Technology Sizes","authors":"B. Childers, Jun Yang, Youtao Zhang","doi":"10.1145/2818950.2818963","DOIUrl":"https://doi.org/10.1145/2818950.2818963","url":null,"abstract":"For over forty years, DRAM has been the most compelling choice for main memory. It is a well understood commodity technology that strikes an ideal balance between cost, performance, capacity and energy. Yet, as DRAM scales to the extremes of deep submicron technology, it faces a critical challenge with the impact of process variation (PV) on chip yield: PV in the transistor and capacitor used to hold a bit of information, along with other components, can cause critical requirements to be violated, including retention capability, cell reliability and operational timing. The challenges of retention and reliability are well known. However, the latter challenge has received significantly less attention---the impact of operational timing violations due to PV on DRAM yield. This challenge stands as an equal to the others in achieving sufficient yield for continued commodity production of DRAM. In this paper, we argue that timing requirements must be relaxed and exposed on a per-location basis for management by the memory sub-system architecture to overcome the challenge to yield from timing. This \"soft yield\" approach trades exposed timing variability for enhanced yield, without harming chip density. Because relaxing and exposing variable timing can lead to application performance loss, a suite of techniques must be developed by the architecture community to mitigate the loss. We raise awareness of this problem and suggest directions where solutions may be found.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"216 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128168891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy Efficient Scale-In Clusters with In-Storage Processing for Big-Data Analytics","authors":"I. Choi, Yang-Suk Kee","doi":"10.1145/2818950.2818983","DOIUrl":"https://doi.org/10.1145/2818950.2818983","url":null,"abstract":"Big data drives a computing paradigm shift. Due to enormous data volumes, data-intensive programming frameworks are pervasive and scale-out clusters are widespread. As a result, data-movement energy dominates overall energy consumption and this will get worse with a technology scaling. We propose scale-in clusters with In-Storage Processing (ISP) devices that would enable energy efficient computing for big-data analytics. ISP devices eliminate/reduce data movements towards CPUs and execute tasks more energy-efficiently. Thus, with energy efficient computing near data and higher throughput enabled, clusters with ISP can achieve more than quadruple energy efficiency with fewer number of nodes as compared to the energy efficiency of similarly performing its counter-part scale-out clusters.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130023882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Herniated Hash Tables: Exploiting Multi-Level Phase Change Memory for In-Place Data Expansion","authors":"Zhaoxia Deng, Lunkai Zhang, D. Franklin, F. Chong","doi":"10.1145/2818950.2818981","DOIUrl":"https://doi.org/10.1145/2818950.2818981","url":null,"abstract":"Hash tables are a commonly used data structure used in many algorithms and applications. As applications and data scale, the efficient implementation of hash tables becomes increasingly important and challenging. In particular, memory capacity becomes increasingly important and entries can become asymmetrically chained across hash buckets. This chaining prevents two forms of parallelism: memory-level parallelism (allowing multiple prefetch requests to overlap) and memory-computation parallelism (allowing computation to overlap memory operations). We propose, herniated hash tables, a technique that exploits multi-level phase change memory (PCM) storage to expand storage at each hash bucket and increase parallelism without increasing physical space. The technique works by increasing the number of bits stored within the same resistance range of an individual PCM cell. We pack more data into the same bit by decreasing noise margins, and we pay for this higher density with higher latency reads and writes that resolve the more accurate resistance values. Furthermore, our organization, coupled with an addressing and prefetching scheme, increases memory parallelism of the herniated datastructure. We simulate our system with a variety of hash table applications and evaluate the density and performance benefits in comparison to a number of baseline systems. Compared with conventional chained hash tables on single-level PCM, herniated hash tables can achieve 4.8x density on a 4-level PCM while achieving up to 67% performance improvement.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127914061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIMT-based Logic Layers for Stacked DRAM Architectures: A Prototype","authors":"C. Kersey, S. Yalamanchili, Hyesoon Kim","doi":"10.1145/2818950.2818954","DOIUrl":"https://doi.org/10.1145/2818950.2818954","url":null,"abstract":"Stacked DRAM products are now available, and the likelihood of future products combining DRAM stacks with custom logic layers seems high. The near-memory processor in such a system will have to be energy efficient, latency tolerant, and capable of exploiting both high memory-level parallelism and high memory bandwidth. We believe that single-instruction-multiple-thread (SIMT) processors are uniquely suited to this task, and for the purpose of evaluating this claim have produced an FPGA-based prototype.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126091851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inefficiencies in the Cache Hierarchy: A Sensitivity Study of Cacheline Size with Mobile Workloads","authors":"A. Laer, William Wang, C. D. Emmons","doi":"10.1145/2818950.2818980","DOIUrl":"https://doi.org/10.1145/2818950.2818980","url":null,"abstract":"With the rising number of cores in mobile devices, the cache hierarchy in mobile application processors gets deeper, and the cache size gets bigger. However, the cacheline size remained relatively constant over the last decade in mobile application processors. In this work, we investigate whether the cacheline size in mobile application processors is due for a refresh, by looking at inefficiencies in the cache hierarchy which tend to be exacerbated when increasing the cacheline size: false sharing and cacheline utilization. Firstly, we look at false sharing, which is more likely to arise at larger cacheline sizes and can severely impact performance. False sharing occurs when non-shared data structures, mapped onto the same cacheline, are being accessed by threads running on different cores, causing avoidable invalidations and subsequent misses. False sharing has been found in various places such as scientific workloads and real applications. We find that whilst increasing the cacheline size does increase false sharing, it still is negligible when compared to known cases of false sharing in scientific workloads, due to the limited level of thread-level parallelism in mobile workloads. Secondly, we look at cacheline utilization which measures the number of bytes in a cacheline actually used by the processor. This effect has been investigated under various names for a multitude of server and desktop applications. As a low cacheline utilization implies that very little of the fetched cachelines was used by the processor, this causes waste in bandwidth and energy in moving data across the memory hierarchy. The energy cost associated with data movements is much higher compared to logic operations, increasing the need for cache efficiency, especially in the case of an energy-constrained platform like a mobile device. We find that the cacheline utilization of mobile workloads is low in general, decreasing when increasing the cacheline size. When increasing the cacheline size from 64 bytes to 128 bytes, the number of misses will be reduced by 10%--30%, depending on the workload. However, because of the low cacheline utilization, this more than doubles the amount of unused traffic to the L1 caches. Using the cacheline utilization as a metric in this way, illustrates an important point. If a change in cacheline size would only be assessed on its local effects, we find that this change in cacheline size will only have advantages as the miss rate decreases. However, at system level, this change will increase the stress on the bus and increase the amount of wasted energy due to unused traffic. 
Using cacheline utilization as a metric underscores the need for system-level research when changing characteristics of the cache hierarchy.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115619118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
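A minimal sketch of the cacheline-utilization metric as the abstract describes it (the fraction of fetched bytes actually touched); the access trace, access size, and the infinite-capacity simplification are our own assumptions, for illustration only. On a sparse strided trace it shows the system-level effect the authors highlight: larger lines fetch more bytes per miss while the touched fraction shrinks.

```python
# Cacheline-utilization sketch: used bytes / fetched bytes over a trace.
# Assumes every distinct line is fetched once (no capacity pressure).

from collections import defaultdict

def utilization(trace, line_size, access_size=4):
    touched = defaultdict(set)       # line base address -> touched byte offsets
    for addr in trace:
        base = addr - addr % line_size
        off = addr - base
        touched[base].update(range(off, min(off + access_size, line_size)))
    fetched = len(touched) * line_size
    return sum(len(offs) for offs in touched.values()) / fetched

if __name__ == "__main__":
    # Sparse strided accesses standing in for pointer-heavy mobile code (assumed).
    trace = [i * 192 for i in range(1000)]
    for ls in (64, 128):
        print(f"{ls:>3}-byte lines: utilization = {utilization(trace, ls):.2f}")
```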