{"title":"Software Assisted Hardware Cache Coherence for Heterogeneous Processors","authors":"Arkaprava Basu, Sooraj Puthoor, Shuai Che, Bradford M. Beckmann","doi":"10.1145/2989081.2989092","DOIUrl":"https://doi.org/10.1145/2989081.2989092","url":null,"abstract":"Current trends suggest that future computing platforms will be increasingly heterogeneous. While these heterogeneous processors physically integrate disparate computing elements like CPUs and GPUs on a single chip, their programmability critically depends upon the ability to efficiently support cache coherence and shared virtual memory across tightly-integrated CPUs and GPUs. However, throughput-oriented GPUs easily overwhelm existing hardware coherence mechanisms that long kept the cache hierarchies in multi-core CPUs coherent. This paper proposes a novel solution called Software Assisted Hardware Coherence (SAHC) to scale cache coherence to future heterogeneous processors. We observe that the system software (Operating system and runtime) often has semantic knowledge about sharing patterns of data across the CPU and the GPU. This high-level knowledge can be utilized to effectively provide cache coherence across throughput-oriented GPUs and latency-sensitive CPUs in a heterogeneous processor. SAHC thus proposes a hybrid software-hardware mechanism that judiciously uses hardware coherence only when needed while using software's knowledge to filter out most of the unnecessary coherence traffic. Our evaluation suggests that SAHC can often eliminate up to 98-100% of the hardware coherence lookups, resulting up to 49% reduction in runtime.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"447 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123099523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Processing Acceleration with Resistive Memory-based Computation","authors":"M. Imani, Yan Cheng, T. Simunic","doi":"10.1145/2989081.2989086","DOIUrl":"https://doi.org/10.1145/2989081.2989086","url":null,"abstract":"The Internet of Things significantly increases the amount of data generated that strains the processing capability of current computing systems. Approximate computing can accelerate the computation and dramatically reduce the energy consumption with controllable accuracy loss. In this paper, we propose a Resistive Associative Unit, called RAU, which approximates computation alongside processing cores. RAU exploits the data locality with associative memory. It finds a row which has the closest distance to input patterns while considering the impact of each bit index on the computation accuracy. Our evaluation shows that RAU can accelerate the GPGPU computation by 1.15x and improve the energy efficiency by 36% at only 10% accuracy loss.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116102046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Time and Energy for Complex Accesses to a Hybrid Memory Cube","authors":"J. Schmidt, H. Fröning, U. Brüning","doi":"10.1145/2989081.2989099","DOIUrl":"https://doi.org/10.1145/2989081.2989099","url":null,"abstract":"Through-Silicon Vias (TSVs) and three-dimensional die stacking technologies are enabling a combination of DRAM and CMOS die layer within a single stack, leading to stacked memory. Functionality that was previously associated with the microprocessor, e.g. memory controllers, can now be integrated into the memory cube, allowing to packetize the interface for improved performance and reduced energy consumption per bit. Complex memory networks become feasible as the logic layer can include routing functionality. The massive amount of connectivity among the different die layers by the use of TSVs in combination with the packetized interface leads to a substantial improvement of memory access bandwidth. However, leveraging this vast bandwidth increase from an application point of view is not as simple as it seems. In this paper, we point out multiple pitfalls when accessing a stacked memory, namely the Hybrid Memory Cube (HMC) in combination with the publicly available openHMC host controller. HMCs internal architecture still has many similarities with traditional DRAM chips like the page-based access, but it is internally partitioned into multiple vaults. Each vault comprises a memory controller and multiple DRAM banks. Pages are rather small and rely on a closed-page policy. Also, the ratio of read and write operations has an optimum of which the application should be aware. The built-in support for atomic operations sounds like a great opportunity for off-loading, but the impact of contention cannot be neglected. Besides exploring such performance pitfalls, we further start exploring the energy efficiency of memory accesses to stacked memory.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132978751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the Second International Symposium on Memory Systems","authors":"B. Jacob","doi":"10.1145/2989081","DOIUrl":"https://doi.org/10.1145/2989081","url":null,"abstract":"","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132352806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MicroRefresh: Minimizing Refresh Overhead in DRAM Caches","authors":"N. Gulur, R. Govindarajan, M. Mehendale","doi":"10.1145/2989081.2989100","DOIUrl":"https://doi.org/10.1145/2989081.2989100","url":null,"abstract":"DRAM memory systems require periodic recharging to avoid loss of data from leaky capacitors. These refresh operations consume energy and reduce the duration of time for which the DRAM banks are available to service memory requests. Higher DRAM density and 3D-stacking aggravate the refresh overheads, incurring even higher energy and performance costs. 3D-stacked DRAM and other emerging on-chip High Bandwidth Memory (HBM) technologies which are widely considered to be changing the landscape of memory hierarchy in future heterogeneous and many-core architectures could suffer significantly from refresh overheads. Such large on-chip memory, when used as a very large last-level cache, however, provides opportunities for addressing the refresh overheads. In this work, we propose MicroRefresh, a scheme for almost eliminating the refresh overhead in DRAM caches. MicroRefresh eliminates unwanted refresh of recently accessed DRAM pages; it takes advantage of the relative latency difference between on-chip and off-chip DRAM and achieves a fine balance of usage of system resources by aggressively opportunistically eliminating refresh of older DRAM pages. It tolerates any resulting increase in cache misses by leveraging the under-utilized main memory bandwidth. The resulting organization eliminates the energy and performance overhead of refresh operations in the DRAM cache to achieve overall performance and energy improvement. Across both 4-core and 8-core workloads, MicroRefresh eliminates 92% the refresh energy consumed in the baseline periodic refresh mechanism. Further this is accompanied by performance improvements of upto 10%, with average improvements of 3.9% and 3.4% in 4-core and 8-core respectively.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"174 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131161035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large-Memory Nodes for Energy Efficient High-Performance Computing","authors":"D. Zivanovic, M. Radulovic, Germán Llort, D. Zaragoza, J. Strassburg, P. Carpenter, Petar Radojkovic, E. Ayguadé","doi":"10.1145/2989081.2989083","DOIUrl":"https://doi.org/10.1145/2989081.2989083","url":null,"abstract":"Energy consumption is by far the most important contributor to HPC cluster operational costs, and it accounts for a significant share of the total cost of ownership. Advanced energy-saving techniques in HPC components have received significant research and development effort, but a simple measure that can dramatically reduce energy consumption is often overlooked. We show that, in capacity computing, where many small to medium-sized jobs have to be solved at the lowest cost, a practical energy-saving approach is to scale-in the application on large-memory nodes. We evaluate scaling-in; i.e. decreasing the number of application processes and compute nodes (servers) to solve a fixed-sized problem, using a set of HPC applications running in a production system. Using standard-memory nodes, we obtain average energy savings of 36%, already a huge figure. We show that the main source of these energy savings is a decrease in the node-hours (node_hours = #nodes x exe_time), which is a consequence of the more efficient use of hardware resources. Scaling-in is limited by the per-node memory capacity. We therefore consider using large-memory nodes to enable a greater degree of scaling-in. We show that the additional energy savings, of up to 52%, mean that in many cases the investment in upgrading the hardware would be recovered in a typical system lifetime of less than five years.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129213037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prefetching as a Potentially Effective Technique for Hybrid Memory Optimization","authors":"Mahzabeen Islam, Soumi Banerjee, Mitesh R. Meswani, K. Kavi","doi":"10.1145/2989081.2989129","DOIUrl":"https://doi.org/10.1145/2989081.2989129","url":null,"abstract":"The promise of 3D-stacked memory solving the memory wall has led to many emerging architectures that integrate 3D-stacked memory into processor memory in a variety of ways including systems that utilize different memory technologies, with different performance and power characteristics, to comprise the system memory. It then becomes necessary to manage these memories such that we get the performance of the fastest memory while having the capacity of the slower but larger memories. Some research in industry and academia proposed using 3D-stacked DRAM as a hardware managed cache. More recently, particularly pushed by the demands for ever larger capacities, researchers are exploring the use of multiple memory technologies as a single main memory. The main challenge for such flat-address-space memories is the placement and migration of memory pages to increase the number of requests serviced from faster memory, as well as managing overhead due to page migrations. In this paper we ask a different question: can traditional prefetching be a viable solution for effective management of hybrid memories? We conjecture that by tuning well-known prefetch mechanism for hybrid memories we can achieve substantial performance improvement. To test our conjecture, we compared the state of the art CAMEO migration policy with a Markov-like prefetcher for a hybrid memory consisting of HBM (3D-stacked DRAM) and Phase Change Memory (PCM) using a set of SPEC CPU2006 and several HPC benchmarks. We find that CAMEO provides better performance improvement than prefetching for 2/3rd of the workloads (by 59%) and prefetching is better than CAMEO for the remaining 1/3rd (by 19%). The EDP analysis shows that the prefetching solution improves EDP over the no-prefetching baseline whereas CAMEO does worse in terms of average EDP. These results indicate that prefetching should be reconsidered as a supplementary technique to data migration.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"241 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131403943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrated Thermal Analysis for Processing In Die-Stacking Memory","authors":"Yuxiong Zhu, Borui Wang, Dong Li, Jishen Zhao","doi":"10.1145/2989081.2989093","DOIUrl":"https://doi.org/10.1145/2989081.2989093","url":null,"abstract":"Recent application and technology trends bring a renaissance of the processing-in-memory (PIM), which was envisioned decades ago. In particular, die-stacking and silicon interposer technologies enable the integration of memory, PIMs, and the host CPU in a single chip. Yet the integration substantially increases system power density. This can impose substantial thermal challenges to the feasibility of such systems. In this paper, we comprehensively study the thermal feasibility of integrated systems consisting of the host CPU, die-stacking DRAMs, and various types of PIMs. Compared with most previous thermal studies that only focus on the memory stack, we investigate the thermal distribution of the whole processor-memory system. Furthermore, we examine the feasibility of various cooling solutions and feasible scale of various PIM designs under given thermal and area constraints. Finally, we demonstrate system run-time thermal feasibility by executing two high-performance computing applications with PIM-based systems. Based on our experimental studies, we reveal a set of thermal implications for PIM-based system design and configuration.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117228953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying Software-based Memory Error Correction for In-Memory Key-Value Store: Case Studies on Memcached and RAMCloud","authors":"Yin Li, Hao Wang, Xiaoqing Zhao, Hongbin Sun, Tong Zhang","doi":"10.1145/2989081.2989091","DOIUrl":"https://doi.org/10.1145/2989081.2989091","url":null,"abstract":"With the nature of being memory hungry, in-memory key-value store is fundamentally subject to very high memory cost and energy consumption. Intuitively, the availability of a strong memory error correction at sufficiently small redundancy overhead could be leveraged to reduce memory cost and/or energy consumption. Nevertheless, current computing systems handle memory error correction solely in the hardware stack with very weak error correction strength. This paper for the first time studies the practical feasibility of implementing strong memory error correction code (ECC) in the software stack for in-memory key-value store without incurring significant speed performance penalty. This is fundamentally enabled by the low memory bandwidth utilization and relatively simple data structure of in-memory key-value store, which are actually shared with many other datacenter applications (e.g., Web search). This paper presents several design techniques to optimize software-based ECC implementation for in-memory key-value store, and elaborates on several important design issues. Using Memcached and RAMCloud as test vehicles, this work shows that the proposed design solution can improve the memory error correction strength by several orders of magnitude at similar (and even less) coding redundancy compared with current hardware-based design practice, and meanwhile incur less than 6% degradation of in-memory key-value store operational throughput.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115821505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the feasibility of storage class memory as main memory","authors":"G. S. Lloyd, M. Gokhale","doi":"10.1145/2989081.2989118","DOIUrl":"https://doi.org/10.1145/2989081.2989118","url":null,"abstract":"Storage class memory offers the prospect of large capacity persistent memory with DRAM-like access latency. In this work, we evaluate the performance of a small set of benchmarks using SCM as main memory. We use an FPGA emulator to model a range of memory latencies spanning DRAM to latency projected for SCM and beyond. Our work highlights the performance impact of higher latency and identifies conditions by which SCM can effectively be used as main memory.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127159987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}