{"title":"Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning","authors":"Mingli Xie, Dong Tong, Kan Huang, Xu Cheng","doi":"10.1109/HPCA.2014.6835945","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835945","url":null,"abstract":"Applications running concurrently in CMP systems interfere with each other at DRAM memory, leading to poor system performance and fairness. Memory access scheduling reorders memory requests to improve system throughput and fairness. However, it cannot resolve the interference issue effectively. To reduce interference, memory partitioning divides memory resource among threads. Memory channel partitioning maps the data of threads that are likely to severely interfere with each other to different channels. However, it allocates memory resource unfairly and physically exacerbates memory contention of intensive threads, thus ultimately resulting in the increased slowdown of these threads and high system unfairness. Bank partitioning divides memory banks among cores and eliminates interference. However, previous equal bank partitioning restricts the number of banks available to individual thread and reduces bank level parallelism. In this paper, we first propose a Dynamic Bank Partitioning (DBP), which partitions memory banks according to threads' requirements for bank amounts. DBP compensates for the reduced bank level parallelism caused by equal bank partitioning. The key principle is to profile threads' memory characteristics at run-time and estimate their demands for bank amount, then use the estimation to direct our bank partitioning. Second, we observe that bank partitioning and memory scheduling are orthogonal in the sense; both methods can be illuminated when they are applied together. Therefore, we present a comprehensive approach which integrates Dynamic Bank Partitioning and Thread Cluster Memory scheduling (DBP-TCM, TCM is one of the best memory scheduling) to further improve system performance. Experimental results show that the proposed DBP improves system performance by 4.3% and improves system fairness by 16% over equal bank partitioning. Compared to TCM, DBP-TCM improves system throughput by 6.2% and fairness by 16.7%. When compared with MCP, DBP-TCM provides 5.3% better system throughput and 37% better system fairness. We conclude that our methods are effective in improving both system throughput and fairness.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"97 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115701172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MRPB: Memory request prioritization for massively parallel processors","authors":"Wenhao Jia, K. Shaw, M. Martonosi","doi":"10.1109/HPCA.2014.6835938","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835938","url":null,"abstract":"Massively parallel, throughput-oriented systems such as graphics processing units (GPUs) offer high performance for a broad range of programs. They are, however, complex to program, especially because of their intricate memory hierarchies with multiple address spaces. In response, modern GPUs have widely adopted caches, hoping to providing smoother reductions in memory access traffic and latency. Unfortunately, GPU caches often have mixed or unpredictable performance impact due to cache contention that results from the high thread counts in GPUs. We propose the memory request prioritization buffer (MRPB) to ease GPU programming and improve GPU performance. This hardware structure improves caching efficiency of massively parallel workloads by applying two prioritization methods-request reordering and cache bypassing-to memory requests before they access a cache. MRPB then releases requests into the cache in a more cache-friendly order. The result is drastically reduced cache contention and improved use of the limited per-thread cache capacity. For a simulated 16KB L1 cache, MRPB improves the average performance of the entire PolyBench and Rodinia suites by 2.65× and 1.27× respectively, outperforming a state-of-the-art GPU cache management technique.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124982692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Increasing TLB reach by exploiting clustering in page translations","authors":"B. Pham, A. Bhattacharjee, Yasuko Eckert, G. Loh","doi":"10.1109/HPCA.2014.6835964","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835964","url":null,"abstract":"The steadily increasing sizes of main memory capacities require corresponding increases in the processor's translation lookaside buffer (TLB) resources to avoid performance bottlenecks. Large operating system page sizes can mitigate the bottleneck with a smaller TLB, but most OSs and applications do not fully utilize the large-page support in current hardware. Recent work has shown that, while not guaranteed, some virtual-to-physical page mappings exhibit “contiguous” spatial locality in which consecutive virtual pages map to consecutive physical pages. Such locality provides opportunities to coalesce “adjacent” TLB entries for increased reach. We observe that beyond simple adjacent-entry coalescing, many more translations exhibit “clustered” spatial locality in which a group or cluster of nearby virtual pages map to a similarly clustered set of physical pages. In this work, we provide a detailed characterization of the spatial locality among the virtual-to-physical translations. Based on this characterization, we present a multi-granular TLB organization that significantly increases its effective reach and reduces miss rates substantially while requiring no additional OS support. Our evaluation shows that the multi-granular design outperforms conventional TLBs and the recently proposed coalesced TLBs technique.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"473 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127549922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Precision-aware soft error protection for GPUs","authors":"David J. Palframan, N. Kim, Mikko H. Lipasti","doi":"10.1109/HPCA.2014.6835966","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835966","url":null,"abstract":"With the advent of general-purpose GPU computing, it is becoming increasingly desirable to protect GPUs from soft errors. For high computation throughout, GPUs must store a significant amount of state and have many execution units. The high power and area costs of full protection from soft errors make selective protection techniques attractive. Such approaches provide maximum error coverage within a fixed area or power limit, but typically treat all errors equally. We observe that for many floating-point-intensive GPGPU applications, small magnitude errors may have little effect on results, while large magnitude errors can be amplified to have a significant negative impact. We therefore propose a novel precision-aware protection approach for the GPU execution logic and register file to mitigate large magnitude errors. We also propose an architecture modification to optimize error coverage for integer computations. Our approach combines selective logic hardening, targeted checker circuits, and intelligent register file encoding for best error protection. We demonstrate that our approach can reduce the mean error magnitude by up to 87% compared to a traditional selective protection approach with the same overhead.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"153 2-4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114037896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving in-memory database index performance with Intel® Transactional Synchronization Extensions","authors":"Tomas Karnagel, R. Dementiev, Ravi Rajwar, K. Lai, T. Legler, B. Schlegel, Wolfgang Lehner","doi":"10.1109/HPCA.2014.6835957","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835957","url":null,"abstract":"The increasing number of cores every generation poses challenges for high-performance in-memory database systems. While these systems use sophisticated high-level algorithms to partition a query or run multiple queries in parallel, they also utilize low-level synchronization mechanisms to synchronize access to internal database data structures. Developers often spend significant development and verification effort to improve concurrency in the presence of such synchronization. The Intel® Transactional Synchronization Extensions (Intel® TSX) in the 4th Generation Core™ Processors enable hardware to dynamically determine whether threads actually need to synchronize even in the presence of conservatively used synchronization. This paper evaluates the effectiveness of such hardware support in a commercial database. We focus on two index implementations: a B+Tree Index and the Delta Storage Index used in the SAP HANA® database system. We demonstrate that such support can improve performance of database data structures such as index trees and presents a compelling opportunity for the development of simpler, scalable, and easy-to-verify algorithms.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125598846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mosaic: Exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules","authors":"Aditya Agrawal, Amin Ansari, J. Torrellas","doi":"10.1109/HPCA.2014.6835978","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835978","url":null,"abstract":"EDRAM cells require periodic refresh, which ends up consuming substantial energy for large last-level caches. In practice, it is well known that different eDRAM cells can exhibit very different charge-retention properties. Unfortunately, current systems pessimistically assume worst-case retention times, and end up refreshing all the cells at a conservatively-high rate. In this paper, we propose an alternative approach. We use known facts about the factors that determine the retention properties of cells to build a new model of eDRAM retention times. The model is called Mosaic. The model shows that the retention times of cells in large eDRAM modules exhibit spatial correlation. Therefore, we logically divide the eDRAM module into regions or tiles, profile the retention properties of each tile, and program their refresh requirements in small counters in the cache controller. With this architecture, also called Mosaic, we refresh each tile at a different rate. The result is a 20x reduction in the number of refreshes in large eDRAM modules - practically eliminating refresh as a source of energy consumption.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131908896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing the cost of persistence for nonvolatile heaps in end user devices","authors":"Sudarsun Kannan, Ada Gavrilovska, K. Schwan","doi":"10.1109/HPCA.2014.6835960","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835960","url":null,"abstract":"This paper explores the performance implications of using future byte addressable non-volatile memory (NVM) like PCM in end client devices. We explore how to obtain dual benefits - increased capacity and faster persistence - with low overhead and cost. Specifically, while increasing memory capacity can be gained by treating NVM as virtual memory, its use of persistent data storage incurs high consistency (frequent cache flushes) and durability (logging for failure) overheads, referred to as `persistence cost'. These not only affect the applications causing them, but also other applications relying on the same cache and/or memory hierarchy. This paper analyzes and quantifies in detail the performance overheads of persistence, which include (1) the aforementioned cache interference as well as (2) memory allocator overheads, and finally, (3) durability costs due to logging. Novel solutions to overcome such overheads include (1) a page contiguity algorithm that reduces interference-related cache misses, (2) a cache efficient NVM write aware memory allocator that reduces cache line flushes of allocator state by 8X, and (3) hybrid logging that reduces durability overheads substantially. With these solutions, experimental evaluations with different end user applications and SPEC2006 benchmarks show up to 12% reductions in cache misses, thereby reducing the total number of NVM writes.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114607933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3D stacking of high-performance processors","authors":"P. Emma, A. Buyuktosunoglu, Michael B. Healy, K. Kailas, Valentin Puente, R. Yu, A. Hartstein, P. Bose, J. Moreno, E. Kursun","doi":"10.1109/HPCA.2014.6835959","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835959","url":null,"abstract":"In most 3D work to date, people have looked at two situations: 1) a case in which power density is not a problem, and the parts of a processor and/or entire processors can be stacked atop each other, and 2) a case in which power density is limited, and storage is stacked atop processors. In this paper, we consider the case in which power density is a limitation, yet we stack processors atop processors. We also will discuss some of the physical limitations today that render many of the good ideas presented in other work impractical, and what would be required in the technology to make them feasible. In the high-performance regime, circuits are not designed to be “power efficient;” they're designed to be fast. In power-efficient design, the speed and power of a processor should be nearly proportional. In the high-performance regime, the frequency is (ever progressingly) sublinear in power. Thus, when the power density is constrained - as it is in high-performance machines, there may be opportunities to selectively exploit parallelism in workloads by running processor-on-processor systems at the same power, yet at much greater than half speed.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129275475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PVCoherence: Designing flat coherence protocols for scalable verification","authors":"Meng Zhang, Jesse D. Bingham, John Erickson, Daniel J. Sorin","doi":"10.1109/MM.2015.48","DOIUrl":"https://doi.org/10.1109/MM.2015.48","url":null,"abstract":"The goal of this work is to design cache coherence protocols with many cores that can be verified with state-of-the-art automated verification methodologies. In particular, we focus on flat (non-hierarchical) coherence protocols, and we use a mostly-automated methodology based on parametric verification (PV). We propose several design guidelines that architects should follow if they want to design protocols that can be parametrically verified. We experimentally evaluate performance, storage overhead, and scalability of a protocol verified with PV compared to a highly optimized protocol that cannot be verified with PV.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126494621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QORE: A fault tolerant network-on-chip architecture with power-efficient quad-function channel (QFC) buffers","authors":"D. DiTomaso, Avinash Karanth Kodi, A. Louri","doi":"10.1109/HPCA.2014.6835942","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835942","url":null,"abstract":"Network-on-Chips (NoCs) are quickly becoming the standard communication paradigm for the growing number of cores on the chip. While NoCs can deliver sufficient bandwidth and enhance scalability, NoCs suffer from high power consumption due to the router microarchitecture and communication channels that facilitate inter-core communication. As technology keeps scaling down in the nanometer regime, unpredictable device behavior due to aging, infant mortality, design defects, soft errors, aggressive design, and process-voltage-temperature variations, will increase and will result in a significant increase in faults (both permanent and transient) and hardware failures. In this paper, we propose QORE - a fault tolerant NoC architecture with Quad-Function Channel (QFC) buffers. The use of QFC buffers and their associated control (link and fault controllers) enhance fault-tolerance by allowing the NoC to dynamically adapt to faults at the link level and reverse propagation direction to avoid faulty links. Additionally, QFC buffers reduce router power and improve performance by eliminating in-router buffering. Our simulation results using real benchmarks and synthetic traffic mixes show that QORE improves speedup by 1.3× and throughput by 2.3× when compared to state-of-the art fault tolerant NoCs designs such as Ariadne and Vicis. Moreover, using Synopsys Design Compiler, we also show that network power in QORE is reduced by 21% with minimal control overhead.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115037851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}