{"title":"Fg-STP: Fine-Grain Single Thread Partitioning on Multicores","authors":"Rakesh Ranjan, Fernando Latorre, P. Marcuello, Antonio González","doi":"10.1109/HPCA.2011.5749713","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749713","url":null,"abstract":"Power and complexity issues have led the microprocessor industry to shift to Chip Multiprocessors in order to be able to better utilize the additional transistors ensured by Moore's law. While parallel programs are going to be able to take most of the advantage of these CMPs, single thread applications are not equipped to benefit from them. In this paper we propose Fine-Grain Single-Thread Partitioning (Fg-STP), a hardware-only scheme that takes advantage of CMP designs to speedup single-threaded applications. Our proposal improves single thread performance by reconfiguring two cores with the aim of collaborating on the fetching and execution of the instructions. These cores are basically conventional out-of-order cores in which execution is orchestrated using a dedicated hardware that has minimum and localized impact on the original design of the cores. This approach partitions the code at instruction granularity and differs from previous proposals on the extensive use of dependence speculation, replication and communication. These features are combined with the ability to look for parallelism on large instruction windows without any software intervention (no re-compilation or profiling hints are needed). These characteristics allow Fg-STP to speedup single thread by 18% and 7% on average over similar hardware-only approaches like Core Fusion, on medium sized and small sized 2-core CMP respectively for Spec 2006 benchmarks.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115135427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Offline symbolic analysis to infer Total Store Order","authors":"Dongyoon Lee, Mahmoud H. Said, S. Narayanasamy, Z. Yang","doi":"10.1109/HPCA.2011.5749743","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749743","url":null,"abstract":"Ability to record and replay an execution can significantly help programmers debug their programs, especially parallel programs. De-terministically replaying a multiprocessor's execution under a relaxed memory model has remained a challenging problem. This is an important problem as most modern processors only support a relaxed memory model to enable many performance critical optimizations. The most common consistency model implemented in processors is the Total Store Order (TSO). We present an efficient and low-complexity processor based solution for recording and replaying under the Total Store Order (TSO) memory model. Processor provides support for logging data fetched on cache misses. Using this information each thread can be de-terministically replayed. A TSO-compliant casual order between the shared-memory accesses executed in different threads is then inferred using an offline algorithm based on Satisfiability Modulo Theory (SMT) solver. We also discuss methods to bound the search space during offline analysis and several optimizations to reduce the offline analysis time.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134352755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols","authors":"D. Vantrease, Mikko H. Lipasti, N. Binkert","doi":"10.1109/HPCA.2011.5749723","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749723","url":null,"abstract":"This paper advocates Atomic Coherence, a framework that simplifies cache coherence protocol specification, design, and verification by decoupling races from the protocol's operation. Atomic Coherence requires conflicting coherence requests to the same addresses be serialized with a mutex before they are issued. Once issued, requests follow a predictable race-free path. Because requests are guaranteed not to race, coherence protocols are simpler and protocol extensions are straightforward. Our implementation of Atomic Coherence uses optical mutexes because optics provides very low latency. We begin with a state-of-the-art non-atomic MOEFSI protocol and demonstrate that an atomic implementation is much simpler while imposing less than a 2% performance penalty. We then show how, in the absence of races, it is easy to add support for speculative coherence and improve performance by up to 70%. Similar performance gains may be possible in a non-atomic protocol, but not without considerable effort in race management.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124521531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bloom Filter Guided Transaction Scheduling","authors":"G. Blake, R. Dreslinski, T. Mudge","doi":"10.1109/HPCA.2011.5749718","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749718","url":null,"abstract":"Contention management is an important design component to a transactional memory system. Without effective contention management to ensure forward progress, a transactional memory system can experience live-lock, which is difficult to debug in parallel programs. Early work in contention management focused on heuristic managers that reacted to conflicts between transactions by picking the most appropriate transaction to abort. Reactive methods allow conflicts to happen repeatedly as they do not try to prevent future conflicts from happening. These shortcomings of reactive contention managers have led to proposals that approach contention management as a scheduling problem — proactive managers. Proactive techniques range from throttling execution in predicted periods of high contention to preventing groups of transactions running concurrently that are predicted likely to conflict. We propose a novel transaction scheduling scheme called “Bloom Filter Guided Transaction Scheduling” (BFGTS), that uses a combination of simple hardware and Bloom filter heuristics to guide scheduling decisions and provide enhanced performance in high contention situations. We compare to two state-of-the-art transaction schedulers, “Adaptive Transaction Scheduling” and “Proactive Transaction Scheduling” and show that BFGTS attains up to a 4.6× and 1.7× improvement on high contention benchmarks respectively. Across all benchmarks it shows a 35% and 25% average performance improvement respectively.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128975828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors","authors":"H. Ghasemi, S. Draper, N. Kim","doi":"10.1109/HPCA.2011.5749715","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749715","url":null,"abstract":"To date dynamic voltage/frequency scaling (DVFS) has been one of the most successful power-reduction techniques. However, ever-increasing process variability reduces the reliability of static random access memory (SRAM) at low voltages. This limits voltage scaling to a minimum operating voltage (VDDMIN). Larger SRAM cells, that are less sensitive to process variability, allow the use of lower VDDMIN. However, large-scale memory structures, e.g., the last-level cache (LLC) (that often determines the VDDMIN of the processor), cannot afford to use such large SRAM cells due to the die area constraint. In this paper we propose low-voltage LLC architectures that exploit 1) the DVFS characteristics of workloads running on high-performance processors, 2) the trade-off between SRAM cell size and VDDMIN, and 3) the fact that at lower voltage/frequency operating states the negative performance impact of having a smaller LLC capacity is reduced. Our proposed LLC architectures provide the same maximum performance and VDDMIN as the conventional architecture, while reducing the total LLC cell area by 15%–19% with negligible average runtime increase.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129182792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SolarCore: Solar energy driven multi-core architecture power management","authors":"Chao Li, Wangyuan Zhang, Chang-Burm Cho, Tao Li","doi":"10.1109/HPCA.2011.5749729","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749729","url":null,"abstract":"The global energy crisis and environmental concerns (e.g. global warming) have driven the IT community into the green computing era. Of clean, renewable energy sources, solar power is the most promising. While efforts have been made to improve the performance-per-watt, conventional architecture power management schemes incur significant solar energy loss since they are largely workload-driven and unaware of the supply-side attributes. Existing solar power harvesting techniques improve the energy utilization but increase the environmental burden and capital investment due to the inclusion of large-scale batteries. Moreover, solar power harvesting itself cannot guarantee high performance without appropriate load adaptation. To this end, we propose SolarCore, a solar energy driven, multi-core architecture power management scheme that combines maximal power provisioning control and workload run-time optimization. Using real-world meteorological data across different geographic sites and seasons, we show that SolarCore is capable of achieving the optimal operation condition (e.g. maximal power point) of solar panels autonomously under various environmental conditions with a high green energy utilization of 82% on average. We propose efficient heuristics for allocating the time varying solar power across multiple cores and our algorithm can further improve the workload performance by 10.8% compared with that of round-robin adaptation, and at least 43% compared with that of conventional fixed-power budget control. This paper makes the first step on maximally reducing the carbon footprint of computing systems through the usage of renewable energy sources. We expect that the novel joint optimization techniques proposed in this paper will contribute to building a truly sustainable, high-performance computing environment.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128872057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically Specialized Datapaths for energy efficient computing","authors":"Venkatraman Govindaraju, C. Ho, K. Sankaralingam","doi":"10.1109/HPCA.2011.5749755","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749755","url":null,"abstract":"Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general purpose programmable processors. The key insights of this work are the following. First, applications execute in phases and these phases can be determined by creating a path-tree of basic-blocks rooted at the inner-most loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and code-mapping to DySER and evaluate the PAR-SEC, SPEC and Parboil benchmarks suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide geometric mean speedup of 2.1X (1.15X to 10X), and geometric mean energy reduction of 40% (up to 70%), and 60% energy reduction if no performance improvement is required.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115636899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor","authors":"Sanghoon Lee, Devesh Tiwari, Yan Solihin, James Tuck","doi":"10.1109/HPCA.2011.5749720","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749720","url":null,"abstract":"Queues are commonly used in multithreaded programs for synchronization and communication. However, because software queues tend to be too expensive to support finegrained parallelism, hardware queues have been proposed to reduce overhead of communication between cores. Hardware queues require modifications to the processor core and need a custom interconnect. They also pose difficulties for the operating system because their state must be preserved across context switches. To solve these problems, we propose a hardware-accelerated queue, or HAQu. HAQu adds hardware to a CMP that accelerates operations on software queues. Our design implements fast queueing through an application's address space with operations that are compatible with a fully software queue. Our design provides accelerated and OS-transparent performance in three general ways: (1) it provides a single instruction for enqueueing and dequeueing which significantly reduces the overhead when used in fine-grained threading; (2) operations on the queue are designed to leverage low-level details of the coherence protocol; and (3) hardware ensures that the full state of the queue is stored in the application's address space, thereby ensuring virtualization. We have evaluated our design in the context of application domains: offloading fine-grained checks for improved software reliability, and automatic, fine-grained parallelization using decoupled software pipelining.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"os-13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123387963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Calvin: Deterministic or not? Free will to choose","authors":"Derek Hower, P. Dudnik, M. Hill, D. Wood","doi":"10.1109/HPCA.2011.5749741","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749741","url":null,"abstract":"Most shared memory systems maximize performance by unpredictably resolving memory races. Unpredictable memory races can lead to nondeterminism in parallel programs, which can suffer from hard-to-reproduce hiesenbugs. We introduce Calvin, a shared memory model capable of executing in a conventional nondeterministic mode when performance is paramount and a deterministic mode when execution repeatability is important. Unlike prior hardware proposals for deterministic execution, Calvin exploits the flexibility of a memory consistency model weaker than sequential consistency. Specifically, Calvin logically orders memory operations into strata that are compatible with the Total Store Order (TSO). Calvin is also designed with the needs of future power-aware processors in mind, and does not require any speculation support. We develop a Calvin-MIST implementation that uses an unordered coalescing write cache, multiple-write coherence protocol, and delayed (timebomb) invalidations while maintaining TSO compatibility. Results show that Calvin-MIST can execute workloads in conventional mode at speeds comparable to a conventional system (providing compatibility) or execute deterministically for a modest average slowdown of less than 20% (when determinism is valued).","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129133333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A case for guarded power gating for multi-core processors","authors":"Niti Madan, A. Buyuktosunoglu, P. Bose, M. Annavaram","doi":"10.1109/HPCA.2011.5749737","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749737","url":null,"abstract":"Dynamic power management has become an essential part of multi-core processors and associated systems. Dedicated controllers with embedded power management firmware are now an integral part of design in such multi-core server systems. Devising a robust power management policy that meets system-intended functionality across a diverse range of workloads remains a key challenge. One of the primary issues of concern in architecting a power management policy is that of performance degradation beyond a specified limit. A secondary issue is that of negative power savings. Guarding against such “holes” in the management policy is crucial in order to ensure successful deployment and use in real customer environments. It is also important to focus on developing new models and addressing the limitations of current modeling infrastructure, in analyzing alternate management policies during the design of modern multi-core systems. In this concept paper, we highlight the above specific challenges that are faced today by the server chip and system design industry in the area of power management.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114926747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}