{"title":"Fg-STP: Fine-Grain Single Thread Partitioning on Multicores","authors":"Rakesh Ranjan, Fernando Latorre, P. Marcuello, Antonio González","doi":"10.1109/HPCA.2011.5749713","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749713","url":null,"abstract":"Power and complexity issues have led the microprocessor industry to shift to Chip Multiprocessors in order to be able to better utilize the additional transistors ensured by Moore's law. While parallel programs are going to be able to take most of the advantage of these CMPs, single thread applications are not equipped to benefit from them. In this paper we propose Fine-Grain Single-Thread Partitioning (Fg-STP), a hardware-only scheme that takes advantage of CMP designs to speedup single-threaded applications. Our proposal improves single thread performance by reconfiguring two cores with the aim of collaborating on the fetching and execution of the instructions. These cores are basically conventional out-of-order cores in which execution is orchestrated using a dedicated hardware that has minimum and localized impact on the original design of the cores. This approach partitions the code at instruction granularity and differs from previous proposals on the extensive use of dependence speculation, replication and communication. These features are combined with the ability to look for parallelism on large instruction windows without any software intervention (no re-compilation or profiling hints are needed). These characteristics allow Fg-STP to speedup single thread by 18% and 7% on average over similar hardware-only approaches like Core Fusion, on medium sized and small sized 2-core CMP respectively for Spec 2006 benchmarks.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115135427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Offline symbolic analysis to infer Total Store Order","authors":"Dongyoon Lee, Mahmoud H. Said, S. Narayanasamy, Z. Yang","doi":"10.1109/HPCA.2011.5749743","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749743","url":null,"abstract":"Ability to record and replay an execution can significantly help programmers debug their programs, especially parallel programs. De-terministically replaying a multiprocessor's execution under a relaxed memory model has remained a challenging problem. This is an important problem as most modern processors only support a relaxed memory model to enable many performance critical optimizations. The most common consistency model implemented in processors is the Total Store Order (TSO). We present an efficient and low-complexity processor based solution for recording and replaying under the Total Store Order (TSO) memory model. Processor provides support for logging data fetched on cache misses. Using this information each thread can be de-terministically replayed. A TSO-compliant casual order between the shared-memory accesses executed in different threads is then inferred using an offline algorithm based on Satisfiability Modulo Theory (SMT) solver. We also discuss methods to bound the search space during offline analysis and several optimizations to reduce the offline analysis time.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134352755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols","authors":"D. Vantrease, Mikko H. Lipasti, N. Binkert","doi":"10.1109/HPCA.2011.5749723","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749723","url":null,"abstract":"This paper advocates Atomic Coherence, a framework that simplifies cache coherence protocol specification, design, and verification by decoupling races from the protocol's operation. Atomic Coherence requires conflicting coherence requests to the same addresses be serialized with a mutex before they are issued. Once issued, requests follow a predictable race-free path. Because requests are guaranteed not to race, coherence protocols are simpler and protocol extensions are straightforward. Our implementation of Atomic Coherence uses optical mutexes because optics provides very low latency. We begin with a state-of-the-art non-atomic MOEFSI protocol and demonstrate that an atomic implementation is much simpler while imposing less than a 2% performance penalty. We then show how, in the absence of races, it is easy to add support for speculative coherence and improve performance by up to 70%. Similar performance gains may be possible in a non-atomic protocol, but not without considerable effort in race management.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124521531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bloom Filter Guided Transaction Scheduling","authors":"G. Blake, R. Dreslinski, T. Mudge","doi":"10.1109/HPCA.2011.5749718","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749718","url":null,"abstract":"Contention management is an important design component to a transactional memory system. Without effective contention management to ensure forward progress, a transactional memory system can experience live-lock, which is difficult to debug in parallel programs. Early work in contention management focused on heuristic managers that reacted to conflicts between transactions by picking the most appropriate transaction to abort. Reactive methods allow conflicts to happen repeatedly as they do not try to prevent future conflicts from happening. These shortcomings of reactive contention managers have led to proposals that approach contention management as a scheduling problem — proactive managers. Proactive techniques range from throttling execution in predicted periods of high contention to preventing groups of transactions running concurrently that are predicted likely to conflict. We propose a novel transaction scheduling scheme called “Bloom Filter Guided Transaction Scheduling” (BFGTS), that uses a combination of simple hardware and Bloom filter heuristics to guide scheduling decisions and provide enhanced performance in high contention situations. We compare to two state-of-the-art transaction schedulers, “Adaptive Transaction Scheduling” and “Proactive Transaction Scheduling” and show that BFGTS attains up to a 4.6× and 1.7× improvement on high contention benchmarks respectively. Across all benchmarks it shows a 35% and 25% average performance improvement respectively.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128975828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors","authors":"H. Ghasemi, S. Draper, N. Kim","doi":"10.1109/HPCA.2011.5749715","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749715","url":null,"abstract":"To date dynamic voltage/frequency scaling (DVFS) has been one of the most successful power-reduction techniques. However, ever-increasing process variability reduces the reliability of static random access memory (SRAM) at low voltages. This limits voltage scaling to a minimum operating voltage (VDDMIN). Larger SRAM cells, that are less sensitive to process variability, allow the use of lower VDDMIN. However, large-scale memory structures, e.g., the last-level cache (LLC) (that often determines the VDDMIN of the processor), cannot afford to use such large SRAM cells due to the die area constraint. In this paper we propose low-voltage LLC architectures that exploit 1) the DVFS characteristics of workloads running on high-performance processors, 2) the trade-off between SRAM cell size and VDDMIN, and 3) the fact that at lower voltage/frequency operating states the negative performance impact of having a smaller LLC capacity is reduced. Our proposed LLC architectures provide the same maximum performance and VDDMIN as the conventional architecture, while reducing the total LLC cell area by 15%–19% with negligible average runtime increase.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129182792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SolarCore: Solar energy driven multi-core architecture power management","authors":"Chao Li, Wangyuan Zhang, Chang-Burm Cho, Tao Li","doi":"10.1109/HPCA.2011.5749729","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749729","url":null,"abstract":"The global energy crisis and environmental concerns (e.g. global warming) have driven the IT community into the green computing era. Of clean, renewable energy sources, solar power is the most promising. While efforts have been made to improve the performance-per-watt, conventional architecture power management schemes incur significant solar energy loss since they are largely workload-driven and unaware of the supply-side attributes. Existing solar power harvesting techniques improve the energy utilization but increase the environmental burden and capital investment due to the inclusion of large-scale batteries. Moreover, solar power harvesting itself cannot guarantee high performance without appropriate load adaptation. To this end, we propose SolarCore, a solar energy driven, multi-core architecture power management scheme that combines maximal power provisioning control and workload run-time optimization. Using real-world meteorological data across different geographic sites and seasons, we show that SolarCore is capable of achieving the optimal operation condition (e.g. maximal power point) of solar panels autonomously under various environmental conditions with a high green energy utilization of 82% on average. We propose efficient heuristics for allocating the time varying solar power across multiple cores and our algorithm can further improve the workload performance by 10.8% compared with that of round-robin adaptation, and at least 43% compared with that of conventional fixed-power budget control. This paper makes the first step on maximally reducing the carbon footprint of computing systems through the usage of renewable energy sources. We expect that the novel joint optimization techniques proposed in this paper will contribute to building a truly sustainable, high-performance computing environment.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128872057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically Specialized Datapaths for energy efficient computing","authors":"Venkatraman Govindaraju, C. Ho, K. Sankaralingam","doi":"10.1109/HPCA.2011.5749755","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749755","url":null,"abstract":"Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general purpose programmable processors. The key insights of this work are the following. First, applications execute in phases and these phases can be determined by creating a path-tree of basic-blocks rooted at the inner-most loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and code-mapping to DySER and evaluate the PAR-SEC, SPEC and Parboil benchmarks suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide geometric mean speedup of 2.1X (1.15X to 10X), and geometric mean energy reduction of 40% (up to 70%), and 60% energy reduction if no performance improvement is required.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115636899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor","authors":"Sanghoon Lee, Devesh Tiwari, Yan Solihin, James Tuck","doi":"10.1109/HPCA.2011.5749720","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749720","url":null,"abstract":"Queues are commonly used in multithreaded programs for synchronization and communication. However, because software queues tend to be too expensive to support finegrained parallelism, hardware queues have been proposed to reduce overhead of communication between cores. Hardware queues require modifications to the processor core and need a custom interconnect. They also pose difficulties for the operating system because their state must be preserved across context switches. To solve these problems, we propose a hardware-accelerated queue, or HAQu. HAQu adds hardware to a CMP that accelerates operations on software queues. Our design implements fast queueing through an application's address space with operations that are compatible with a fully software queue. Our design provides accelerated and OS-transparent performance in three general ways: (1) it provides a single instruction for enqueueing and dequeueing which significantly reduces the overhead when used in fine-grained threading; (2) operations on the queue are designed to leverage low-level details of the coherence protocol; and (3) hardware ensures that the full state of the queue is stored in the application's address space, thereby ensuring virtualization. We have evaluated our design in the context of application domains: offloading fine-grained checks for improved software reliability, and automatic, fine-grained parallelization using decoupled software pipelining.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"os-13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123387963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Calvin: Deterministic or not? Free will to choose","authors":"Derek Hower, P. Dudnik, M. Hill, D. Wood","doi":"10.1109/HPCA.2011.5749741","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749741","url":null,"abstract":"Most shared memory systems maximize performance by unpredictably resolving memory races. Unpredictable memory races can lead to nondeterminism in parallel programs, which can suffer from hard-to-reproduce hiesenbugs. We introduce Calvin, a shared memory model capable of executing in a conventional nondeterministic mode when performance is paramount and a deterministic mode when execution repeatability is important. Unlike prior hardware proposals for deterministic execution, Calvin exploits the flexibility of a memory consistency model weaker than sequential consistency. Specifically, Calvin logically orders memory operations into strata that are compatible with the Total Store Order (TSO). Calvin is also designed with the needs of future power-aware processors in mind, and does not require any speculation support. We develop a Calvin-MIST implementation that uses an unordered coalescing write cache, multiple-write coherence protocol, and delayed (timebomb) invalidations while maintaining TSO compatibility. Results show that Calvin-MIST can execute workloads in conventional mode at speeds comparable to a conventional system (providing compatibility) or execute deterministically for a modest average slowdown of less than 20% (when determinism is valued).","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129133333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A case for guarded power gating for multi-core processors","authors":"Niti Madan, A. Buyuktosunoglu, P. Bose, M. Annavaram","doi":"10.1109/HPCA.2011.5749737","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749737","url":null,"abstract":"Dynamic power management has become an essential part of multi-core processors and associated systems. Dedicated controllers with embedded power management firmware are now an integral part of design in such multi-core server systems. Devising a robust power management policy that meets system-intended functionality across a diverse range of workloads remains a key challenge. One of the primary issues of concern in architecting a power management policy is that of performance degradation beyond a specified limit. A secondary issue is that of negative power savings. Guarding against such “holes” in the management policy is crucial in order to ensure successful deployment and use in real customer environments. It is also important to focus on developing new models and addressing the limitations of current modeling infrastructure, in analyzing alternate management policies during the design of modern multi-core systems. In this concept paper, we highlight the above specific challenges that are faced today by the server chip and system design industry in the area of power management.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114926747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}