2011 IEEE 17th International Symposium on High Performance Computer Architecture最新文献_第2页

Dynamically Specialized Datapaths for energy efficient computing 用于节能计算的动态专用数据路径

2011 IEEE 17th International Symposium on High Performance Computer Architecture Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749755

Venkatraman Govindaraju, C. Ho, K. Sankaralingam

{"title":"Dynamically Specialized Datapaths for energy efficient computing","authors":"Venkatraman Govindaraju, C. Ho, K. Sankaralingam","doi":"10.1109/HPCA.2011.5749755","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749755","url":null,"abstract":"Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general purpose programmable processors. The key insights of this work are the following. First, applications execute in phases and these phases can be determined by creating a path-tree of basic-blocks rooted at the inner-most loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and code-mapping to DySER and evaluate the PAR-SEC, SPEC and Parboil benchmarks suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide geometric mean speedup of 2.1X (1.15X to 10X), and geometric mean energy reduction of 40% (up to 70%), and 60% energy reduction if no performance improvement is required.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115636899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 218

HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor HAQu:在芯片多处理器上为细粒度线程进行的硬件加速排队

2011 IEEE 17th International Symposium on High Performance Computer Architecture Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749720

Sanghoon Lee, Devesh Tiwari, Yan Solihin, James Tuck

{"title":"HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor","authors":"Sanghoon Lee, Devesh Tiwari, Yan Solihin, James Tuck","doi":"10.1109/HPCA.2011.5749720","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749720","url":null,"abstract":"Queues are commonly used in multithreaded programs for synchronization and communication. However, because software queues tend to be too expensive to support finegrained parallelism, hardware queues have been proposed to reduce overhead of communication between cores. Hardware queues require modifications to the processor core and need a custom interconnect. They also pose difficulties for the operating system because their state must be preserved across context switches. To solve these problems, we propose a hardware-accelerated queue, or HAQu. HAQu adds hardware to a CMP that accelerates operations on software queues. Our design implements fast queueing through an application's address space with operations that are compatible with a fully software queue. Our design provides accelerated and OS-transparent performance in three general ways: (1) it provides a single instruction for enqueueing and dequeueing which significantly reduces the overhead when used in fine-grained threading; (2) operations on the queue are designed to leverage low-level details of the coherence protocol; and (3) hardware ensures that the full state of the queue is stored in the application's address space, thereby ensuring virtualization. We have evaluated our design in the context of application domains: offloading fine-grained checks for improved software reliability, and automatic, fine-grained parallelization using decoupled software pipelining.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"os-13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123387963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 30

Data-triggered threads: Eliminating redundant computation 数据触发线程:消除冗余计算

2011 IEEE 17th International Symposium on High Performance Computer Architecture Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749727

Hung-Wei Tseng, D. Tullsen

引用次数: 34

CHIPPER: A low-complexity bufferless deflection router chip:一种低复杂度的无缓冲偏转路由器

2011 IEEE 17th International Symposium on High Performance Computer Architecture Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749724

Chris Fallin, Chris Craik, O. Mutlu

{"title":"CHIPPER: A low-complexity bufferless deflection router","authors":"Chris Fallin, Chris Craik, O. Mutlu","doi":"10.1109/HPCA.2011.5749724","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749724","url":null,"abstract":"As Chip Multiprocessors (CMPs) scale to tens or hundreds of nodes, the interconnect becomes a significant factor in cost, energy consumption and performance. Recent work has explored many design tradeoffs for networks-on-chip (NoCs) with novel router architectures to reduce hardware cost. In particular, recent work proposes bufferless deflection routing to eliminate router buffers. The high cost of buffers makes this choice potentially appealing, especially for low-to-medium network loads. However, current bufferless designs usually add complexity to control logic. Deflection routing introduces a sequential dependence in port allocation, yielding a slow critical path. Explicit mechanisms are required for livelock freedom due to the non-minimal nature of deflection. Finally, deflection routing can fragment packets, and the reassembly buffers require large worst-case sizing to avoid deadlock, due to the lack of network backpressure. The complexity that arises out of these three problems has discouraged practical adoption of bufferless routing. To counter this, we propose CHIPPER (Cheap-Interconnect Partially Permuting Router), a simplified router microarchitecture that eliminates in-router buffers and the crossbar. We introduce three key insights: first, that deflection routing port allocation maps naturally to a permutation network within the router; second, that livelock freedom requires only an implicit token-passing scheme, eliminating expensive age-based priorities; and finally, that flow control can provide correctness in the absence of network backpressure, avoiding deadlock and allowing cache miss buffers (MSHRs) to be used as reassembly buffers. Using multiprogrammed SPEC CPU2006, server, and desktop application workloads and SPLASH-2 multithreaded workloads, we achieve an average 54.9% network power reduction for 13.6% average performance degradation (multipro-grammed) and 73.4% power reduction for 1.9% slowdown (multithreaded), with minimal degradation and large power savings at low-to-medium load. Finally, we show 36.2% router area reduction relative to buffered routing, with comparable timing.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127545125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 224

CloudCache: Expanding and shrinking private caches CloudCache:扩展和收缩私有缓存

2011 IEEE 17th International Symposium on High Performance Computer Architecture Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749731

Hyunjin Lee, Sangyeun Cho, B. Childers

引用次数: 80

Relaxing non-volatility for fast and energy-efficient STT-RAM caches 为快速和节能的STT-RAM缓存放松非易失性

2011 IEEE 17th International Symposium on High Performance Computer Architecture Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749716

IV ClintonWillsSmullen, Vidyabhushan Mohan, Anurag Nigam, S. Gurumurthi, M. Stan

引用次数: 411

Power shifting in Thrifty Interconnection Network 节约型互联网络中的权力转移

2011 IEEE 17th International Symposium on High Performance Computer Architecture Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749725

Jian Li, Wei Huang, C. Lefurgy, Lixin Zhang, W. Denzel, Richard R. Treumann, Kun Wang

{"title":"Power shifting in Thrifty Interconnection Network","authors":"Jian Li, Wei Huang, C. Lefurgy, Lixin Zhang, W. Denzel, Richard R. Treumann, Kun Wang","doi":"10.1109/HPCA.2011.5749725","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749725","url":null,"abstract":"This paper presents two complementary techniques to manage the power consumption of large-scale systems with a packet-switched interconnection network. First, we propose Thrifty Interconnection Network (TIN), where the network links are activated and de-activated dynamically with little or no overhead by using inherent system events to timely trigger link activation or de-activation. Second, we propose Network Power Shifting (NPS) that dynamically shifts the power budget between the compute nodes and their corresponding network components. TIN activates and trains the links in the interconnection network, just-in-time before the network communication is about to happen, and thriftily puts them into a low-power mode when communication is finished, hence reducing unnecessary network power consumption. Furthermore, the compute nodes can absorb the extra power budget shifted from its attached network components and increase their processor frequency for higher performance with NPS. Our simulation results on a set of real-world workload traces show that TIN can achieve on average 60% network power reduction, with the support of only one low-power mode. When NPS is enabled, the two together can achieve 12% application performance improvement and 13% overall system energy reduction. Further performance improvement is possible if the compute nodes can speed up more and fully utilize the extra power budget reinvested from the thrifty network with more aggressive cooling support.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133993379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 34

MOPED: Orchestrating interprocess message data on CMPs 在cmp上编排进程间消息数据

2011 IEEE 17th International Symposium on High Performance Computer Architecture Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749721

Junli Gu, S. Lumetta, Rakesh Kumar, Yihe Sun

{"title":"MOPED: Orchestrating interprocess message data on CMPs","authors":"Junli Gu, S. Lumetta, Rakesh Kumar, Yihe Sun","doi":"10.1109/HPCA.2011.5749721","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749721","url":null,"abstract":"Future CMPs will combine many simple cores with deep cache hierarchies. With more cores, cache resources per core are fewer, and must be shared carefully to avoid poor utilization due to conflicts and pollution. Explicit motion of data in these architectures, such as message passing, can provide hints about program behavior that can be used to hide latency and improve cache behavior. However, to make these models attractive, synchronization overhead and data copying must also be offloaded from the processors. In this paper, we describe a Message Orchestration and Performance Enhancement Device (MOPED) that provides hardware mechanisms to support state-of-the-art message passing protocols such as MPI. MOPED extends the per-processor cache controllers and coherence protocol to support message synchronization and management in hardware, to transfer message data efficiently without intermediate buffer copies, and to place useful data in caches in a timely manner. MOPED thus allows full overlap between communication and computation on the cores. We extended a 16-core full-system simulator based on Simics and FeS2. MOPED interacts with the directory controllers to orchestrate message data. We evaluated benefits to performance and coherence traffic by integrating MOPED into the MPICH runtime. Relative to unmodified MPI execution, MOPED reduces execution time of real applications (NAS Parallel Benchmarks) by 17–45% and of communication microbenchmarks (Intel's IMB) by 76–94%. Off-chip memory misses are reduced by 43–88% for applications and by 75–100% for microbenchmarks.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134314487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Keynote address II: How's the parallel computing revolution going? 主题演讲二:并行计算革命进展如何?

2011 IEEE 17th International Symposium on High Performance Computer Architecture Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749730

K. McKinley

引用次数: 0

Abstraction and microarchitecture scaling in early-stage power modeling 早期功率建模中的抽象和微架构缩放

2011 IEEE 17th International Symposium on High Performance Computer Architecture Pub Date : 2011-02-12 DOI: 10.1109/HPCA.2011.5749746

H. Jacobson, A. Buyuktosunoglu, P. Bose, E. Acar, R. Eickemeyer

引用次数: 39