IEEE International Symposium on High-Performance Comp Architecture: Latest Publications

Network within a network approach to create a scalable high-radix router microarchitecture

IEEE International Symposium on High-Performance Comp Architecture Pub Date : 2012-02-25 DOI: 10.1109/HPCA.2012.6169048

Jung Ho Ahn, Sungwoo Choo, John Kim

Abstract: Cost-efficient networks are critical in creating scalable large-scale systems, including those found in supercomputers and datacenters. High-radix routers reduce network cost by lowering the network diameter while providing a high bisection bandwidth and path diversity. However, as the port count increases, the high-radix router microarchitecture needs to scale efficiently. Hierarchical crossbar organization has been proposed where a single large crossbar is partitioned into many small crossbars and overcomes the limitations of conventional switch microarchitecture. Although the organization provides high performance, its scalability is limited due to power and area overheads by the wires and intermediate buffers. We propose alternative scalable router microarchitectures that leverage a network within the switch design of the high-radix routers themselves. These designs lower the wiring complexity and buffer requirements. For example, when a folded-Clos switch is used instead of the hierarchical crossbar switch for a radix-64 router, it provides up to 73%, 58%, and 87% reduction in area, energy-delay product, and energy-delay-area product, respectively. We also explore more efficient switch designs by exploiting the traffic-pattern characteristics of the global network and its impact on the local network design within the switch. In particular, we propose a bilateral butterfly switch organization that has fewer crossbars and half the number of global wires compared to the topology-agnostic folded-Clos switch while achieving better low-load latency and equivalent saturation throughput.

Citations: 23

Power balanced pipelines

IEEE International Symposium on High-Performance Comp Architecture Pub Date : 2012-02-25 DOI: 10.1109/HPCA.2012.6169032

J. Sartori, Ben Ahrens, Rakesh Kumar

Abstract: Since the onset of pipelined processors, balancing the delay of the microarchitectural pipeline stages such that each microarchitectural pipeline stage has an equal delay has been a primary design objective, as it maximizes instruction throughput. Unfortunately, this causes significant energy inefficiency in processors, as each microarchitectural pipeline stage gets the same amount of time to complete, irrespective of its size or complexity. For power-optimized processors, the inefficiency manifests itself as a significant imbalance in power consumption of different microarchitectural pipestages. In this paper, rather than balancing processor pipelines for delay, we propose the concept of power balanced pipelines, i.e., processor pipelines in which different delays are assigned to different microarchitectural pipestages to reduce the power disparity between the stages while guaranteeing the same processor frequency/performance. A specific implementation of the concept uses cycle time stealing [19] to deliberately redistribute cycle time from low-power pipeline stages to power-hungry stages, relaxing their timing constraints and allowing them to operate at reduced voltages or use smaller, less leaky cells. We present several static and dynamic techniques for power balancing and demonstrate that balancing pipeline power rather than delay can result in 46% processor power reduction with no loss in processor throughput for a full FabScalar processor over a power-optimized baseline. Benefits are comparable over a FabScalar baseline where static cycle time stealing is used to optimize achieved frequency. Power savings increase at lower operating frequencies. To the best of our knowledge, this is the first such work on microarchitecture-level power reduction that guarantees the same performance.

Citations: 25
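
The cycle-time redistribution described in the "Power balanced pipelines" abstract can be illustrated with a toy model. This is only a sketch under loud assumptions that are not taken from the paper: each stage's power is modeled as `k / t**2`, where `t` is the time the stage is given and `k` is a hypothetical per-stage constant, and throughput is treated as preserved as long as the stage times still sum to the number of stages times the clock period. Under that model, minimizing total power gives each stage a time share proportional to the cube root of its `k`.

```python
def delay_balanced(k, period):
    """Conventional design: every stage gets exactly one clock period."""
    return [period] * len(k)


def power_balanced(k, period):
    """Redistribute time so power-hungry stages get more of it.

    Assumption (hypothetical model): stage power = k_i / t_i**2 and the
    total time budget len(k) * period must be preserved. Minimizing the
    summed power under that constraint gives t_i proportional to k_i**(1/3).
    """
    budget = period * len(k)
    weights = [ki ** (1.0 / 3.0) for ki in k]
    total = sum(weights)
    return [budget * w / total for w in weights]


def stage_power(k, times):
    """Per-stage power under the toy model k_i / t_i**2."""
    return [ki / (t * t) for ki, t in zip(k, times)]


# Hypothetical three-stage pipeline with very uneven power constants.
k = [1.0, 8.0, 27.0]
period = 1.0

base_power = stage_power(k, delay_balanced(k, period))
balanced_times = power_balanced(k, period)
balanced_power = stage_power(k, balanced_times)

print("delay-balanced total power:", sum(base_power))
print("power-balanced stage times:", balanced_times)
print("power-balanced total power:", sum(balanced_power))
```

In this toy setting the low-power stage gives up time to the two hungrier stages, the total time budget is unchanged, and both the total power and the gap between the hungriest and the leanest stage shrink. Real time stealing is further constrained by pipeline loops and by the available voltage and cell-size choices, which this sketch ignores.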

Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs

IEEE International Symposium on High-Performance Comp Architecture Pub Date : 2012-02-25 DOI: 10.1109/HPCA.2012.6169036

Karthik T. Sundararajan, Vasileios Porpodas, Timothy M. Jones, N. Topham, Björn Franke

Abstract: Intelligently partitioning the last-level cache within a chip multiprocessor can bring significant performance improvements. Resources are given to the applications that can benefit most from them, restricting each core to a number of logical cache ways. However, although overall performance is increased, existing schemes fail to consider energy saving when making their partitioning decisions. This paper presents Cooperative Partitioning, a runtime partitioning scheme that reduces both dynamic and static energy while maintaining high performance. It works by enforcing cached data to be way-aligned, so that a way is owned by a single core at any time. Cores cooperate with each other to migrate ways between themselves after partitioning decisions have been made. Upon access to the cache, a core needs only to consult the ways that it owns to find its data, saving dynamic energy. Unused ways can be power-gated for static energy saving. We evaluate our approach on two-core and four-core systems, showing that we obtain average dynamic and static energy savings of 35% and 25% compared to a fixed partitioning scheme. In addition, Cooperative Partitioning maintains high performance while transferring ways five times faster than an existing state-of-the-art technique.

Citations: 62
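
The way-aligned lookup described in the "Cooperative partitioning" abstract can be sketched for a single cache set. This is a minimal illustration, not the paper's design: the class `WayPartitionedSet`, its method names, and the choice to drop a way's contents when ownership changes are all hypothetical simplifications. The number of ways probed stands in for dynamic energy, and unowned ways stand in for candidates for power gating.

```python
class WayPartitionedSet:
    """One cache set in which each way is owned by at most one core."""

    def __init__(self, num_ways):
        self.tags = [None] * num_ways    # tag held in each way (None = empty)
        self.owner = [None] * num_ways   # owning core id (None = unowned)

    def assign(self, way, core):
        """Give a way to a core; contents are dropped (a simplification)."""
        self.owner[way] = core
        self.tags[way] = None

    def _owned(self, core):
        return [w for w, o in enumerate(self.owner) if o == core]

    def fill(self, core, tag):
        """Install a tag into an empty owned way, else overwrite the first one."""
        owned = self._owned(core)
        if not owned:
            raise ValueError("core owns no ways in this set")
        victim = next((w for w in owned if self.tags[w] is None), owned[0])
        self.tags[victim] = tag

    def lookup(self, core, tag):
        """Probe only the ways owned by `core`; return (hit, ways_probed)."""
        owned = self._owned(core)
        hit = any(self.tags[w] == tag for w in owned)
        return hit, len(owned)

    def gated_ways(self):
        """Ways owned by nobody, which could be power-gated."""
        return [w for w, o in enumerate(self.owner) if o is None]


cache_set = WayPartitionedSet(num_ways=8)
for way in (0, 1, 2):
    cache_set.assign(way, core=0)
for way in (3, 4):
    cache_set.assign(way, core=1)

cache_set.fill(core=0, tag=0xA)
print(cache_set.lookup(core=0, tag=0xA))  # (True, 3)
print(cache_set.lookup(core=1, tag=0xA))  # (False, 2)
print(cache_set.gated_ways())             # [5, 6, 7]
```

Core 0 probes three ways and core 1 probes two, rather than all eight, which is the source of the dynamic-energy saving the abstract describes.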

Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips

IEEE International Symposium on High-Performance Comp Architecture Pub Date : 2012-02-25 DOI: 10.1109/HPCA.2012.6168942

Timothy N. Miller, Xiang Pan, Renji Thomas, N. Sedaghati, R. Teodorescu

Abstract: Lowering supply voltage is one of the most effective techniques for reducing microprocessor power consumption. Unfortunately, at low voltages, chips are very sensitive to process variation, which can lead to large differences in the maximum frequency achieved by individual cores. This paper presents Booster, a simple, low-overhead framework for dynamically rebalancing performance heterogeneity caused by process variation and application imbalance. The Booster CMP includes two power supply rails set at two very low but different voltages. Each core can be dynamically assigned to either of the two rails using a gating circuit. This allows cores to quickly switch between two different frequencies. An on-chip governor controls the timing of the switching and the time spent on each rail. The governor manages a “boost budget” that dictates how many cores can be sped up (depending on the power constraints) at any given time. We present two implementations of Booster: Booster VAR, which virtually eliminates the effects of core-to-core frequency variation in near-threshold CMPs, and Booster SYNC, which additionally reduces the effects of imbalance in multithreaded applications. Evaluation using PARSEC and SPLASH2 benchmarks running on a simulated 32-core system shows an average performance improvement of 11% for Booster VAR and 23% for Booster SYNC.

Citations: 76
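
The two-rail idea in the "Booster" abstract lends itself to a small worked calculation. The sketch below assumes, without support from the abstract, that a core's effective frequency is the time-weighted average of its frequencies on the two rails, that switching overhead is negligible, and that the boost budget can be read as the average number of cores on the high rail. The function names and the example frequencies are hypothetical.

```python
def high_rail_fraction(f_low, f_high, f_target):
    """Fraction of time a core must spend on the high rail to average f_target."""
    if f_high <= f_low:
        raise ValueError("high-rail frequency must exceed low-rail frequency")
    x = (f_target - f_low) / (f_high - f_low)
    return min(1.0, max(0.0, x))  # clamp to a valid duty cycle


def effective_frequency(f_low, f_high, fraction):
    """Time-weighted average frequency (assumes free rail switching)."""
    return fraction * f_high + (1.0 - fraction) * f_low


def plan(cores, f_target, boost_budget):
    """Return per-core high-rail fractions, or None if the budget is exceeded.

    cores: list of (f_low, f_high) pairs in MHz, one per core.
    boost_budget: assumed to mean the average number of cores on the high rail.
    """
    fractions = [high_rail_fraction(lo, hi, f_target) for lo, hi in cores]
    if sum(fractions) > boost_budget:
        return None
    return fractions


# Three hypothetical cores whose speeds differ because of process variation.
cores = [(400.0, 600.0), (450.0, 650.0), (500.0, 700.0)]

fractions = plan(cores, f_target=500.0, boost_budget=1.0)
print(fractions)  # [0.5, 0.25, 0.0]
print([effective_frequency(lo, hi, x) for (lo, hi), x in zip(cores, fractions)])
print(plan(cores, f_target=500.0, boost_budget=0.5))  # None
```

The slowest core spends the most time on the high rail and the fastest core spends none, so all three average the same 500 MHz. Halving the budget makes the same target infeasible in this model.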

Architectural perspectives of future wireless base stations based on the IBM PowerEN™ processor

IEEE International Symposium on High-Performance Comp Architecture Pub Date : 2012-02-25 DOI: 10.1109/HPCA.2012.6169045

Augusto J. Vega, P. Bose, A. Buyuktosunoglu, J. Derby, M. Franceschini, C. Johnson, R. Montoye

Abstract: In wireless networks, base stations are responsible for operating on large amounts of traffic at high speed rates. With the advent of new standards, such as 4G, further pressure is put on the hardware requirements to satisfy speeds of up to 1 Gbps. In this work, we study the applicability and potential benefits of the IBM PowerEN processor (a multi-core, massively multithreaded platform) in the realm of base stations for the 3G and 4G standards. The approach involves exploiting the throughput computation capabilities of the PowerEN processor, replacing the bus-attached special-function accelerators with a layer of in-line universal acceleration support, incorporated within the cores. A key feature of this in-line accelerator is a bank-based very-large register file, with embedded SIMD support. This processor-in-regfile (PIR) strategy is implemented as local computation elements (LCEs) attached to each bank, overcoming the limited number of register file ports. Because each LCE is a SIMD computation element, and all of them can proceed concurrently, the PIR approach constitutes a highly-parallel super-wide-SIMD device. To target a broad spectrum of applications for base stations, we also consider a PIR-based architecture built upon reconfigurable LCEs. In this paper, we evaluate the in-line universal accelerator and the PIR strategy focusing on two specific applications for base stations: FFT and Turbo Decoding.

Citations: 3

Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip

IEEE International Symposium on High-Performance Comp Architecture Pub Date : 2012-02-25 DOI: 10.1109/HPCA.2012.6169049

Sheng Ma, Natalie D. Enright Jerger, Zhiying Wang

Abstract: Routing algorithms for networks-on-chip (NoCs) typically only have a small number of virtual channels (VCs) at their disposal. Limited VCs pose several challenges to the design of fully adaptive routing algorithms. First, fully adaptive routing algorithms based on previous deadlock-avoidance theories require a conservative VC re-allocation scheme: a VC can only be re-allocated when it is empty, which limits performance. We propose a novel VC re-allocation scheme, whole packet forwarding (WPF), which allows a non-empty VC to be re-allocated. WPF leverages the observation that the majority of packets in NoCs are short. We prove that WPF does not induce deadlock if the routing algorithm is deadlock-free using conservative VC re-allocation. WPF is an important extension of previous deadlock-avoidance theories. Second, to efficiently utilize WPF in VC-limited networks, we design a novel fully adaptive routing algorithm which maintains packet adaptivity without significant hardware cost. Compared with conservative VC re-allocation, WPF achieves an average 88.9% saturation throughput improvement in synthetic traffic patterns and an average 21.3% and maximal 37.8% speedup for PARSEC applications with heavy network loads. Our design also offers higher performance than several partially adaptive and deterministic routing algorithms.

Citations: 89
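
The difference between the two VC re-allocation policies in the "Whole packet forwarding" abstract can be sketched as two predicates. The conservative rule is stated in the abstract. The WPF rule shown here, that a non-empty VC may be re-allocated when its free slots can hold the entire incoming packet, is an assumption about how the scheme exploits short packets and is not spelled out in the abstract. The `VirtualChannel` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VirtualChannel:
    depth: int         # total buffer slots (flits) in this VC
    occupied: int = 0  # slots currently holding flits

    @property
    def free(self):
        return self.depth - self.occupied


def can_allocate_conservative(vc, packet_len):
    """Conservative re-allocation: only an empty VC may be handed out."""
    return vc.occupied == 0


def can_allocate_wpf(vc, packet_len):
    """Assumed WPF rule: an empty VC, or one with room for the whole packet."""
    return vc.occupied == 0 or vc.free >= packet_len


partly_full = VirtualChannel(depth=5, occupied=2)
empty = VirtualChannel(depth=5)

for name, vc, length in [
    ("short packet, non-empty VC", partly_full, 1),
    ("long packet, non-empty VC", partly_full, 5),
    ("long packet, empty VC", empty, 5),
]:
    print(name, can_allocate_conservative(vc, length), can_allocate_wpf(vc, length))
```

Only the first case differs between the two policies: a short packet can enter a partly full VC under the assumed WPF rule, while the conservative rule makes it wait until the VC drains.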

Dynamically heterogeneous cores through 3D resource pooling

IEEE International Symposium on High-Performance Comp Architecture Pub Date : 2012-02-25 DOI: 10.1109/HPCA.2012.6169037

H. Homayoun, Vasileios Kontorinis, A. Shayan, Ta-Wei Lin, D. Tullsen

Citations: 44

Pacman: Tolerating asymmetric data races with unintrusive hardware

IEEE International Symposium on High-Performance Comp Architecture Pub Date : 2012-02-25 DOI: 10.1109/HPCA.2012.6169039

Shanxiang Qi, N. Otsuki, Lois Orosa Nogueira, A. Muzahid, J. Torrellas

Citations: 17

Network congestion avoidance through Speculative Reservation

IEEE International Symposium on High-Performance Comp Architecture Pub Date : 2012-02-25 DOI: 10.1109/HPCA.2012.6169047

Nan Jiang, Daniel U. Becker, George Michelogiannakis, W. Dally

Citations: 58

JETC: Joint energy thermal and cooling management for memory and CPU subsystems in servers

IEEE International Symposium on High-Performance Comp Architecture Pub Date : 2012-02-25 DOI: 10.1109/HPCA.2012.6169035

R. Ayoub, Rajib Nath, T. Simunic

Abstract: In this work we propose a joint energy, thermal and cooling management technique (JETC) that significantly reduces per server cooling and memory energy costs. Our analysis shows that decoupling the optimization of cooling energy of CPU & memory and the optimization of memory energy leads to suboptimal solutions due to thermal dependencies between CPU and memory and non-linearity in cooling energy. This motivates us to develop a holistic solution that integrates the energy, thermal and cooling management to maximize energy savings with negligible performance hit. JETC considers thermal and power states of CPU & memory, thermal coupling between them and fan speed to arrive at energy efficient decisions. It has CPU and memory actuators to implement its decisions. The memory actuator reduces the energy of memory by performing cooling aware clustering of memory pages to a subset of memory modules. The CPU actuator saves cooling energy by reducing the hot spots between and within the CPU sockets and minimizing the effects of thermal coupling. Our experimental results show that employing JETC results in 50.7% average energy reduction in cooling and memory subsystems with less than 0.3% performance overhead.

Citations: 33