{"title":"Network within a network approach to create a scalable high-radix router microarchitecture","authors":"Jung Ho Ahn, Sungwoo Choo, John Kim","doi":"10.1109/HPCA.2012.6169048","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169048","url":null,"abstract":"Cost-efficient networks are critical in creating scalable large-scale systems, including those found in supercomputers and datacenters. High-radix routers reduce network cost by lowering the network diameter while providing a high bisection bandwidth and path diversity. However, as the port count increases, the high-radix router microarchitecture needs to scale efficiently. Hierarchical crossbar organization has been proposed where a single large crossbar is partitioned into many small crossbars and overcomes the limitations of conventional switch microarchitecture. Although the organization provides high performance, its scalability is limited due to power and area overheads by the wires and intermediate buffers. We propose alternative scalable router microarchitectures that leverage a network within the switch design of the high-radix routers themselves. These designs lower the wiring complexity and buffer requirements. For example, when a folded-Clos switch is used instead of the hierarchical crossbar switch for a radix-64 router, it provides up to 73%, 58%, and 87% reduction in area, energy-delay product, and energy-delay-area product, respectively. We also explore more efficient switch designs by exploiting the traffic-pattern characteristics of the global network and its impact on the local network design within the switch. In particular, we propose a bilateral butterfly switch organization that has fewer crossbars and half the number of global wires compared to the topology-agnostic folded-Clos switch while achieving better low-load latency and equivalent saturation throughput.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121237441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power balanced pipelines","authors":"J. Sartori, Ben Ahrens, Rakesh Kumar","doi":"10.1109/HPCA.2012.6169032","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169032","url":null,"abstract":"Since the onset of pipelined processors, balancing the delay of the microarchitectural pipeline stages such that each microarchitectural pipeline stage has an equal delay has been a primary design objective, as it maximizes instruction throughput. Unfortunately, this causes significant energy inefficiency in processors, as each microarchitectural pipeline stage gets the same amount of time to complete, irrespective of its size or complexity. For power-optimized processors, the inefficiency manifests itself as a significant imbalance in power consumption of different microarchitectural pipestages. In this paper, rather than balancing processor pipelines for delay, we propose the concept of power balanced pipelines - i.e., processor pipelines in which different delays are assigned to different microarchitectural pipestages to reduce the power disparity between the stages while guaranteeing the same processor frequency/performance. A specific implementation of the concept uses cycle time stealing [19] to deliberately redistribute cycle time from low-power pipeline stages to power-hungry stages, relaxing their timing constraints and allowing them to operate at reduced voltages or use smaller, less leaky cells. We present several static and dynamic techniques for power balancing and demonstrate that balancing pipeline power rather than delay can result in 46% processor power reduction with no loss in processor throughput for a full FabScalar processor over a power-optimized baseline. Benefits are comparable over a Fabscalar baseline where static cycle time stealing is used to optimize achieved frequency. Power savings increase at lower operating frequencies. To the best of our knowledge, this is the first such work on microarchitecture-level power reduction that guarantees the same performance.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"4 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114020324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Karthik T. Sundararajan, Vasileios Porpodas, Timothy M. Jones, N. Topham, Björn Franke
{"title":"Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs","authors":"Karthik T. Sundararajan, Vasileios Porpodas, Timothy M. Jones, N. Topham, Björn Franke","doi":"10.1109/HPCA.2012.6169036","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169036","url":null,"abstract":"Intelligently partitioning the last-level cache within a chip multiprocessor can bring significant performance improvements. Resources are given to the applications that can benefit most from them, restricting each core to a number of logical cache ways. However, although overall performance is increased, existing schemes fail to consider energy saving when making their partitioning decisions. This paper presents Cooperative Partitioning, a runtime partitioning scheme that reduces both dynamic and static energy while maintaining high performance. It works by enforcing cached data to be way-aligned, so that a way is owned by a single core at any time. Cores cooperate with each other to migrate ways between themselves after partitioning decisions have been made. Upon access to the cache, a core needs only to consult the ways that it owns to find its data, saving dynamic energy. Unused ways can be power-gated for static energy saving. We evaluate our approach on two-core and four-core systems, showing that we obtain average dynamic and static energy savings of 35% and 25% compared to a fixed partitioning scheme. In addition, Cooperative Partitioning maintains high performance while transferring ways five times faster than an existing state-of-the-art technique.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127940090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Timothy N. Miller, Xiang Pan, Renji Thomas, N. Sedaghati, R. Teodorescu
{"title":"Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips","authors":"Timothy N. Miller, Xiang Pan, Renji Thomas, N. Sedaghati, R. Teodorescu","doi":"10.1109/HPCA.2012.6168942","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6168942","url":null,"abstract":"Lowering supply voltage is one of the most effective techniques for reducing microprocessor power consumption. Unfortunately, at low voltages, chips are very sensitive to process variation, which can lead to large differences in the maximum frequency achieved by individual cores. This paper presents Booster, a simple, low-overhead framework for dynamically rebalancing performance heterogeneity caused by process variation and application imbalance. The Booster CMP includes two power supply rails set at two very low but different voltages. Each core can be dynamically assigned to either of the two rails using a gating circuit. This allows cores to quickly switch between two different frequencies. An on-chip governor controls the timing of the switching and the time spent on each rail. The governor manages a “boost budget” that dictates how many cores can be sped up (depending on the power constraints) at any given time. We present two implementations of Booster: Booster VAR, which virtually eliminates the effects of core-to-core frequency variation in near-threshold CMPs, and Booster SYNC, which additionally reduces the effects of imbalance in multithreaded applications. Evaluation using PARSEC and SPLASH2 benchmarks running on a simulated 32-core system shows an average performance improvement of 11% for Booster VAR and 23% for Booster SYNC.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130033693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Augusto J. Vega, P. Bose, A. Buyuktosunoglu, J. Derby, M. Franceschini, C. Johnson, R. Montoye
{"title":"Architectural perspectives of future wireless base stations based on the IBM PowerEN™ processor","authors":"Augusto J. Vega, P. Bose, A. Buyuktosunoglu, J. Derby, M. Franceschini, C. Johnson, R. Montoye","doi":"10.1109/HPCA.2012.6169045","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169045","url":null,"abstract":"In wireless networks, base stations are responsible for operating on large amounts of traffic at high speed rates. With the advent of new standards, as 4G, further pressure is put in the hardware requirements to satisfy speeds of up to 1 Gbps. In this work, we study the applicability and potential benefits of the IBM PowerEN processor (a multi-core, massively multithreaded platform) in the realm of base stations for the 3G and 4G standards. The approach involves exploiting the throughput computation capabilities of the PowerEN processor, replacing the bus-attached special-function accelerators with a layer of in-line universal acceleration support, incorporated within the cores. A key feature of this in-line accelerator is a bank-based very-large register file, with embedded SIMD support. This processor-in-regfile (PIR) strategy is implemented as local computation elements (LCEs) attached to each bank, overcoming the limited number of register file ports. Because each LCE is a SIMD computation element, and all of them can proceed concurrently, the PIR approach constitutes a highly-parallel super-wide-SIMD device. To target a broad spectrum of applications for base stations, we also consider a PIR-based architecture built upon reconfigurable LCEs. In this paper, we evaluate the in-line universal accelerator and the PIR strategy focusing on two specific applications for base stations: FFT and Turbo Decoding.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134162961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip","authors":"Sheng Ma, Natalie D. Enright Jerger, Zhiying Wang","doi":"10.1109/HPCA.2012.6169049","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169049","url":null,"abstract":"Routing algorithms for networks-on-chip (NoCs) typically only have a small number of virtual channels (VCs) at their disposal. Limited VCs pose several challenges to the design of fully adaptive routing algorithms. First, fully adaptive routing algorithms based on previous deadlock-avoidance theories require a conservative VC re-allocation scheme: a VC can only be re-allocated when it is empty, which limits performance. We propose a novel VC re-allocation scheme, whole packet forwarding (WPF), which allows a non-empty VC to be re-allocated. WPF leverages the observation that the majority of packets in NoCs are short. We prove that WPF does not induce deadlock if the routing algorithm is deadlock-free using conservative VC re-allocation. WPF is an important extension of previous deadlock-avoidance theories. Second, to efficiently utilize WPF in VC-limited networks, we design a novel fully adaptive routing algorithm which maintains packet adaptivity without significant hardware cost. Compared with conservative VC re-allocation, WPF achieves an average 88.9% saturation throughput improvement in synthetic traffic patterns and an average 21.3% and maximal 37.8% speedup for PARSEC applications with heavy network loads. Our design also offers higher performance than several partially adaptive and deterministic routing algorithms.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133365980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Homayoun, Vasileios Kontorinis, A. Shayan, Ta-Wei Lin, D. Tullsen
{"title":"Dynamically heterogeneous cores through 3D resource pooling","authors":"H. Homayoun, Vasileios Kontorinis, A. Shayan, Ta-Wei Lin, D. Tullsen","doi":"10.1109/HPCA.2012.6169037","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169037","url":null,"abstract":"This paper describes an architecture for a dynamically heterogeneous processor architecture leveraging 3D stacking technology. Unlike prior work in the 2D plane, the extra dimension makes it possible to share resources at a fine granularity between vertically stacked cores. As a result, each core can grow or shrink resources, as needed by the code running on the core. This architecture, therefore, enables runtime customization of cores at a fine granularity and enables efficient execution at both high and low levels of thread parallelism. This architecture achieves performance gains from 9-41%, depending on the number of executing threads, and gains significant advantage in energy efficiency of up to 43%.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127068135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shanxiang Qi, N. Otsuki, Lois Orosa Nogueira, A. Muzahid, J. Torrellas
{"title":"Pacman: Tolerating asymmetric data races with unintrusive hardware","authors":"Shanxiang Qi, N. Otsuki, Lois Orosa Nogueira, A. Muzahid, J. Torrellas","doi":"10.1109/HPCA.2012.6169039","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169039","url":null,"abstract":"Data races are a major contributor to parallel software unreliability. A type of race that is both common and typically harmful is the Asymmetric data race. It occurs when at least one of the racing threads is inside a critical section. Current proposals that target them are software-based. They slow down execution and require significant compiler, operating system (OS), or application changes. This paper proposes the first scheme to tolerate asymmetric data races in production runs with negligible execution overhead. The scheme, called Pacman, exploits cache coherence hardware to temporarily protect the variables that a thread accesses in a critical section from other threads' requests. Unlike previous schemes, Pacman induces negligible slowdown, needs no support from the compiler or (in the baseline design) from the OS, and requires no application source code changes. In addition, its hardware is relatively unintrusive. We test Pacman with the SPLASH-2, PARSEC, Sphinx 3, and Apache codes, and discover two unreported asymmetric data races.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125349813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nan Jiang, Daniel U. Becker, George Michelogiannakis, W. Dally
{"title":"Network congestion avoidance through Speculative Reservation","authors":"Nan Jiang, Daniel U. Becker, George Michelogiannakis, W. Dally","doi":"10.1109/HPCA.2012.6169047","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169047","url":null,"abstract":"Congestion caused by hot-spot traffic can significantly degrade the performance of a computer network. In this study, we present the Speculative Reservation Protocol (SRP), a new network congestion control mechanism that relieves the effect of hot-spot traffic in high bandwidth, low latency, lossless computer networks. Compared to existing congestion control approaches like Explicit Congestion Notification (ECN), which react to network congestion through packet marking and rate throttling, SRP takes a proactive approach of congestion avoidance. Using a light-weight endpoint reservation scheme and speculative packet transmission, SRP avoids hot-spot congestion while incurring minimal overhead. Our simulation results show that SRP responds more rapidly to the onset of severe hot-spots than ECN and has a higher network throughput on bursty network traffic. SRP also performs comparably to networks without congestion control on benign traffic patterns by reducing the latency and throughput overhead commonly associated with reservation protocols.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121743539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"JETC: Joint energy thermal and cooling management for memory and CPU subsystems in servers","authors":"R. Ayoub, Rajib Nath, T. Simunic","doi":"10.1109/HPCA.2012.6169035","DOIUrl":"https://doi.org/10.1109/HPCA.2012.6169035","url":null,"abstract":"In this work we propose a joint energy, thermal and cooling management technique (JETC) that significantly reduces per server cooling and memory energy costs. Our analysis shows that decoupling the optimization of cooling energy of CPU & memory and the optimization of memory energy leads to suboptimal solutions due to thermal dependencies between CPU and memory and non-linearity in cooling energy. This motivates us to develop a holistic solution that integrates the energy, thermal and cooling management to maximize energy savings with negligible performance hit. JETC considers thermal and power states of CPU & memory, thermal coupling between them and fan speed to arrive at energy efficient decisions. It has CPU and memory actuators to implement its decisions. The memory actuator reduces the energy of memory by performing cooling aware clustering of memory pages to a subset of memory modules. The CPU actuator saves cooling energy by reducing the hot spots between and within the CPU sockets and minimizing the effects of thermal coupling. Our experimental results show that employing JETC results in 50.7% average energy reduction in cooling and memory subsystems with less than 0.3% performance overhead.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130839842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}