Proceedings. International Symposium on Computer Architecture最新文献_第8页

Dynamic performance tuning for speculative threads 推测线程的动态性能调优

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555812

Yangchun Luo, Venkatesan Packirisamy, W. Hsu, Antonia Zhai, Nikhil Mungre, Ankit Tarkas

{"title":"Dynamic performance tuning for speculative threads","authors":"Yangchun Luo, Venkatesan Packirisamy, W. Hsu, Antonia Zhai, Nikhil Mungre, Ankit Tarkas","doi":"10.1145/1555754.1555812","DOIUrl":"https://doi.org/10.1145/1555754.1555812","url":null,"abstract":"In response to the emergence of multicore processors, various novel and sophisticated execution models have been introduced to fully utilize these processors. One such execution model is Thread-Level Speculation (TLS), which allows potentially dependent threads to execute speculatively in parallel. While TLS offers significant performance potential for applications that are otherwise non-parallel, extracting efficient speculative threads in the presence of complex control flow and ambiguous data dependences is a real challenge. This task is further complicated by the fact that the performance of speculative threads is often architecture-dependent, input-sensitive, and exhibits phase behaviors. Thus we propose dynamic performance tuning mechanisms that determine where and how to create speculative threads at runtime.\u0000 This paper describes the design, implementation, and evaluation of hardware and software support that takes advantage of runtime performance profiles to extract efficient speculative threads. In our proposed framework, speculative threads are monitored by hardware-based performance counters and their performance impact is estimated. The creation of speculative threads is adjusted based on the estimation. This paper proposes speculative threads performance estimation techniques, that are capable of correctly determining whether speculation can improve performance for loops that corresponds to 83.8% of total loop execution time across all benchmarks. This paper also examines several dynamic performance tuning policies and finds that the best tuning policy achieves an overall speedup of 36.8%on a set of benchmarks from SPEC2000 suite, which outperforms static thread management by 9.5%.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"8 1","pages":"462-473"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85427150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 38

Thread motion: fine-grained power management for multi-core systems 线程运动:多核系统的细粒度电源管理

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555793

K. Rangan, Gu-Yeon Wei, D. Brooks

引用次数: 266

Indirect adaptive routing on large scale interconnection networks 大规模互连网络中的间接自适应路由

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555783

Nan Jiang, John Kim, W. Dally

{"title":"Indirect adaptive routing on large scale interconnection networks","authors":"Nan Jiang, John Kim, W. Dally","doi":"10.1145/1555754.1555783","DOIUrl":"https://doi.org/10.1145/1555754.1555783","url":null,"abstract":"Recently proposed high-radix interconnection networks [10] require global adaptive routing to achieve optimum performance. Existing direct adaptive routing methods are slow to sense congestion remote from the source router and hence misroute many packets before such congestion is detected. This paper introduces indirect global adaptive routing (IAR) in which the adaptive routing decision uses information that is not directly available at the source router. We describe four IAR routing methods: credit round trip (CRT) [10], progressive adaptive routing (PAR), piggyback routing (PB), and reservation routing (RES). We evaluate each of these methods on the dragonfly topology under both steady-state and transient loads. Our results show that PB, PAR, and CRT all achieve good performance. PB provides the best absolute performance, with 2-7% lower latency on steady-state uniform random traffic at 70% load, while PAR provides the fastest response on transient loads. We also evaluate the implementation costs of the indirect adaptive routing methods and show that PB has the lowest implementation cost requiring <1% increase in the total storage of a typical high-radix router.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"71 1","pages":"220-231"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87740054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 119

Performance and power of cache-based reconfigurable computing 基于缓存的可重构计算的性能和能力

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555804

Andrew Putnam, S. Eggers, Dave Bennett, E. Dellinger, J. Mason, Henry Styles, P. Sundararajan, Ralph Wittig

{"title":"Performance and power of cache-based reconfigurable computing","authors":"Andrew Putnam, S. Eggers, Dave Bennett, E. Dellinger, J. Mason, Henry Styles, P. Sundararajan, Ralph Wittig","doi":"10.1145/1555754.1555804","DOIUrl":"https://doi.org/10.1145/1555754.1555804","url":null,"abstract":"Many-cache is a memory architecture that efficiently supports caching in commercially available FPGAs. It facilitates FPGA programming for high-performance computing (HPC) developers by providing them with memory performance that is greater and power consumption that is less than their current CPU platforms, but without sacrificing their familiar, C-based programming environment.\u0000 Many-cache creates multiple, multi-banked caches on top of an FGPA's small, independent memories, each targeting a particular data structure or region of memory in an application and each customized for the memory operations that access it. The caches are automatically generated from C source by the CHiMPS C-to-FPGA compiler.\u0000 This paper presents the analyses and optimizations of the CHiMPS compiler that construct many-cache caches. An architectural evaluation of CHiMPS-generated FPGAs demonstrates a performance advantage of 7.8x (geometric mean) over CPU-only execution of the same source code, FPGA power usage that is on average 4.1x less, and consequently performance per watt that is also greater, by a geometric mean of 21.3x.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"197 1","pages":"395-405"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74504682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 33

The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization PC固态硬盘(ssd)的性能与带宽、并发性、设备架构和系统组织有关

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555790

Cagdas Dirik, B. Jacob

{"title":"The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization","authors":"Cagdas Dirik, B. Jacob","doi":"10.1145/1555754.1555790","DOIUrl":"https://doi.org/10.1145/1555754.1555790","url":null,"abstract":"As their prices decline, their storage capacities increase, and their endurance improves, NAND Flash Solid State Disks (SSD) provide an increasingly attractive alternative to Hard Disk Drives (HDD) for portable computing systems and PCs. This paper presents a study of NAND Flash SSD architectures and their management techniques, quantifying SSD performance under user-driven/PC applications in a multi-tasked environment; user activity represents typical PC workloads and includes browsing files and folders, emailing, text editing and document creation, surfing the web, listening to music and playing movies, editing large pictures, and running office applications.\u0000 We find the following: (a) the real limitation to NAND Flash memory performance is not its low per-device bandwidth but its internal core interface; (b) NAND Flash memory media transfer rates do not need to scale up to those of HDDs for good performance; (c) SSD organizations that exploit concurrency at both the system and device level (e.g. RAID-like organizations and Micron-style (superblocks) improve performance significantly; and (d) these system- and device-level concurrency mechanisms are, to a significant degree, orthogonal: that is, the performance increase due to one does not come at the expense of the other, as each exploits a different facet of concurrency exhibited within the PC workload.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"7 1","pages":"279-289"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73107166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 211

Temperature-constrained power control for chip multiprocessors with online model estimation 基于在线模型估计的芯片多处理器温度约束功率控制

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555794

Yefu Wang, Kai Ma, Xiaorui Wang

{"title":"Temperature-constrained power control for chip multiprocessors with online model estimation","authors":"Yefu Wang, Kai Ma, Xiaorui Wang","doi":"10.1145/1555754.1555794","DOIUrl":"https://doi.org/10.1145/1555754.1555794","url":null,"abstract":"As chip multiprocessors (CMP) become the main trend in processor development, various power and thermal management strategies have recently been proposed to optimize system performance while controlling the power or temperature of a CMP chip to stay below a constraint. The availability of per-core DVFS (dynamic voltage and frequency scaling) also makes it possible to develop advanced management strategies. However, most existing solutions rely on open-loop search or optimization with the assumption that power can be estimated accurately, while others adopt oversimplified feedback control strategies to control power and temperature separately, without any theoretical guarantees. In this paper, we propose a chip-level power control algorithm that is systematically designed based on optimal control theory. Our algorithm can precisely control the power of a CMP chip to the desired set point while maintaining the temperature of each core below a specified threshold. Furthermore, an online model estimator is designed to achieve analytical assurance of control accuracy and system stability, even in the face of significant workload variations or unpredictable chip or core variations. Empirical results on a physical testbed show that our controller outperforms two state-of-the-art control algorithms by having better SPEC benchmark performance and more precise power control. In addition, extensive simulation results demonstrate the efficacy of our algorithm for various CMP configurations.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"30 1","pages":"314-324"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85538568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 217

Reactive NUCA: near-optimal block placement and replication in distributed caches 反应NUCA:在分布式缓存中接近最佳的块放置和复制

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555779

N. Hardavellas, M. Ferdman, B. Falsafi, A. Ailamaki

{"title":"Reactive NUCA: near-optimal block placement and replication in distributed caches","authors":"N. Hardavellas, M. Ferdman, B. Falsafi, A. Ailamaki","doi":"10.1145/1555754.1555779","DOIUrl":"https://doi.org/10.1145/1555754.1555779","url":null,"abstract":"Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts.\u0000 In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"3 1","pages":"184-195"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73189765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 419

Boosting single-thread performance in multi-core systems through fine-grain multi-threading 通过细粒度多线程提高多核系统中的单线程性能

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555813

C. Madriles, P. López, J. M. Codina, E. Gibert, Fernando Latorre, Alejandro Martínez, Raúl Martínez, Antonio González

{"title":"Boosting single-thread performance in multi-core systems through fine-grain multi-threading","authors":"C. Madriles, P. López, J. M. Codina, E. Gibert, Fernando Latorre, Alejandro Martínez, Raúl Martínez, Antonio González","doi":"10.1145/1555754.1555813","DOIUrl":"https://doi.org/10.1145/1555754.1555813","url":null,"abstract":"Industry has shifted towards multi-core designs as we have hit the memory and power walls. However, single thread performance remains of paramount importance since some applications have limited thread-level parallelism (TLP), and even a small part with limited TLP impose important constraints to the global performance, as explained by Amdahl's law.\u0000 In this paper we propose a novel approach for leveraging multiple cores to improve single-thread performance in a multi-core design. The proposed technique features a set of novel hardware mechanisms that support the execution of threads generated at compile time. These threads result from a fine-grain speculative decomposition of the original application and they are executed under a modified multi-core system that includes: (1) mechanisms to support multiple versions; (2) mechanisms to detect violations among threads; (3) mechanisms to reconstruct the original sequential order; and (4) mechanisms to checkpoint the architectural state and recovery to handle misspeculations.\u0000 The proposed scheme outperforms previous hardware-only schemes to implement the idea of combining cores for executing single-thread applications in a multi-core design by more than 10% on average on Spec2006 for all configurations. Moreover, single-thread performance is improved by 41% on average when the proposed scheme is used on a Tiny Core, and up to 2.6x for some selected applications.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"12 1","pages":"474-483"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81735261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 28

Achieving predictable performance through better memory controller placement in many-core CMPs 通过在多核cmp中更好的内存控制器位置实现可预测的性能

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555810

D. Abts, Natalie D. Enright Jerger, John Kim, Dan Gibson, Mikko H. Lipasti

引用次数: 161

Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices 去耦DIMM:使用低速DRAM器件构建高带宽存储系统

Proceedings. International Symposium on Computer Architecture Pub Date : 2009-06-15 DOI: 10.1145/1555754.1555788

Hongzhong Zheng, Jiang Lin, Zhao Zhang, Zhichun Zhu

{"title":"Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices","authors":"Hongzhong Zheng, Jiang Lin, Zhao Zhang, Zhichun Zhu","doi":"10.1145/1555754.1555788","DOIUrl":"https://doi.org/10.1145/1555754.1555788","url":null,"abstract":"The widespread use of multicore processors has dramatically increased the demands on high bandwidth and large capacity from memory systems. In a conventional DDR2/DDR3 DRAM memory system, the memory bus and DRAM devices run at the same data rate. To improve memory bandwidth, we propose a new memory system design called decoupled DIMM that allows the memory bus to operate at a data rate much higher than that of the DRAM devices. In the design, a synchronization buffer is added to relay data between the slow DRAM devices and the fast memory bus; and memory access scheduling is revised to avoid access conflicts on memory ranks. The design not only improves memory bandwidth beyond what can be supported by current memory devices, but also improves reliability, power efficiency, and cost effectiveness by using relatively slow memory devices. The idea of decoupling, precisely the decoupling of bandwidth match between memory bus and a single rank of devices, can also be applied to other types of memory systems including FB-DIMM.\u0000 Our experimental results show that a decoupled DIMM system of 2667MT/s bus data rate and 1333MT/s device data rate improves the performance of memory-intensive workloads by 51% on average over a conventional memory system of 1333MT/s data rate. Alternatively, a decoupled DIMM system of 1600MT/s bus data rate and 800MT/s device data rate incurs only 8% performance loss when compared with a conventional system of 1600MT/s data rate, with 16% reduction on the memory power consumption and 9% saving on memory energy.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"1 1","pages":"255-266"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78187217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 61