Proceedings. International Symposium on Computer Architecture: Latest Publications

Ten ways to waste a parallel computer
Proceedings. International Symposium on Computer Architecture. Pub date: 2009-06-15. DOI: 10.1145/1555754.1555755
K. Yelick
{"title":"Ten ways to waste a parallel computer","authors":"K. Yelick","doi":"10.1145/1555754.1555755","DOIUrl":"https://doi.org/10.1145/1555754.1555755","url":null,"abstract":"As clock speed increases taper off and hardware designers struggle to scale parallelism within a chip, software developers and researchers must face the challenge of writing portable software with no clear architectural target. On the hardware side, energy considerations will dominate many of the design decisions, and will ultimately limit what systems and applications can be built. This is especially true at the high end, where the next major milestone of exascale computing will be unattainable without major improvements in efficiency.\u0000 Although hardware designers have long worried about the efficiency of their designs, especially for battery-operated devices, software developers in general have not. To illustrate this point, I will describe some of the top ways to waste time and therefore energy waiting for communication, synchronization, or interactions with users or other systems. Data movement, rather than computation, is the big consumer of energy, yet software often moves data up and down the memory hierarchy or across a network multiple times. At the same time, hardware designers need to take into account the constraints of the computational problems that will run on their systems, as a design that is poorly matched to the computational requirements will end up being inefficient. Drawing on my own experience in scientific computing, I will give examples of how to make the combination of hardware, algorithms and software more efficient, but also describe some of the challenges that are inherent in the application problems we want to solve. The community needs to take an integrated approach to the problem, and consider how much business or science can be done per Joule, rather than optimizing a particular component of the system in isolation. This will require rethinking the algorithms, programming models, and hardware in concert, and therefore an unprecedented level of collaboration and cooperation between hardware and software designers.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"13 1","pages":"1"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85265401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
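The talk's central quantitative claim, that data movement rather than computation dominates energy, can be made concrete with rough arithmetic. Below is a minimal sketch; the per-operation energy figures are illustrative assumptions for the sake of the example, not numbers from the talk.

```cpp
#include <cstdio>

int main() {
    // Hypothetical energy costs, for illustration only (not from the talk):
    const double pj_per_flop      = 10.0;   // one double-precision operation
    const double pj_per_dram_byte = 100.0;  // moving one byte to/from DRAM

    // A kernel doing 2 flops per 8-byte operand fetched from DRAM:
    double flops = 2.0, bytes = 8.0;
    double e_compute = flops * pj_per_flop;       // 20 pJ
    double e_move    = bytes * pj_per_dram_byte;  // 800 pJ
    std::printf("compute: %.0f pJ, data movement: %.0f pJ (%.0fx)\n",
                e_compute, e_move, e_move / e_compute);

    // Reusing the operand from cache 10 times amortizes the movement cost:
    double reuse = 10.0;
    std::printf("with 10x reuse: %.0f pJ moved per flop pair\n",
                e_move / reuse);
    return 0;
}
```

Under these assumptions, moving the data costs 40x the computation it feeds, which is why avoiding redundant trips up and down the memory hierarchy matters more than shaving flops.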
InvisiFence: performance-transparent memory ordering in conventional multiprocessors
Proceedings. International Symposium on Computer Architecture. Pub date: 2009-06-15. DOI: 10.1145/1555754.1555785
Colin Blundell, Milo M. K. Martin, T. Wenisch
{"title":"InvisiFence: performance-transparent memory ordering in conventional multiprocessors","authors":"Colin Blundell, Milo M. K. Martin, T. Wenisch","doi":"10.1145/1555754.1555785","DOIUrl":"https://doi.org/10.1145/1555754.1555785","url":null,"abstract":"A multiprocessor's memory consistency model imposes ordering constraints among loads, stores, atomic operations, and memory fences. Even for consistency models that relax ordering among loads and stores, ordering constraints still induce significant performance penalties due to atomic operations and memory ordering fences. Several prior proposals reduce the performance penalty of strongly ordered models using post-retirement speculation, but these designs either (1) maintain speculative state at a per-store granularity, causing storage requirements to grow proportionally to speculation depth, or (2) employ distributed global commit arbitration using unconventional chunk-based invalidation mechanisms. In this paper we propose InvisiFence, an approach for implementing memory ordering based on post-retirement speculation that avoids these concerns. InvisiFence leverages minimalistic mechanisms for post-retirement speculation proposed in other contexts to (1) track speculative state efficiently at block-granularity with dedicated storage requirements independent of speculation depth, (2) provide fast commit by avoiding explicit commit arbitration, and (3) operate under a conventional invalidation-based cache coherence protocol. InvisiFence supports both modes of operation found in prior work: speculating only when necessary to minimize the risk of rollback-inducing violations or speculating continuously to decouple consistency enforcement from the processor core. Overall, InvisiFence requires approximately one kilobyte of additional state to transform a conventional multiprocessor into one that provides performance-transparent memory ordering, fences, and atomic operations.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"1 1","pages":"233-244"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87983626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 110
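For readers who have not paid the cost of a fence directly, here is a minimal C++ sketch (not from the paper) of the ordering problem InvisiFence targets: under a relaxed model, the store to `data` and the store to `ready` may become visible out of order unless a fence is executed on every handoff, and it is exactly these fence stalls that the paper's speculation makes performance-transparent.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int>  data{0};
std::atomic<bool> ready{false};

void producer() {
    data.store(42, std::memory_order_relaxed);
    // The fence is what conventional hardware makes expensive: the core
    // must stall until its prior stores are globally visible.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_acquire);
    assert(data.load(std::memory_order_relaxed) == 42);  // guaranteed by the fences
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```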
Disaggregated memory for expansion and sharing in blade servers
Proceedings. International Symposium on Computer Architecture. Pub date: 2009-06-15. DOI: 10.1145/1555754.1555789
Kevin T. Lim, Jichuan Chang, T. Mudge, Parthasarathy Ranganathan, S. Reinhardt, T. Wenisch
{"title":"Disaggregated memory for expansion and sharing in blade servers","authors":"Kevin T. Lim, Jichuan Chang, T. Mudge, Parthasarathy Ranganathan, S. Reinhardt, T. Wenisch","doi":"10.1145/1555754.1555789","DOIUrl":"https://doi.org/10.1145/1555754.1555789","url":null,"abstract":"Analysis of technology and application trends reveals a growing imbalance in the peak compute-to-memory-capacity ratio for future servers. At the same time, the fraction contributed by memory systems to total datacenter costs and power consumption during typical usage is increasing. In response to these trends, this paper re-examines traditional compute-memory co-location on a single system and details the design of a new general-purpose architectural building block-a memory blade-that allows memory to be \"disaggregated\" across a system ensemble. This remote memory blade can be used for memory capacity expansion to improve performance and for sharing memory across servers to reduce provisioning and power costs. We use this memory blade building block to propose two new system architecture solutions-(1) page-swapped remote memory at the virtualization layer, and (2) block-access remote memory with support in the coherence hardware-that enable transparent memory expansion and sharing on commodity-based systems. Using simulations of a mix of enterprise benchmarks supplemented with traces from live datacenters, we demonstrate that memory disaggregation can provide substantial performance benefits (on average 10X) in memory constrained environments, while the sharing enabled by our solutions can improve performance-per-dollar by up to 57% when optimizing memory provisioning across multiple servers.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"190 6 1","pages":"267-278"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88565400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 430
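The page-swapped mode can be pictured as local memory acting as a cache of pages whose home is the remote blade. The sketch below models that behavior with an LRU policy; the class and its names are illustrative assumptions, not the paper's hypervisor implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Hypothetical page-swap layer: local memory is an LRU cache of pages whose
// home is the remote memory blade. All names are illustrative.
class PageSwapper {
    const size_t local_capacity_;              // pages that fit locally
    std::list<uint64_t> lru_;                  // front = most recently used
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> local_;
    size_t remote_fetches_ = 0;

public:
    explicit PageSwapper(size_t capacity) : local_capacity_(capacity) {}

    // Called on a guest page access; returns true if it hit local memory.
    bool access(uint64_t page) {
        auto it = local_.find(page);
        if (it != local_.end()) {              // local hit: refresh LRU position
            lru_.splice(lru_.begin(), lru_, it->second);
            return true;
        }
        ++remote_fetches_;                     // miss: fetch page from the blade
        if (local_.size() == local_capacity_) {
            local_.erase(lru_.back());         // evict LRU page back to the blade
            lru_.pop_back();
        }
        lru_.push_front(page);
        local_[page] = lru_.begin();
        return false;
    }
    size_t remote_fetches() const { return remote_fetches_; }
};
```

The design point the paper argues for is that this swap traffic is rare enough, given a reasonably sized local tier, that the capacity and sharing benefits dominate the remote-access penalty.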
Application-aware deadlock-free oblivious routing
Proceedings. International Symposium on Computer Architecture. Pub date: 2009-06-15. DOI: 10.1145/1555754.1555782
M. Kinsy, Myong Hyon Cho, Tina Wen, G. Suh, Marten van Dijk, S. Devadas
{"title":"Application-aware deadlock-free oblivious routing","authors":"M. Kinsy, Myong Hyon Cho, Tina Wen, G. Suh, Marten van Dijk, S. Devadas","doi":"10.1145/1555754.1555782","DOIUrl":"https://doi.org/10.1145/1555754.1555782","url":null,"abstract":"Conventional oblivious routing algorithms are either not application-aware or assume that each flow has its own private channel to ensure deadlock avoidance. We present a framework for application-aware routing that assures deadlock-freedom under one or more channels by forcing routes to conform to an acyclic channel dependence graph. Arbitrary minimal routes can be made deadlock-free through appropriate static channel allocation when two or more channels are available. Given bandwidth estimates for flows, we present a mixed integer-linear programming (MILP) approach and a heuristic approach for producing deadlock-free routes that minimize maximum channel load. The heuristic algorithm is calibrated using the MILP algorithm and evaluated on a number of benchmarks through detailed network simulation. Our framework can be used to produce application-aware routes that target the minimization of latency, number of flows through a link, bandwidth, or any combination thereof.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"189 1","pages":"208-219"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85630169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 85
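The deadlock-freedom condition reduces to checking that the channel dependence graph induced by a candidate route set is acyclic. A minimal sketch of that check via iterative DFS three-coloring, with the adjacency-list encoding of the graph assumed:

```cpp
#include <utility>
#include <vector>

// Returns true iff the channel dependence graph is acyclic, i.e. the route
// set it encodes is deadlock-free. adj[c] lists the channels that channel c
// can wait on while occupied (one edge per consecutive channel pair on a route).
bool is_deadlock_free(const std::vector<std::vector<int>>& adj) {
    enum { WHITE, GRAY, BLACK };
    std::vector<int> color(adj.size(), WHITE);
    for (size_t start = 0; start < adj.size(); ++start) {
        if (color[start] != WHITE) continue;
        std::vector<std::pair<int, size_t>> stack{{(int)start, 0}};
        color[start] = GRAY;
        while (!stack.empty()) {
            auto& [u, i] = stack.back();
            if (i < adj[u].size()) {
                int v = adj[u][i++];
                if (color[v] == GRAY) return false;  // back edge: cycle found
                if (color[v] == WHITE) {
                    color[v] = GRAY;
                    stack.push_back({v, 0});
                }
            } else {
                color[u] = BLACK;                    // all dependences explored
                stack.pop_back();
            }
        }
    }
    return true;
}
```

Both the MILP and the heuristic can be seen as searching for a maximum-channel-load-minimizing route set within the space that this predicate accepts.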
Rigel: an architecture and scalable programming interface for a 1000-core accelerator
Proceedings. International Symposium on Computer Architecture. Pub date: 2009-06-15. DOI: 10.1145/1555754.1555774
J. H. Kelm, Daniel R. Johnson, Matthew R. Johnson, N. Crago, W. Tuohy, Aqeel Mahesri, S. Lumetta, M. Frank, Sanjay J. Patel
{"title":"Rigel: an architecture and scalable programming interface for a 1000-core accelerator","authors":"J. H. Kelm, Daniel R. Johnson, Matthew R. Johnson, N. Crago, W. Tuohy, Aqeel Mahesri, S. Lumetta, M. Frank, Sanjay J. Patel","doi":"10.1145/1555754.1555774","DOIUrl":"https://doi.org/10.1145/1555754.1555774","url":null,"abstract":"This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically-organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model where parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications.\u0000 We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm2 in 45nm, which is comparable to high-end GPUs scaled to 45nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load-balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"63 1","pages":"140-151"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77674115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 160
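A minimal sketch, using C++ threads rather than Rigel's interface, of the task-centric bulk-synchronized style the abstract describes: all workers run the same code (SPMD), grab tasks from a shared counter, and meet at a barrier between phases. All names here are illustrative.

```cpp
#include <atomic>
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int workers = 4, tasks = 64;
    std::atomic<int> next_task{0};     // task distribution: one shared counter
    std::vector<int> result(tasks, 0);
    std::barrier sync(workers);        // end-of-phase bulk synchronization

    auto worker = [&](int id) {
        for (int phase = 0; phase < 2; ++phase) {
            // Grab-and-go task distribution (every worker runs this same loop).
            for (int t; (t = next_task.fetch_add(1)) < tasks; )
                result[t] += t * t;    // stand-in for real task work
            sync.arrive_and_wait();    // barrier between phases
            if (id == 0) next_task = 0;  // one worker resets the task counter
            sync.arrive_and_wait();    // make the reset visible to everyone
        }
    };
    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) pool.emplace_back(worker, i);
    for (auto& th : pool) th.join();
    std::printf("result[10] = %d\n", result[10]);  // 2 phases: 2 * 100 = 200
    return 0;
}
```

The paper's scalability question is essentially how cheaply the `fetch_add` and the barrier in this pattern can be made to behave at 1000 cores.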
Stream chaining: exploiting multiple levels of correlation in data prefetching
Proceedings. International Symposium on Computer Architecture. Pub date: 2009-06-15. DOI: 10.1145/1555754.1555767
Pedro Díaz, Marcelo H. Cintra
{"title":"Stream chaining: exploiting multiple levels of correlation in data prefetching","authors":"Pedro Díaz, Marcelo H. Cintra","doi":"10.1145/1555754.1555767","DOIUrl":"https://doi.org/10.1145/1555754.1555767","url":null,"abstract":"Data prefetching has long been an important technique to amortize the effects of the memory wall, and is likely to remain so in the current era of multi-core systems. Most prefetchers operate by identifying patterns and correlations in the miss address stream. Separating streams according to the memory access instruction that generates the misses is an effective way of filtering out spurious addresses from predictable streams. On the other hand, by localizing streams based on the memory access instructions, such prefetchers both lose the complete time sequence information of misses and can only issue prefetches for a single memory access instruction at a time.\u0000 This paper proposes a novel class of prefetchers based on the idea of linking various localized streams into predictable chains of missing memory access instructions such that the prefetcher can issue prefetches along multiple streams. In this way the prefetcher is not limited to prefetching deeply for a single missing memory access instruction but can instead adaptively prefetch for other memory access instructions closer in time.\u0000 Experimental results show that the proposed prefetcher consistently achieves better performance than a state-of-the-art prefetcher -- 10% on average, being only outperformed in very few cases and then by only 2%, and outperforming that prefetcher by as much as 55% -- while consuming the same amount of memory bandwidth.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"2009 1","pages":"81-92"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78670557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 45
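To make the chaining idea concrete, here is a toy sketch (a naive reading of the concept, with all structures assumed rather than taken from the paper): each missing PC keeps its own localized stride stream, and additionally records which PC tends to miss next, so one miss can trigger prefetches both down its own stream and one hop along the chain.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative stream-chaining prefetcher model (names and structure assumed).
struct Stream {
    uint64_t last_addr = 0;
    int64_t  stride    = 0;
    uint64_t next_pc   = 0;   // chain: PC observed to miss after this one
};

class ChainPrefetcher {
    std::unordered_map<uint64_t, Stream> table_;  // PC-localized streams
    uint64_t prev_pc_ = 0;

public:
    // Called on each cache miss; returns addresses to prefetch.
    std::vector<uint64_t> on_miss(uint64_t pc, uint64_t addr) {
        std::vector<uint64_t> out;
        Stream& s = table_[pc];
        int64_t d = (int64_t)(addr - s.last_addr);
        if (s.last_addr != 0 && d == s.stride)
            out.push_back(addr + s.stride);   // own-stream stride prefetch
        s.stride = d;
        s.last_addr = addr;
        uint64_t chained = s.next_pc;         // copy before table_ may rehash
        if (prev_pc_ != 0 && prev_pc_ != pc)
            table_[prev_pc_].next_pc = pc;    // learn the miss-order chain
        prev_pc_ = pc;
        // Follow the chain one hop: prefetch for the stream likely to miss next.
        auto it = table_.find(chained);
        if (chained != 0 && it != table_.end() && it->second.stride != 0)
            out.push_back(it->second.last_addr + it->second.stride);
        return out;
    }
};
```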
Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors
Proceedings. International Symposium on Computer Architecture. Pub date: 2009-06-15. DOI: 10.1145/1555754.1555786
Andrew D. Hilton, A. Roth
{"title":"Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors","authors":"Andrew D. Hilton, A. Roth","doi":"10.1145/1555754.1555786","DOIUrl":"https://doi.org/10.1145/1555754.1555786","url":null,"abstract":"CPR/CFP (Checkpoint Processing and Recovery/Continual Flow Pipeline) support an adaptive instruction window that scales to tolerate last-level cache misses. CPR/CFP scale the register file by aggressively reclaiming the destination registers of many in-flight instructions. However, an analogous mechanism does not exist for stores and loads. As the window expands, CPR/CFP processors must track all in-flight stores and loads to support forwarding and detect memory ordering violations.\u0000 The previously-described SVW (Store Vulnerability Window) and SQIP (Store Queue Index Prediction) schemes provide scalable, non-associative load and store queues, respectively. However, they don't work smoothly in a CPR/CFP context. SVW/SQIP rely on the ability to dynamically stall some loads until a specific older store writes to the cache. Enforcing this serialization in CPR/CFP is expensive if the load and store are in the same checkpoint.\u0000 We introduce two complementary procedures that implement this serialization efficiently. Decoupled Store Completion (DSC) allows stores to write to the cache before the enclosing checkpoint completes execution. Silent Deterministic Replay (SDR) supports mis-speculation recovery in the presence of DSC by replaying loads older than completed stores using values from the load queue. The combination of DSC and SDR enables an SVW/SQIP based CPR/CFP memory system that outperforms previous designs while occupying less area.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"40 1","pages":"245-254"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73098122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
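The SVW idea underpinning this work can be caricatured in a few lines: a load is vulnerable, and must be checked or replayed, only if a store to the same address wrote the cache after the load issued. A toy sketch with assumed sequence-number bookkeeping, not the paper's hardware:

```cpp
#include <cstdint>
#include <unordered_map>

// Toy model of a Store Vulnerability Window filter; all structures assumed.
// Stores receive a global sequence number (SSN) when they write the cache.
struct SVWFilter {
    std::unordered_map<uint64_t, uint64_t> last_store_ssn;  // addr -> SSN

    void on_store_commit(uint64_t addr, uint64_t ssn) {
        last_store_ssn[addr] = ssn;
    }
    // A load that issued when the committed-store SSN was `load_issue_ssn`
    // needs re-execution only if a same-address store committed after it.
    bool needs_replay(uint64_t addr, uint64_t load_issue_ssn) const {
        auto it = last_store_ssn.find(addr);
        return it != last_store_ssn.end() && it->second > load_issue_ssn;
    }
};
```

DSC and SDR are about making this filter workable when stores may write the cache while their checkpoint is still executing, which is where the replay-from-load-queue mechanism comes in.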
Hardware support for WCET analysis of hard real-time multicore systems
Proceedings. International Symposium on Computer Architecture. Pub date: 2009-06-15. DOI: 10.1145/1555754.1555764
Marco Paolieri, E. Quiñones, F. Cazorla, G. Bernat, M. Valero
{"title":"Hardware support for WCET analysis of hard real-time multicore systems","authors":"Marco Paolieri, E. Quiñones, F. Cazorla, G. Bernat, M. Valero","doi":"10.1145/1555754.1555764","DOIUrl":"https://doi.org/10.1145/1555754.1555764","url":null,"abstract":"The increasing demand for new functionalities in current and future hard real-time embedded systems like automotive, avionics and space industries is driving an increase in the performance required in embedded processors. Multicore processors represent a good design solution for such systems due to their high performance, low cost and power consumption characteristics. However, hard real-time embedded systems require time analyzability and current multicore processors are less analyzable than single-core processors due to the interferences between different tasks when accessing shared hardware resources. In this paper we propose a multicore architecture with shared resources that allows the execution of applications with hard real-time and non hard real-time constraints at the same time, providing time analizability for the hard real-time tasks so that they can meet their deadlines. Moreover our architecture proposal provides high-performance for the non hard real-time tasks.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"97 1","pages":"57-68"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80836836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 290
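Analyzability here boils down to bounding interference: if every access to a shared resource can be delayed by other cores for at most a known number of cycles, a safe WCET bound follows by simple arithmetic. A toy bound under assumed parameters (illustrative, not the paper's model):

```cpp
#include <cstdio>

int main() {
    // Illustrative parameters, not from the paper:
    const long base_wcet_cycles = 1'000'000;  // WCET measured in isolation
    const long shared_accesses  = 20'000;     // shared bus/memory accesses in the task
    const int  cores            = 4;
    const long max_slot_cycles  = 40;         // worst delay one rival access can add

    // Worst case: each access waits behind one access from every other core.
    long interference = shared_accesses * (cores - 1) * max_slot_cycles;
    std::printf("WCET bound: %ld cycles (isolation %ld + interference %ld)\n",
                base_wcet_cycles + interference, base_wcet_cycles, interference);
    return 0;
}
```

The hardware support the paper proposes exists precisely to make a constant like `max_slot_cycles` guaranteed by the design rather than merely observed.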
Internet-scale service infrastructure efficiency
Proceedings. International Symposium on Computer Architecture. Pub date: 2009-06-15. DOI: 10.1145/1555754.1555756
James R. Hamilton
{"title":"Internet-scale service infrastructure efficiency","authors":"James R. Hamilton","doi":"10.1145/1555754.1555756","DOIUrl":"https://doi.org/10.1145/1555754.1555756","url":null,"abstract":"High-scale cloud services provide economies of scale of five to ten over small-scale deployments, and are becoming a large part of both enterprise information processing and consumer services. Even very large enterprise IT deployments have quite different cost drivers and optimizations points from internet-scale services. The former are people-dominated from a cost perspective whereas internet-scale service costs are driven by server hardware and infrastructure with people costs fading into the noise at less than 10%.\u0000 In this talk we inventory where the infrastructure costs are in internet-scale services. We track power distribution from 115KV at the property line through all conversions into the data center tracking the losses to final delivery at semiconductor voltage levels. We track cooling and all the energy conversions from power dissipation through release to the environment outside of the building. Understanding where the costs and inefficiencies lie, we'll look more closely at cooling and overall mechanical system design, server hardware design, and software techniques including graceful degradation mode, power yield management, and resource consumption shaping.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"33 1","pages":"232"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75199000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 62
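The power-delivery accounting the talk walks through is multiplicative: overall delivery efficiency is the product of every conversion stage from the 115 kV feed down to the chip. A sketch with hypothetical stage efficiencies (illustrative numbers, not the talk's data):

```cpp
#include <cstdio>

int main() {
    // Hypothetical efficiency of each conversion stage, utility feed to silicon:
    const double stages[] = {
        0.995,  // 115 kV -> medium-voltage transformer
        0.98,   // medium voltage -> 480 V transformer
        0.90,   // UPS (double conversion)
        0.98,   // PDU distribution to the rack
        0.75,   // server power supply + on-board voltage regulators
    };
    double overall = 1.0;
    for (double e : stages) overall *= e;
    std::printf("delivered to silicon: %.1f%% of utility power\n",
                overall * 100.0);  // roughly 64% with these assumptions
    return 0;
}
```

Because the losses compound, the worst single stage (here the server power supply chain) dominates, which is why the talk focuses attention there and on cooling rather than on the upstream transformers.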
A memory system design framework: creating smart memories
Proceedings. International Symposium on Computer Architecture. Pub date: 2009-06-15. DOI: 10.1145/1555754.1555805
A. Firoozshahian, A. Solomatnikov, Ofer Shacham, Zain Asgar, S. Richardson, C. Kozyrakis, M. Horowitz
{"title":"A memory system design framework: creating smart memories","authors":"A. Firoozshahian, A. Solomatnikov, Ofer Shacham, Zain Asgar, S. Richardson, C. Kozyrakis, M. Horowitz","doi":"10.1145/1555754.1555805","DOIUrl":"https://doi.org/10.1145/1555754.1555805","url":null,"abstract":"As CPU cores become building blocks, we see a great expansion in the types of on-chip memory systems proposed for CMPs. Unfortunately, designing the cache and protocol controllers to support these memory systems is complex, and their concurrency and latency characteristics significantly affect the performance of any CMP. To address this problem, this paper presents a microarchitecture framework for cache and protocol controllers, which can aid in generating the RTL for new memory systems. The framework consists of three pipelined engines' request-tracking, state-manipulation, and data movement' which are programmed to implement a higher-level memory model. This approach simplifies the design and verification of CMP systems by decomposing the memory model into sequences of state and data manipulations. Moreover, implementing the framework itself produces a polymorphic memory system.\u0000 To validate the approach, we implemented a scalable, flexible CMP in silicon. The memory system was then programmed to support three disparate memory models' cache coherent shared memory, streams and transactional memory. Measured overheads of this approach seem promising. Our system generates controllers with performance overheads of less than 20% compared to an ideal controller with zero internal latency. Even the overhead of directly implementing a fully programmable controller was modest. While it did double the controller's area, the amortized effective area in the system grew by roughly 7%.","PeriodicalId":91388,"journal":{"name":"Proceedings. International Symposium on Computer Architecture","volume":"65 1","pages":"406-417"},"PeriodicalIF":0.0,"publicationDate":"2009-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78357433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 23
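The framework's premise, that a memory model decomposes into sequences of state and data manipulations, is the same premise behind table-driven protocol controllers. A minimal sketch of that style (a textbook MSI fragment, not the paper's engines or tables):

```cpp
#include <cstdio>

// Table-driven protocol controller fragment: the memory model is expressed as
// (state, event) -> (next state, data action), the kind of mapping the paper's
// state-manipulation and data-movement engines are programmed with.
enum State { I, S, M };
enum Event { LOAD, STORE, REMOTE_READ, REMOTE_WRITE };

struct Transition { State next; const char* data_action; };

Transition protocol[3][4] = {
    /* I */ {{S, "fetch shared"}, {M, "fetch exclusive"}, {I, "none"},      {I, "none"}},
    /* S */ {{S, "none"},         {M, "upgrade"},         {S, "none"},      {I, "invalidate"}},
    /* M */ {{M, "none"},         {M, "none"},            {S, "writeback"}, {I, "writeback+invalidate"}},
};

int main() {
    State line = I;
    Event trace[] = {LOAD, STORE, REMOTE_READ, REMOTE_WRITE};
    for (Event e : trace) {
        Transition t = protocol[line][e];
        std::printf("state %d, event %d -> state %d, action: %s\n",
                    line, e, t.next, t.data_action);
        line = t.next;   // apply the state manipulation
    }
    return 0;
}
```

Swapping in a different table (or a different sequence of data actions) is what lets one physical controller present coherent shared memory, streaming, or transactional semantics, which is the polymorphism the abstract claims.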