2007 IEEE 13th International Symposium on High Performance Computer Architecture最新文献_第3页

A Burst Scheduling Access Reordering Mechanism 突发调度访问重排序机制

2007 IEEE 13th International Symposium on High Performance Computer Architecture Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346206

Jun Shao, B. Davis

{"title":"A Burst Scheduling Access Reordering Mechanism","authors":"Jun Shao, B. Davis","doi":"10.1109/HPCA.2007.346206","DOIUrl":"https://doi.org/10.1109/HPCA.2007.346206","url":null,"abstract":"Utilizing the nonuniform latencies of SDRAM devices, access reordering mechanisms alter the sequence of main memory access streams to reduce the observed access latency. Using a revised M5 simulator with an accurate SDRAM module, the burst scheduling access reordering mechanism is proposed and compared to conventional in order memory scheduling as well as existing academic and industrial access reordering mechanisms. With burst scheduling, memory accesses to the same rows of the same banks are clustered into bursts to maximize bus utilization of the SDRAM device. Subject to a static threshold, memory reads are allowed to preempt ongoing writes for reduced read latency, while qualified writes are piggybacked at the end of bursts to exploit row locality in writes and prevent write queue saturation. Performance improvements contributed by read preemption and write piggybacking are identified. Simulation results show that burst scheduling reduces the average execution time of selected SPEC CPU2000 benchmarks by 21% over conventional bank in order memory scheduling. Burst scheduling also outperforms Intel's patented out of order memory scheduling and the row hit access reordering mechanism by 11% and 6% respectively","PeriodicalId":177324,"journal":{"name":"2007 IEEE 13th International Symposium on High Performance Computer Architecture","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126691079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 120

Interconnect-Centric Computing Interconnect-Centric计算

2007 IEEE 13th International Symposium on High Performance Computer Architecture Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346179

W. Dally

引用次数: 6

An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors 芯片多处理器的自适应共享/私有NUCA缓存分区方案

2007 IEEE 13th International Symposium on High Performance Computer Architecture Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346180

H. Dybdahl, P. Stenström

引用次数: 140

A Scalable, Non-blocking Approach to Transactional Memory 事务性内存的可伸缩、非阻塞方法

2007 IEEE 13th International Symposium on High Performance Computer Architecture Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346189

Hassan Chafi, J. Casper, Brian D. Carlstrom, Austen McDonald, C. Minh, Woongki Baek, C. Kozyrakis, K. Olukotun

{"title":"A Scalable, Non-blocking Approach to Transactional Memory","authors":"Hassan Chafi, J. Casper, Brian D. Carlstrom, Austen McDonald, C. Minh, Woongki Baek, C. Kozyrakis, K. Olukotun","doi":"10.1109/HPCA.2007.346189","DOIUrl":"https://doi.org/10.1109/HPCA.2007.346189","url":null,"abstract":"Transactional memory (TM) provides mechanisms that promise to simplify parallel programming by eliminating the need for locks and their associated problems (deadlock, livelock, priority inversion, convoying). For TM to be adopted in the long term, not only does it need to deliver on these promises, but it needs to scale to a high number of processors. To date, proposals for scalable TM have relegated livelock issues to user-level contention managers. This paper presents the first scalable TM implementation for directory-based distributed shared memory systems that is livelock free without the need for user-level intervention. The design is a scalable implementation of optimistic concurrency control that supports parallel commits with a two-phase commit protocol, uses write-back caches, and filters coherence messages. The scalable design is based on transactional coherence and consistency (TCC), which supports continuous transactions and fault isolation. A performance evaluation of the design using both scientific and enterprise benchmarks demonstrates that the directory-based TCC design scales efficiently for NUMA systems up to 64 processors","PeriodicalId":177324,"journal":{"name":"2007 IEEE 13th International Symposium on High Performance Computer Architecture","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116096635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 149

Implications of Device Timing Variability on Full Chip Timing 器件时序可变性对全芯片时序的影响

2007 IEEE 13th International Symposium on High Performance Computer Architecture Pub Date : 2007-02-10 DOI: 10.1145/1353629.1353644

M. Annavaram, Edward T. Grochowski, P. Reed

引用次数: 10

Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping 液态SIMD:使用轻量级动态映射抽象SIMD硬件

2007 IEEE 13th International Symposium on High Performance Computer Architecture Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346199

Nathan Clark, Amir Hormati, S. Yehia, S. Mahlke, K. Flautner

{"title":"Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping","authors":"Nathan Clark, Amir Hormati, S. Yehia, S. Mahlke, K. Flautner","doi":"10.1109/HPCA.2007.346199","DOIUrl":"https://doi.org/10.1109/HPCA.2007.346199","url":null,"abstract":"Microprocessor designers commonly utilize SIMD accelerators and their associated instruction set extensions to provide substantial performance gains at a relatively low cost for media applications. One of the most difficult problems with using SIMD accelerators is forward migration to newer generations. With larger hardware budgets and more demands for performance, SIMD accelerators evolve with both larger data widths and increased functionality with each new generation. However, this causes difficult problems in terms of binary compatibility, software migration costs, and expensive redesign of the instruction set architecture. In this work, we propose Liquid SIMD to decouple the instruction set architecture from the SIMD accelerator. SIMD instructions are expressed using a processor's baseline scalar instruction set, and light-weight dynamic translation maps the representation onto a broad family of SIMD accelerators. Liquid SIMD effectively bypasses the problems inherent to instruction set modification and binary compatibility across accelerator generations. We provide a detailed description of changes to a compilation framework and processor pipeline needed to support this abstraction. Additionally, we show that the hardware overhead of dynamic optimization is modest, hardware changes do not affect cycle time of the processor, and the performance impact of abstracting the SIMD accelerator is negligible. We conclude that using dynamic techniques to map instructions onto SIMD accelerators is an effective way to improve computation efficiency, without the overhead associated with modifying the instruction set","PeriodicalId":177324,"journal":{"name":"2007 IEEE 13th International Symposium on High Performance Computer Architecture","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126194875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 49

MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging MemTracker:高效和可编程的内存访问监控和调试支持

2007 IEEE 13th International Symposium on High Performance Computer Architecture Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346205

Guru Venkataramani, Brandyn Roemer, Yan Solihin, Milos Prvulović

{"title":"MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging","authors":"Guru Venkataramani, Brandyn Roemer, Yan Solihin, Milos Prvulović","doi":"10.1109/HPCA.2007.346205","DOIUrl":"https://doi.org/10.1109/HPCA.2007.346205","url":null,"abstract":"Memory bugs are a broad class of bugs that is becoming increasingly common with increasing software complexity, and many of these bugs are also security vulnerabilities. Unfortunately, existing software and even hardware approaches for finding and identifying memory bugs have considerable performance overheads, target only a narrow class of bugs, are costly to implement, or use computational resources inefficiently. This paper describes MemTracker, a new hardware support mechanism that can be configured to perform different kinds of memory access monitoring tasks. MemTracker associates each word of data in memory with a few bits of state, and uses a programmable state transition table to react to different events that can affect this state. The number of state bits per word, the events to which MemTracker reacts, and the transition table are all fully programmable. MemTracker's rich set of states, events, and transitions can be used to implement different monitoring and debugging checkers with minimal performance overheads, even when frequent state updates are needed. To evaluate MemTracker, we map three different checkers onto it, as well as a checker that combines all three. For the most demanding (combined) checker, we observe performance overheads of only 2.7% on average and 4.8% worst-case on SPEC 2000 applications. Such low overheads allow continuous (always-on) use of MemTracker-enabled checkers even in production runs","PeriodicalId":177324,"journal":{"name":"2007 IEEE 13th International Symposium on High Performance Computer Architecture","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116362172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 129

Application-Level Correctness and its Impact on Fault Tolerance 应用层正确性及其对容错的影响

2007 IEEE 13th International Symposium on High Performance Computer Architecture Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346196

Xuanhua Li, D. Yeung

{"title":"Application-Level Correctness and its Impact on Fault Tolerance","authors":"Xuanhua Li, D. Yeung","doi":"10.1109/HPCA.2007.346196","DOIUrl":"https://doi.org/10.1109/HPCA.2007.346196","url":null,"abstract":"Traditionally, fault tolerance researchers have required architectural state to be numerically perfect for program execution to be correct. However, in many programs, even if execution is not 100% numerically correct, the program can still appear to execute correctly from the user's perspective. Hence, whether a fault is unacceptable or benign may depend on the level of abstraction at which correctness is evaluated, with more faults being benign at higher levels of abstraction, i.e. at the user or application level, compared to lower levels of abstraction, i.e. at the architecture level. The extent to which programs are more fault resilient at higher levels of abstraction is application dependent. Programs that produce inexact and/or approximate outputs can be very resilient at the application level. We call such programs soft computations, and we find they are common in multimedia workloads, as well as artificial intelligence (AI) workloads. Programs that compute exact numerical outputs offer less error resilience at the application level. However, we find all programs studied in this paper exhibit some enhanced fault resilience at the application level, including those that are traditionally considered exact computations - e.g., SPECInt CPU2000. This paper investigates definitions of program correctness that view correctness from the application's standpoint rather than the architecture's standpoint. Under application-level correctness, a program's execution is deemed correct as long as the result it produces is acceptable to the user. To quantify user satisfaction, we rely on application-level fidelity metrics that capture user-perceived program solution quality. We conduct a detailed fault susceptibility study that measures how much more fault resilient programs are when defining correctness at the application level compared to the architecture level. Our results show for 6 multimedia and AI benchmarks that 45.8% of architecturally incorrect faults are correct at the application level. For 3 SPECInt CPU2000 benchmarks, 17.6% of architecturally incorrect faults are correct at the application level. We also present a lightweight fault recovery mechanism that exploits the relaxed requirements on numerical integrity provided by application-level correctness to reduce checkpoint cost. Our lightweight fault recovery mechanism successfully recovers 66.3% of program crashes in our multimedia and AI workloads, while incurring minimum runtime overhead","PeriodicalId":177324,"journal":{"name":"2007 IEEE 13th International Symposium on High Performance Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129799589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 173

A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures CMP架构中一种低开销容错一致性协议

2007 IEEE 13th International Symposium on High Performance Computer Architecture Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346194

Ricardo Fernández Pascual, José M. García, M. Acacio, J. Duato

{"title":"A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures","authors":"Ricardo Fernández Pascual, José M. García, M. Acacio, J. Duato","doi":"10.1109/HPCA.2007.346194","DOIUrl":"https://doi.org/10.1109/HPCA.2007.346194","url":null,"abstract":"It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On the other hand, chip-multiprocessors (CMP) that integrate several processor cores in a single chip are nowadays the best alternative to more efficient use of the increasing number of transistors that can be placed in a single die. Hence, it is necessary to design new techniques to deal with these faults to be able to build sufficiently reliable chip multiprocessors (CMPs). In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using GEMS full system simulator, we compare our proposal against a similar protocol without fault tolerance (TOKENCMP). We show that in absence of failures our proposal does not introduce overhead in terms of increased execution time over TOKENCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time more than 15%","PeriodicalId":177324,"journal":{"name":"2007 IEEE 13th International Symposium on High Performance Computer Architecture","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125620632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 33

Illustrative Design Space Studies with Microarchitectural Regression Models 用微建筑回归模型研究说明性设计空间

2007 IEEE 13th International Symposium on High Performance Computer Architecture Pub Date : 2007-02-10 DOI: 10.1109/HPCA.2007.346211

Benjamin C. Lee, D. Brooks

{"title":"Illustrative Design Space Studies with Microarchitectural Regression Models","authors":"Benjamin C. Lee, D. Brooks","doi":"10.1109/HPCA.2007.346211","DOIUrl":"https://doi.org/10.1109/HPCA.2007.346211","url":null,"abstract":"We apply a scalable approach for practical, comprehensive design space evaluation and optimization. This approach combines design space sampling and statistical inference to identify trends from a sparse simulation of the space. The computational efficiency of sampling and inference enables new capabilities in design space exploration. We illustrate these capabilities using performance and power models for three studies of a 260,000 point design space: (1) Pareto frontier analysis, (2) pipeline depth analysis, and (3) multiprocessor heterogeneity analysis. For each study, we provide an assessment of predictive error and sensitivity of observed trends to such error. We construct Pareto frontiers and find predictions for Pareto optima are no less accurate than those for the broader design space. We reproduce and enhance prior pipeline depth studies, demonstrating constrained sensitivity studies may not generalize when many other design parameters are held at constant values. Lastly, we identify efficient heterogeneous core designs by clustering per benchmark optimal architectures. Collectively, these studies motivate the application of techniques in statistical inference for more effective use of modern simulator infrastructure","PeriodicalId":177324,"journal":{"name":"2007 IEEE 13th International Symposium on High Performance Computer Architecture","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133306397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 138